udapi.block.ud.id.fixgsd module

Block to fix annotation of UD Indonesian-GSD.

class udapi.block.ud.id.fixgsd.FixGSD(zones='all', if_empty_tree='process', **kwargs)[source]

Bases: Block

fix_ordinal_numerals(node)[source]

Ordinal numerals should be ADJ NumType=Ord in UD. They have many different UPOS tags in Indonesian GSD. This method harmonizes them.

pertama = first
kedua = second
ketiga = third
keempat = fourth
kelima = fifth
keenam = sixth
ketujuh = seventh
kedelapan = eighth
kesembilan = ninth
ke-48 = 48th

However, the ke- forms (i.e., all of the above except ‘pertama’) can also function as total versions of cardinal numbers (‘both’, ‘all three’ etc.). If the numeral precedes the noun, it is a total cardinal; if it follows the noun, it is an ordinal. An exception is when the modified noun is ‘kali’ = ‘time’. Then the numeral is ordinal regardless of where it occurs, and together with ‘kali’ it functions as an adverbial ordinal (‘for the second time’).
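The position rule above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the block's actual code: the function name, its arguments, and the returned labels are hypothetical, and the caller is assumed to have already established that the word is a numeral.

```python
def classify_ke_numeral(form, position, noun_form, noun_position):
    """Classify a numeral by the rule described above.

    Hypothetical helper: positions are word indices in the sentence,
    and noun_form / noun_position describe the noun the numeral modifies.
    """
    lower = form.lower()
    if lower == "pertama":
        # 'pertama' has no ke- prefix and is always an ordinal.
        return "ordinal"
    if not lower.startswith("ke"):
        return "other"
    if noun_form.lower() == "kali":
        # With 'kali' the numeral is ordinal regardless of word order.
        return "adverbial-ordinal"
    # Before the noun: total cardinal ('both'); after the noun: ordinal.
    return "total-cardinal" if position < noun_position else "ordinal"
```

For instance, under these assumptions ‘kedua’ placed before its noun is labeled a total cardinal, and the same form placed after the noun is labeled an ordinal.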

fix_plural_propn(node)[source]

It is unlikely that a proper noun will have a plural form in Indonesian. All examples observed in GSD should actually be tagged as common nouns.
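A minimal sketch of the retagging, written against a plain dictionary rather than a udapi node. It assumes that plural proper nouns are identified by the feature Number=Plur; the documentation above does not say how the block detects them, so that condition is an assumption.

```python
def retag_plural_propn(token):
    """Retag a plural proper noun as a common noun.

    `token` is a hypothetical dict with keys 'upos' and 'feats';
    the Number=Plur test is an assumed detection criterion.
    """
    if token["upos"] == "PROPN" and token["feats"].get("Number") == "Plur":
        token["upos"] = "NOUN"
    return token
```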

fix_satu_satunya(node)[source]

‘satu’ = ‘one’ (NUM)
‘satu-satunya’ = ‘the only’

fix_semua(node)[source]

Indonesian “semua” means “everything, all”. In the original annotation it was tagged DET, PRON, or ADV. Ika: I usually labeled “semua” as DET only if it’s followed by a NOUN/PROPN. If it’s followed by DET (including ‘-nya’ as DET) or it’s not followed by any NOUN/DET, I labeled it as PRON.
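The labeling rule quoted above reduces to a check on the UPOS of the following word. The helper below is a hypothetical sketch of that decision only; it takes the next word's UPOS as a string (or None at the end of the sentence) and does not reproduce how the block itself inspects the tree.

```python
def semua_upos(next_upos):
    """Choose the UPOS for 'semua' from the tag of the following word.

    Hypothetical helper: `next_upos` is None when nothing follows.
    """
    # DET only when a NOUN or PROPN follows; PRON otherwise,
    # including when the next word is itself a DET such as '-nya'.
    return "DET" if next_upos in {"NOUN", "PROPN"} else "PRON"
```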

fix_upos_based_on_morphind(node)[source]

Example from data: “kesamaan”, where the correct UPOS is NOUN, as suggested by MorphInd. Based on my observation so far, if the original GSD and MorphInd disagree on the UPOS, it’s better to trust MorphInd. I found many incorrect UPOS tags in GSD, especially where NOUNs became VERBs and VERBs became NOUNs. I suggest adding Voice=Pass when the script decides that a ke-xxx-an word is a VERB.

lemmatize_from_morphind(node)[source]

merge_reduplication(node)[source]

Reduplication is a common morphological device in Indonesian. Reduplicated nouns signal plural but some reduplications also encode emphasis, modification of meaning etc. In the previous annotation of GSD, reduplication was mostly analyzed as three tokens, e.g., for plurals, the second copy would be attached to the first one as compound:plur, and the hyphen would be attached to the second copy as punct. We want to analyze reduplication as a single token. Fix it.
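The surface side of this merge can be sketched on a plain list of word forms. This is an illustration under simplifying assumptions, not the block's implementation: it only recognizes full reduplication where both copies are identical, and it ignores the dependency tree, which the real block presumably also has to repair (the compound:plur and punct attachments mentioned above).

```python
def merge_reduplication(forms):
    """Merge 'X', '-', 'X' token triples into a single 'X-X' token.

    Hypothetical helper operating on word forms only; partial or
    modified reduplication is deliberately left untouched.
    """
    merged = []
    i = 0
    while i < len(forms):
        if (i + 2 < len(forms)
                and forms[i + 1] == "-"
                and forms[i].isalpha()
                and forms[i].lower() == forms[i + 2].lower()):
            merged.append(forms[i] + "-" + forms[i + 2])
            i += 3  # skip the hyphen and the second copy
        else:
            merged.append(forms[i])
            i += 1
    return merged
```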

process_node(node)[source]

Process a UD node.

rejoin_decades(node)[source]

In Indonesian, the equivalent of English “1990s” is written as “1990-an”. In GSD, it is often tokenized as multiple tokens, which is wrong. Fix it.
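As a rough sketch of the rejoining, the function below works on a list of word forms and glues together any two or three adjacent tokens whose concatenation looks like a decade. The exact split found in GSD is not specified above, so handling both "1990", "-", "an" and "1990", "-an" is an assumption, as are the function name and the regular expression.

```python
import re

# Assumed surface pattern of a decade expression: digits, hyphen, 'an'.
DECADE = re.compile(r"\d+-an")

def rejoin_decades(forms):
    """Rejoin decade expressions that were split into several tokens."""
    out = []
    i = 0
    while i < len(forms):
        for width in (3, 2):
            window = forms[i:i + width]
            if len(window) == width and DECADE.fullmatch("".join(window)):
                out.append("".join(window))
                i += width
                break
        else:
            # No split decade starts here; keep the token unchanged.
            out.append(forms[i])
            i += 1
    return out
```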

rejoin_ordinal_numerals(node)[source]

If an ordinal numeral is spelled using digits (‘ke-18’), it is often tokenized as multiple tokens, which is wrong. Fix it.
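A comparable sketch for digit-based ordinals, again on plain word forms. It assumes the split is into exactly three tokens (‘ke’, ‘-’, digits); the function name is hypothetical, and the tree repair that the real block would also need is not shown.

```python
def rejoin_ordinal_numerals(forms):
    """Rejoin 'ke', '-', '<digits>' into a single token such as 'ke-18'."""
    out = []
    i = 0
    while i < len(forms):
        window = forms[i:i + 3]
        if (len(window) == 3
                and window[0].lower() == "ke"
                and window[1] == "-"
                and window[2].isdigit()):
            out.append("".join(window))
            i += 3
        else:
            out.append(forms[i])
            i += 1
    return out
```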