udapi.block.ud.id.fixgsd module
Block to fix annotation of UD Indonesian-GSD.
- class udapi.block.ud.id.fixgsd.FixGSD(zones='all', if_empty_tree='process', **kwargs)
Bases: Block
- fix_ordinal_numerals(node)
Ordinal numerals should be tagged ADJ with NumType=Ord in UD. In Indonesian GSD they carry many different UPOS tags; this method harmonizes them. Examples:
  pertama = first
  kedua = second
  ketiga = third
  keempat = fourth
  kelima = fifth
  keenam = sixth
  ketujuh = seventh
  kedelapan = eighth
  kesembilan = ninth
  ke-48 = 48th
However, the ke- forms (i.e., all of the above except ‘pertama’) can also function as total versions of cardinal numbers (‘both’, ‘all three’, etc.). If the numeral precedes the noun, it is a total cardinal; if it follows the noun, it is an ordinal. An exception is when the modified noun is ‘kali’ = ‘time’: then the numeral is ordinal regardless of where it occurs, and together with ‘kali’ it functions as an adverbial ordinal (‘for the second time’).
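The positional rule above can be sketched on plain values as follows. This is only an illustration of the decision logic, not the block's actual code: the function name `classify_ke_numeral` and its parameters are hypothetical, and the result is a plain label rather than a set of UD features.

```python
def classify_ke_numeral(form, numeral_index, noun_index, noun_lemma):
    """Classify an Indonesian numeral as 'ordinal' or 'total'.

    Hypothetical helper: the indices are word positions in the sentence
    and noun_lemma is the lemma of the noun the numeral modifies.
    """
    lowered = form.lower()
    # 'pertama' (first) has no total-cardinal reading.
    if lowered == "pertama":
        return "ordinal"
    if not lowered.startswith("ke"):
        raise ValueError(f"not a ke- numeral: {form!r}")
    # With 'kali' (time) the numeral is ordinal regardless of position.
    if noun_lemma.lower() == "kali":
        return "ordinal"
    # Before the noun: total cardinal ('both', 'all three').
    # After the noun: ordinal ('second', 'third').
    return "total" if numeral_index < noun_index else "ordinal"
```

For example, `classify_ke_numeral("kedua", 1, 2, "orang")` takes the total-cardinal branch because the numeral precedes the noun, while swapping the two positions takes the ordinal branch.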
- fix_plural_propn(node)
It is unlikely that a proper noun will have a plural form in Indonesian. All examples observed in GSD should actually be tagged as common nouns.
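A minimal sketch of this retagging, assuming that plural forms are identified by a `Number=Plur` feature (the description does not say how the method detects them). The function name `retag_plural_propn` is hypothetical, and the features are passed as a plain dictionary.

```python
def retag_plural_propn(upos, feats):
    """Return the UPOS to use for a word, given its current UPOS and features.

    Assumption: plural is marked as Number=Plur in the features dict.
    """
    if upos == "PROPN" and feats.get("Number") == "Plur":
        # A plural "proper noun" is treated as a common noun.
        return "NOUN"
    return upos
```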
- fix_semua(node)
Indonesian “semua” means “everything, all”. Originally it was tagged DET, PRON, or ADV. Ika: I usually labeled “semua” as DET only if it is followed by a NOUN/PROPN. If it is followed by a DET (including ‘-nya’ as DET), or if it is not followed by any NOUN/DET, I labeled it as PRON.
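The labeling rule described above reduces to a check on the following word. The sketch below assumes that only the immediately following word is inspected; the function name `tag_semua` is hypothetical and the code is not taken from the block itself.

```python
def tag_semua(next_upos):
    """Return the UPOS for 'semua' given the UPOS of the following word.

    next_upos is None when 'semua' is the last word of the sentence.
    """
    # DET only when a noun or proper noun follows.
    if next_upos in ("NOUN", "PROPN"):
        return "DET"
    # Followed by a DET (including '-nya'), by anything else, or by nothing.
    return "PRON"
```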
- fix_upos_based_on_morphind(node)
Example from data: “kesamaan”; the correct UPOS is NOUN, as suggested by MorphInd. Based on my observation so far, if the original GSD and MorphInd disagree on the UPOS, it is better to trust MorphInd. I found many incorrect UPOS tags in GSD, especially NOUNs that were tagged as VERBs and VERBs that were tagged as NOUNs. I suggest adding Voice=Pass when the script decides that a ke-xxx-an form is a VERB.
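The preference for MorphInd can be sketched as below. Everything here is an assumption made for illustration: the function name `reconcile_upos`, the idea that MorphInd's analysis has already been mapped to a UPOS string, and the purely orthographic test for ke-xxx-an forms. How the block actually reads the MorphInd analysis is not described here.

```python
def reconcile_upos(form, gsd_upos, morphind_upos):
    """Return (upos, feats), preferring MorphInd's tag when the two disagree.

    morphind_upos may be None when no MorphInd analysis is available.
    """
    upos = gsd_upos
    if morphind_upos and morphind_upos != gsd_upos:
        upos = morphind_upos
    feats = {}
    lowered = form.lower()
    # Orthographic approximation of a ke-xxx-an circumfix (an assumption).
    is_ke_an = (len(lowered) > 4
                and lowered.startswith("ke")
                and lowered.endswith("an"))
    if upos == "VERB" and is_ke_an:
        # Suggested in the note above: ke-xxx-an verbs get Voice=Pass.
        feats["Voice"] = "Pass"
    return upos, feats
```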
- merge_reduplication(node)
Reduplication is a common morphological device in Indonesian. Reduplicated nouns signal plural, but some reduplications also encode emphasis, modification of meaning, etc. In the previous annotation of GSD, reduplication was mostly analyzed as three tokens; e.g., for plurals, the second copy would be attached to the first one as compound:plur, and the hyphen would be attached to the second copy as punct. We want to analyze reduplication as a single token, so this method merges the three tokens into one.
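The three-token pattern described above can be illustrated on a flat list of `(form, deprel)` pairs. This sketch covers only the plural case (`compound:plur`), ignores re-attachment of any dependents, and does not use udapi's node API; the function name `merge_plural_reduplication` is hypothetical.

```python
def merge_plural_reduplication(tokens):
    """Merge 'X', '-', 'X' triples into a single 'X-X' token.

    tokens: list of (form, deprel) tuples in sentence order.
    The merged token keeps the deprel of the first copy.
    """
    merged = []
    i = 0
    while i < len(tokens):
        is_triple = (
            i + 2 < len(tokens)
            and tokens[i + 1] == ("-", "punct")
            and tokens[i + 2][1] == "compound:plur"
        )
        if is_triple:
            first_form, first_deprel = tokens[i]
            second_form = tokens[i + 2][0]
            merged.append((first_form + "-" + second_form, first_deprel))
            i += 3  # skip the hyphen and the second copy
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Applied to `[("anak", "nsubj"), ("-", "punct"), ("anak", "compound:plur"), ("bermain", "root")]`, the three tokens of the reduplicated noun collapse into one `("anak-anak", "nsubj")` entry and the verb is left unchanged.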