udapi.block.ud.ca.addmwt module

Block ud.ca.AddMwt for heuristic detection of Catalan contractions.

According to the UD guidelines, contractions such as “del” = “de el” should be annotated using multi-word tokens.
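To make the multi-word token annotation concrete, here is a minimal sketch that formats the “del” example as CoNLL-U lines: a range line for the surface token followed by one line per syntactic word. The helper name `mwt_conllu_lines` is hypothetical and not part of Udapi, and every column after FORM is left as `_` for brevity.

```python
def mwt_conllu_lines(first_id, surface, words):
    """Return CoNLL-U lines for one multi-word token and its syntactic words.

    Hypothetical helper for illustration only; it is not part of Udapi.
    """
    last_id = first_id + len(words) - 1
    blank = ["_"] * 8  # LEMMA through MISC are left unannotated in this sketch
    # Range line: ID is "first-last", FORM is the surface contraction.
    lines = ["\t".join([f"{first_id}-{last_id}", surface] + blank)]
    # One line per syntactic word, numbered consecutively.
    for offset, form in enumerate(words):
        lines.append("\t".join([str(first_id + offset), form] + blank))
    return lines


lines = mwt_conllu_lines(1, "del", ["de", "el"])
print("\n".join(lines))
```

The range line carries only the ID span and the surface form; the linguistic annotation belongs on the word lines for “de” and “el”.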

Note that this block should be used only for converting legacy CoNLL-U files. Ideally, the tokenizer should already have split the MWTs.

class udapi.block.ud.ca.addmwt.AddMwt(verbpron=False, **kwargs)

Bases: udapi.block.ud.addmwt.AddMwt

Detect and mark MWTs (split them into words and add the words to the tree).

fix_personal_pronoun(node)

multiword_analysis(node)

Return a dict with MWT info or None if node does not represent a multiword token.
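The dict-or-None contract can be pictured as a table lookup. The following is only a sketch under stated assumptions: it works on a plain string rather than a Udapi node, and the table `CONTRACTIONS`, the function `analyze_form`, and the `"form"` key are hypothetical names chosen for illustration. The entries and keys used by the real block may differ.

```python
# Hypothetical lookup table of common Catalan preposition + article contractions.
# The real block's entries and dict keys may differ.
CONTRACTIONS = {
    "del": "de el",
    "dels": "de els",
    "al": "a el",
    "als": "a els",
    "pel": "per el",
    "pels": "per els",
}


def analyze_form(form):
    """Return a dict describing the split, or None if form is not a known contraction."""
    words = CONTRACTIONS.get(form.lower())
    if words is None:
        return None
    return {"form": words}


print(analyze_form("Del"))
print(analyze_form("casa"))
```

A purely form-based lookup like this one is a heuristic: it cannot tell a contraction from a homographic word, which is one reason such a block suits legacy conversion rather than replacing proper tokenization.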

report_suspicious_lemmas(node)