udapi.block.ud.addmwt module

Abstract base class ud.AddMwt for heuristic detection of multiword tokens (MWTs).

class udapi.block.ud.addmwt.AddMwt(zones='all', if_empty_tree='process', **kwargs)

Bases: Block

Detect and mark MWTs (split them into words and add the words to the tree).

multiword_analysis(node)

Return a dict with MWT info or None if node does not represent a multiword token.

An example return value is:

    {
        'form':   'aby bych',
        'lemma':  'aby být',
        'upos':   'SCONJ AUX',
        'xpos':   'J,------------- Vc-S---1-------',
        'feats':  '_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin',  # _ means empty FEATS
        'deprel': '* aux',  # * means keep the original deprel
        'main': 0,  # which of the two words will inherit the original children (if any)
        'shape': 'siblings',  # the newly created nodes will be siblings or alternatively
        #'shape': 'subtree',  # the main-indexed node will be the head
    }
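The space-separated values in such a dict line up position by position with the words of the token. The sketch below is not udapi's implementation. It is a standalone illustration of how a dict of this shape could be decoded, assuming only the two placeholder conventions noted in the example (`_` for empty FEATS, `*` for keeping the original value). The helper name `split_analysis`, the `original` argument, and the original deprel `mark` are hypothetical.

```python
def split_analysis(analysis, original):
    """Expand a multiword_analysis-style dict into one dict per word.

    `analysis` holds space-separated values, one per word.
    `original` holds attributes of the unsplit token (hypothetical input)
    and is consulted wherever the analysis says '*'.
    """
    forms = analysis['form'].split()
    words = [{'form': form} for form in forms]
    for attr in ('lemma', 'upos', 'xpos', 'feats', 'deprel'):
        if attr not in analysis:
            continue  # attribute not specified, leave it out
        values = analysis[attr].split()
        if len(values) != len(forms):
            raise ValueError(
                f"{attr!r} has {len(values)} values for {len(forms)} words")
        for word, value in zip(words, values):
            # '*' means: keep whatever the original token had.
            word[attr] = original.get(attr, '_') if value == '*' else value
    return words


analysis = {
    'form': 'aby bych',
    'lemma': 'aby být',
    'upos': 'SCONJ AUX',
    'feats': '_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin',
    'deprel': '* aux',
    'main': 0,
    'shape': 'siblings',
}
# Assumed attributes of the unsplit token, for illustration only.
original = {'form': 'abych', 'deprel': 'mark'}

words = split_analysis(analysis, original)
for word in words:
    print(word)
```

Here the first word inherits the original deprel because of `*`, while the second word receives `aux`. The `main` and `shape` keys concern tree structure rather than per-word attributes, so this sketch ignores them.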

postprocess_mwt(mwt)

Optional postprocessing of newly created MWTs.

process_node(node)

Process a UD node: if multiword_analysis(node) identifies it as a multiword token, split it into words.
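Because AddMwt is an abstract base class, a concrete subclass has to supply multiword_analysis before any node can be split. One simple approach is a lookup table keyed by the lowercased surface form. The following is a minimal sketch under that assumption. To stay runnable without udapi installed, it uses a plain function and a SimpleNamespace stand-in for the node. The names `MWTS` and `lookup_analysis` are hypothetical. In a real block, the function body would live in a subclass's multiword_analysis(self, node).

```python
from types import SimpleNamespace

# Hypothetical lookup table: lowercased surface form -> analysis dict.
MWTS = {
    'abych': {
        'form': 'aby bych',
        'lemma': 'aby být',
        'upos': 'SCONJ AUX',
        'feats': '_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin',
        'deprel': '* aux',
        'main': 0,
        'shape': 'siblings',
    },
}


def lookup_analysis(node):
    """Return the MWT analysis for node.form, or None if it is not an MWT."""
    return MWTS.get(node.form.lower())


# Stand-ins for real nodes; only the `form` attribute is used here.
mwt_node = SimpleNamespace(form='Abych')
plain_node = SimpleNamespace(form='pes')

print(lookup_analysis(mwt_node)['form'])
print(lookup_analysis(plain_node))
```

Returning None for unknown forms matches the documented contract of multiword_analysis, so nodes that are not multiword tokens are left untouched.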