udapi.block.ud package¶

Subpackages¶

Submodules¶

udapi.block.ud.addmwt module¶

Abstract base class ud.AddMwt for heuristic detection of multi-word tokens.

class udapi.block.ud.addmwt.AddMwt(zones='all')[source]¶

Bases: udapi.core.block.Block

Detect and mark MWTs (split them into words and add the words to the tree).

multiword_analysis(node)[source]¶

Return a dict with MWT info or None if node does not represent a multiword token.

An example return value is:

    {
        'form': 'aby bych',
        'lemma': 'aby být',
        'upos': 'SCONJ AUX',
        'xpos': 'J,------------- Vc-S---1-------',
        'feats': '_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin',  # _ means empty FEATS
        'deprel': '* aux',  # * means keep the original deprel
        'main': 0,  # which of the two words will inherit the original children (if any)
        'shape': 'siblings',  # the newly created nodes will be siblings, or alternatively
        # 'shape': 'subtree',  # the main-indexed node will be the head
    }
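How the space-separated fields of such a dict map onto individual words can be sketched as follows. This is only an illustration that does not import Udapi; the helper name `split_mwt_analysis` and the `orig_deprel` argument are hypothetical, and the real block may expand the dict differently.

```python
def split_mwt_analysis(analysis, orig_deprel):
    """Expand the space-separated fields of an MWT dict into one dict per word.

    Hypothetical helper for illustration only, not part of Udapi.
    """
    forms = analysis["form"].split(" ")
    words = [{"form": form} for form in forms]
    for key in ("lemma", "upos", "xpos", "feats", "deprel"):
        values = analysis[key].split(" ")
        if len(values) != len(forms):
            raise ValueError(f"{key!r} must have one value per word")
        for word, value in zip(words, values):
            word[key] = value
    for word in words:
        # '*' means: keep the deprel of the original (unsplit) node.
        if word["deprel"] == "*":
            word["deprel"] = orig_deprel
    return words


example = {
    "form": "aby bych",
    "lemma": "aby být",
    "upos": "SCONJ AUX",
    "xpos": "J,------------- Vc-S---1-------",
    "feats": "_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin",
    "deprel": "* aux",
    "main": 0,
    "shape": "siblings",
}
words = split_mwt_analysis(example, orig_deprel="mark")
```

Here the first word inherits the (assumed) original deprel `mark`, while the second word gets `aux`.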

postprocess_mwt(mwt)[source]¶: Optional postprocessing of newly created MWTs.

process_node(node)[source]¶: Process a UD node

udapi.block.ud.complywithtext module¶

Block ComplyWithText for adapting the nodes to comply with the text.

Implementation design details: Usually, most of the inconsistencies between tree tokens and the raw text are simple to solve. However, there may also be rare cases when it is not clear how to align the tokens (nodes in the tree) with the raw text (stored in root.text). This block tries to solve the general case using several heuristics.

It starts by running an LCS-like algorithm (LCS = longest common subsequence), difflib.SequenceMatcher, on the raw text and the concatenation of the tokens' forms, i.e. on sequences of characters (as opposed to running LCS on sequences of tokens).

To prevent mis-alignment problems, we keep the spaces present in the raw text and we insert spaces into the concatenated forms (tree_chars) according to SpaceAfter=No. An example of a mis-alignment problem: text “énfase na necesidade” with 4 nodes “énfase en a necesidade” should be solved by adding multiword token “na” over the nodes “en” and “a”. However, running LCS (or difflib) over the character sequences “énfaseenanecesidade” “énfasenanecesidade” may result in énfase -> énfas.
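The effect of keeping the spaces can be explored with a minimal sketch that calls difflib.SequenceMatcher directly on the example above. The helper name `char_diffs` and the `(tag, tree_part, text_part)` tuple layout are assumptions made for illustration and need not match the block's internal representation.

```python
import difflib


def char_diffs(tree_chars, text):
    """Character-level diffs between concatenated forms and the raw text."""
    matcher = difflib.SequenceMatcher(None, tree_chars, text, autojunk=False)
    return [(tag, tree_chars[i1:i2], text[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


text = "énfase na necesidade"
forms = ["énfase", "en", "a", "necesidade"]

# Spaces kept on both sides (no SpaceAfter=No in this toy example).
with_spaces = char_diffs(" ".join(forms), text)

# Spaces dropped on both sides: token boundaries are no longer visible.
without_spaces = char_diffs("".join(forms), text.replace(" ", ""))

for diff in with_spaces:
    print(diff)
```

Printing `without_spaces` as well makes it easy to compare where the non-equal spans fall relative to the token boundaries in each variant.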

Author: Martin Popel

class udapi.block.ud.complywithtext.ComplyWithText(fix_text=True, prefer_mwt=True, allow_goeswith=True, max_mwt_length=4, **kwargs)[source]¶

Bases: udapi.core.block.Block

Adapt the nodes to comply with the text.

static allow_space(form)[source]¶: Is space allowed within this token form?

merge_diffs(orig_diffs, char_nodes)[source]¶

Make sure each diff starts on original token boundary.

If not, merge the diff with the previous diff. E.g. (equal, “5”, “5”), (replace, “-6”, “–7”) is changed into (replace, “5-6”, “5–7”)
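The merging rule can be sketched in a simplified form. Note the assumptions: the function below takes a set of character offsets at which tokens start (`token_starts`) instead of the `char_nodes` argument of the real method, and it represents each diff as a `(tag, tree_part, text_part)` tuple.

```python
def merge_diffs_sketch(diffs, token_starts):
    """Merge each diff that starts inside a token into the previous diff.

    Simplified, hypothetical variant of merge_diffs for illustration.
    """
    merged = []
    offset = 0  # position of the current diff within the tree characters
    for tag, tree_part, text_part in diffs:
        if merged and offset not in token_starts:
            # The diff starts in the middle of a token: glue it to its predecessor.
            _, prev_tree, prev_text = merged.pop()
            merged.append(("replace", prev_tree + tree_part, prev_text + text_part))
        else:
            merged.append((tag, tree_part, text_part))
        offset += len(tree_part)
    return merged


# The token "5-6" starts at offset 0, so the second diff starts mid-token.
result = merge_diffs_sketch([("equal", "5", "5"), ("replace", "-6", "–7")], {0})
```

With this input, `result` is the single diff `("replace", "5-6", "5–7")`, matching the example above.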

process_tree(root)[source]¶: Process a UD tree

solve_diff(nodes, form)[source]¶: Fix a given (minimal) tokens-vs-text inconsistency.

solve_diffs(diffs, tree_chars, char_nodes, text)[source]¶

static store_orig_form(node, new_form)[source]¶: Store the original form of this node into MISC, unless the change is common and expected.

unspace_diffs(orig_diffs, tree_chars, text)[source]¶

udapi.block.ud.convert1to2 module¶

Block Convert1to2 for converting UD v1 to UD v2.

See http://universaldependencies.org/v2/summary.html for the description of all UD v2 changes. IMPORTANT: this code does only SOME of the changes and the output should be checked.

Note that this block is not idempotent, i.e. you should not apply it twice on the same data. It should be idempotent when skipping the coordination transformations (skip=coord).

Author: Martin Popel, based on https://github.com/UniversalDependencies/tools/tree/master/v2-conversion by Sebastian Schuster.

class udapi.block.ud.convert1to2.Convert1to2(skip='', save_stats=True, **kwargs)[source]¶

Bases: udapi.core.block.Block

Block for converting UD v1 to UD v2.

HEAD_PROMOTION = {'advcl': 1, 'advmod': 5, 'ccomp': 2, 'csubj': 4, 'iobj': 7, 'nsubj': 9, 'obj': 8, 'obl': 6, 'xcomp': 3}¶

after_process_document(document)[source]¶: Print overall statistics of ToDo counts.

change_deprel_simple(node)[source]¶: mwe→fixed, dobj→obj, pass→:pass, name→flat, foreign→flat+Foreign=Yes.
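The relabeling listed above can be sketched as a pure function on deprel strings. This is an illustration only: the function name and the returned `(deprel, extra_feats)` pair are assumptions, and the real method works on Udapi nodes rather than strings.

```python
def change_deprel_simple_sketch(deprel):
    """Return (new_deprel, extra_feats) for a UD v1 deprel.

    Hypothetical stand-alone version of the v1-to-v2 relabeling rules.
    """
    simple = {"mwe": "fixed", "dobj": "obj", "name": "flat"}
    if deprel in simple:
        return simple[deprel], {}
    if deprel == "foreign":
        # foreign becomes flat, and the node is marked with Foreign=Yes.
        return "flat", {"Foreign": "Yes"}
    if deprel.endswith("pass") and ":" not in deprel:
        # nsubjpass -> nsubj:pass, auxpass -> aux:pass, csubjpass -> csubj:pass
        return deprel[:-len("pass")] + ":pass", {}
    return deprel, {}
```

For example, `change_deprel_simple_sketch("nsubjpass")` yields `("nsubj:pass", {})`, and deprels that are unchanged in UD v2 are returned as they are.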

change_feats(node)[source]¶

Negative→Polarity, Aspect=Pro→Prosp, VerbForm=Trans→Conv, Definite=Red→Cons,…

Also Foreign=Foreign→Yes, and log if Tense=Nar or NumType=Gen is used.

static change_headfinal(node, deprel)[source]¶: deprel=goeswith|flat|fixed|appos must be a head-initial flat structure.

change_neg(node)[source]¶

neg→advmod/det/ToDo + Polarity=Neg.

In addition, if there is a node with deprel=neg and upos=INTJ, it is checked whether it is possibly a real interjection or a negation particle, which should have upos=PART (as documented in http://universaldependencies.org/u/pos/PART.html). This kind of error (INTJ instead of PART for “не”) is common e.g. in Bulgarian v1.4, but I hope the rule is language independent (enough to be included here).

change_nmod(node)[source]¶: nmod→obl if parent is not nominal, but predicate.

static change_upos(node)[source]¶: CONJ→CCONJ.

static change_upos_copula(node)[source]¶: deprel=cop needs upos=AUX (or PRON).

fix_remnants_in_tree(root)[source]¶

Change ellipsis with remnant deprels to UDv2 ellipsis with orphans.

Remnant’s parent is always the correlate (same-role) node. Usually, correlate’s parent is the head of the whole ellipsis subtree, i.e. the first conjunct. However, sometimes remnants are deeper, e.g. ‘Over 300 Iraqis are reported dead and 500 wounded.’ with edges:

nsubjpass(reported, Iraqis)
nummod(Iraqis, 300)
remnant(300, 500)

Let’s expect all remnants in one tree are part of the same ellipsis structure.

TODO: theoretically, there may be more ellipsis structures with remnants in one tree, but I have no idea how to distinguish them from the deeper-remnants cases.

fix_text(root)[source]¶: Make sure root.text is filled and matching the forms+SpaceAfter=No.

static is_nominal(node)[source]¶

Returns ‘no’ (for predicates), ‘yes’ (sure nominals) or ‘maybe’.

Used in change_nmod.

static is_verbal(node)[source]¶

Returns True for verbs and nodes with copula child.

Used in change_neg.

log(node, short_msg, long_msg)[source]¶: Log node.address() + long_msg and add ToDo=short_msg to node.misc.

process_tree(tree)[source]¶

Apply all the changes on the current tree.

This method is automatically called on each tree by Udapi. After doing tree-scope changes (remnants), it calls process_node on each node. By overriding this method in subclasses you can reuse just some of the implemented changes.

reattach_coordinations(node)[source]¶: cc and punct in coordinations should depend on the immediately following conjunct.

udapi.block.ud.exgoogle2ud module¶

Block ud.ExGoogle2ud converts data that were originally annotated in Google style, then converted to UDv2 with an older version of ud.Google2ud, and then manually edited. We don't want to lose these edits, so we cannot simply rerun the newer version of ud.Google2ud on the original Google data.

class udapi.block.ud.exgoogle2ud.ExGoogle2ud(lang='unk', **kwargs)[source]¶

Bases: udapi.core.block.Block

Convert former Google Universal Dependency Treebank into UD style.

fix_node(node)[source]¶: Various fixes taken from ud.Google2ud.

static is_nominal(node)[source]¶

Returns ‘no’ (for predicates), ‘yes’ (sure nominals) or ‘maybe’.

Used in change_nmod.

process_tree(root)[source]¶: Process a UD tree

udapi.block.ud.fixchain module¶

Block ud.FixChain for making sure deprel=fixed|flat|goeswith|list does not form a chain.

class udapi.block.ud.fixchain.FixChain(deprels='fixed, flat, goeswith, list', **kwargs)[source]¶

Bases: udapi.core.block.Block

Make sure deprel=fixed etc. does not form a chain, but a flat structure.
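The difference between a chain and a flat structure can be shown with a small sketch on plain lists instead of Udapi nodes. The representation (1-based word indices, index 0 as an artificial root) and the function name are assumptions for illustration; the actual block operates on the tree directly.

```python
def flatten_chains(heads, deprels, chain_deprels=("fixed", "flat", "goeswith", "list")):
    """Rehang nodes so that the given deprels never form a chain.

    heads[i] is the parent index of word i; index 0 is the artificial root.
    Words are processed left to right, so this sketch assumes that in a chain
    each parent precedes its child (a head-initial chain).
    """
    new_heads = list(heads)
    for i in range(1, len(heads)):
        parent = new_heads[i]
        if deprels[i] in chain_deprels and parent != 0 and deprels[parent] == deprels[i]:
            # The parent is attached with the same deprel: attach to its parent instead.
            new_heads[i] = new_heads[parent]
    return new_heads


# "Hillary Rodham Clinton" annotated as a chain: Clinton -> Rodham -> Hillary.
heads = [0, 0, 1, 2]
deprels = [None, "root", "flat", "flat"]
flat_heads = flatten_chains(heads, deprels)
```

After the call, both "Rodham" and "Clinton" depend on "Hillary", i.e. `flat_heads` is `[0, 0, 1, 1]`.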

process_node(node)[source]¶: Process a UD node

udapi.block.ud.fixpunct module¶

Block ud.FixPunct for making sure punctuation is attached projectively.

Punctuation in Universal Dependencies has the tag PUNCT, dependency relation punct, and is always attached projectively, usually to the head of a neighboring subtree to its left or right. Punctuation normally does not have children. If it does, we will fix it first.

This block tries to re-attach punctuation projectively and according to the guidelines. It should help in cases where punctuation is attached randomly, always to the root or always to the neighboring word. However, there are limits to what it can do; for example it cannot always recognize whether a comma is introduced to separate the block to its left or to its right. Hence if the punctuation before running this block is almost good, the block may actually do more harm than good.

Since the punctuation should not have children, we should not create a non-projectivity if we check the root edges going to the right. However, it is still possible that we will attach the punctuation non-projectively by joining a non-projectivity that already exists. For example, the left neighbor (node i-1) may have its parent at i-3, and the node i-2 forms a gap (does not depend on i-3).

class udapi.block.ud.fixpunct.FixPunct(**kwargs)[source]¶

Bases: udapi.core.block.Block

Make sure punctuation nodes are attached projectively.

process_tree(root)[source]¶: Process a UD tree

udapi.block.ud.fixpunctchild module¶

Block ud.FixPunctChild for making sure punctuation nodes have no children.

class udapi.block.ud.fixpunctchild.FixPunctChild(zones='all')[source]¶

Bases: udapi.core.block.Block

Make sure punct nodes have no children by rehanging the children upwards.

process_node(node)[source]¶: Process a UD node

udapi.block.ud.fixrightheaded module¶

Block ud.FixRightheaded for making sure flat,fixed,appos,goeswith,list is head initial.

Note that deprel=conj should also be left-headed, but it is not included in this fix-block by default because coordinations are more difficult to convert and one should use a specialized block instead.

class udapi.block.ud.fixrightheaded.FixRightheaded(deprels='flat, fixed, appos, goeswith, list', **kwargs)[source]¶

Bases: udapi.core.block.Block

Make sure deprel=flat,fixed,… form a head-initial (i.e. left-headed) structure.

process_node(node)[source]¶: Process a UD node

udapi.block.ud.goeswithfromtext module¶

Block GoeswithFromText for splitting nodes and attaching via goeswith according to the text.

Usage: udapy -s ud.GoeswithFromText < in.conllu > fixed.conllu

Author: Martin Popel

class udapi.block.ud.goeswithfromtext.GoeswithFromText(keep_lemma=False, **kwargs)[source]¶

Bases: udapi.core.block.Block

Block for splitting nodes and attaching via goeswith according to the sentence text.

For example:

    # text = Never the less, I agree.
    1 Nevertheless nevertheless ADV _ _ 4 advmod _ SpaceAfter=No
    2 , , PUNCT _ _ 4 punct _ _
    3 I I PRON _ _ 4 nsubj _ _
    4 agree agree VERB _ _ 0 root _ SpaceAfter=No
    5 . . PUNCT _ _ 4 punct _ _

is changed to:

    # text = Never the less, I agree.
    1 Never never ADV _ _ 6 advmod _ _
    2 the the ADV _ _ 1 goeswith _ _
    3 less less ADV _ _ 1 goeswith _ SpaceAfter=No
    4 , , PUNCT _ _ 6 punct _ _
    5 I I PRON _ _ 6 nsubj _ _
    6 agree agree VERB _ _ 0 root _ SpaceAfter=No
    7 . . PUNCT _ _ 6 punct _ _

If used with parameter keep_lemma=1, the result is:

    # text = Never the less, I agree.
    1 Never nevertheless ADV _ _ 6 advmod _ _
    2 the _ ADV _ _ 1 goeswith _ _
    3 less _ ADV _ _ 1 goeswith _ SpaceAfter=No
    4 , , PUNCT _ _ 6 punct _ _
    5 I I PRON _ _ 6 nsubj _ _
    6 agree agree VERB _ _ 0 root _ SpaceAfter=No
    7 . . PUNCT _ _ 6 punct _ _
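The core idea, finding where the sentence text puts spaces inside a single word form, can be sketched without Udapi. The helper name `split_form_by_text` and its case-insensitive matching are assumptions for illustration, not a description of the block's actual code.

```python
def split_form_by_text(form, text, start=0):
    """Split `form` at the spaces that `text` contains inside it.

    Returns the list of pieces as spelled in the text, or None if the text
    (from position `start`) does not spell out the form.
    """
    pieces = []
    current = ""
    pos = start
    for ch in form:
        # A space in the text in the middle of the form marks a piece boundary.
        while pos < len(text) and text[pos] == " ":
            if current:
                pieces.append(current)
                current = ""
            pos += 1
        if pos >= len(text) or text[pos].lower() != ch.lower():
            return None
        current += text[pos]
        pos += 1
    pieces.append(current)
    return pieces


pieces = split_form_by_text("Nevertheless", "Never the less, I agree.")
```

For the example sentence this gives the three pieces "Never", "the" and "less"; in the block's output the second and third piece are then attached to the first one via goeswith.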

process_tree(root)[source]¶: Process a UD tree

udapi.block.ud.google2ud module¶

Block ud.Google2ud for converting Google Universal Dependency Treebank into UD.

Usage: udapy -s ud.Google2ud < google.conllu > ud2.conllu

class udapi.block.ud.google2ud.Google2ud(lang='unk', non_mwt_langs='ar en ja ko zh', **kwargs)[source]¶

Bases: udapi.block.ud.convert1to2.Convert1to2

Convert Google Universal Dependency Treebank into UD style.

fix_deprel(node)[source]¶

Convert Google dependency relations to UD deprels.

Change topology where needed.

static fix_feats(node)[source]¶: Remove language prefixes, capitalize names and values, apply FEATS_CHANGE.

fix_goeswith(node)[source]¶: Solve deprel=goeswith which is almost always wrong in the Google annotation.

fix_multiword_prep(node)[source]¶

Solve pobj/pcomp depending on pobj/pcomp.

Only some of these cases are multi-word prepositions (which should get deprel=fixed).

fix_upos(node)[source]¶: PRT→PART, .→PUNCT, NOUN+Proper→PROPN, VERB+neg→AUX etc.

process_tree(root)[source]¶

Apply all the changes on the current tree.

This method is automatically called on each tree by Udapi. After doing tree-scope changes (remnants), it calls process_node on each node. By overriding this method in subclasses you can reuse just some of the implemented changes.

udapi.block.ud.joinasmwt module¶

Block ud.JoinAsMwt for creating multi-word tokens if multiple neighboring words are not separated by a space and the boundaries between the word forms are alphabetical.
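The joining condition can be sketched on plain lists. The function name, the `no_space_after` flags (standing in for SpaceAfter=No) and the `(first, last)` span format are assumptions for illustration; the sketch also assumes that every form is non-empty.

```python
def mwt_spans(forms, no_space_after):
    """Return (first, last) index pairs of words that would be joined into an MWT.

    Two neighboring words are joined if there is no space between them and the
    characters on both sides of the boundary are alphabetical.
    """
    spans = []
    start = 0
    for i in range(len(forms) - 1):
        joined = (no_space_after[i]
                  and forms[i][-1].isalpha()
                  and forms[i + 1][0].isalpha())
        if not joined:
            if i > start:
                spans.append((start, i))
            start = i + 1
    if len(forms) - 1 > start:
        spans.append((start, len(forms) - 1))
    return spans


# Made-up example: "cannot go." tokenized as can + not + go + "."
spans = mwt_spans(["can", "not", "go", "."], [True, False, True, False])
```

Here "can" and "not" form one span, while "go" and "." are not joined because the boundary character "." is not alphabetical.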

class udapi.block.ud.joinasmwt.JoinAsMwt(revert_orig_form=True, **kwargs)[source]¶

Bases: udapi.core.block.Block

Create MWTs if words are not separated by a space.

process_node(node)[source]¶: Process a UD node

udapi.block.ud.markbugs module¶

Block MarkBugs for checking suspicious/wrong constructions in UD v2.

See http://universaldependencies.org/release_checklist.html#syntax and http://universaldependencies.org/svalidation.html. IMPORTANT: the svalidation.html overview is not generated by this code, but by SETS-search-interface rules, which may give different results than this code.

Usage: udapy -s ud.MarkBugs < in.conllu > marked.conllu 2> log.txt

Errors are both logged to stderr and marked within the nodes’ MISC field, e.g. node.misc[‘Bug’] = ‘aux-chain’, so the output conllu file can be searched for “Bug=” occurrences.
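Searching the output for "Bug=" can also be done programmatically. The following sketch counts the values of Bug attributes in the MISC column of CoNLL-U text; the function name and the sample rows are made up for illustration.

```python
from collections import Counter


def count_bugs(conllu_text):
    """Count Bug=... values found in the MISC (10th) column of CoNLL-U lines."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip empty lines and sentence-level comments
        columns = line.split("\t")
        if len(columns) != 10:
            continue
        for item in columns[9].split("|"):
            if item.startswith("Bug="):
                counts[item[len("Bug="):]] += 1
    return counts


# Made-up rows; only the MISC column matters for this sketch.
rows = [
    ["1", "He", "he", "PRON", "_", "_", "3", "nsubj", "_", "_"],
    ["2", "has", "have", "AUX", "_", "_", "3", "aux", "_", "Bug=aux-chain"],
    ["3", "been", "be", "AUX", "_", "_", "0", "root", "_", "SpaceAfter=No|Bug=aux-chain"],
]
sample = "# text = made-up example\n" + "\n".join("\t".join(row) for row in rows)
bug_counts = count_bugs(sample)
```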

Author: Martin Popel based on descriptions at http://universaldependencies.org/svalidation.html

class udapi.block.ud.markbugs.MarkBugs(save_stats=True, tests=None, skip=None, max_cop_lemmas=2, **kwargs)[source]¶

Bases: udapi.core.block.Block

Block for checking suspicious/wrong constructions in UD v2.

after_process_document(document)[source]¶: This method is called after each process_document.

log(node, short_msg, long_msg)[source]¶: Log node.address() + long_msg and add ToDo=short_msg to node.misc.

process_node(node)[source]¶: Process a UD node

udapi.block.ud.removemwt module¶

Block ud.RemoveMwt for removing multi-word tokens.

class udapi.block.ud.removemwt.RemoveMwt(zones='all')[source]¶

Bases: udapi.core.block.Block

Substitute MWTs with one word representing the whole MWT.

static guess_deprel(words)[source]¶: DEPREL of the whole MWT

static guess_feats(words)[source]¶: FEATS of the whole MWT

static guess_upos(words)[source]¶: UPOS of the whole MWT

process_tree(root)[source]¶: Process a UD tree

udapi.block.ud.setspaceafter module¶

Block SetSpaceAfter for heuristic setting of SpaceAfter=No.

Usage: udapy -s ud.SetSpaceAfter < in.conllu > fixed.conllu

Author: Martin Popel

class udapi.block.ud.setspaceafter.SetSpaceAfter(not_after='¡¿([{„', not_before='., ;:!?}])', fix_text=True, **kwargs)[source]¶

Bases: udapi.core.block.Block

Block for heuristic setting of the SpaceAfter=No MISC attribute.
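The role of the not_after and not_before parameters can be sketched as follows. This is a simplified illustration that treats both parameters as sets of single-character forms and ignores the goeswith exception; the function name is hypothetical and the actual block may apply further rules.

```python
NOT_AFTER = set('¡¿([{„')       # no space after these forms (default from the signature)
NOT_BEFORE = set('., ;:!?}])')  # no space before these forms (default from the signature)


def guess_no_space_after(forms):
    """Return one flag per form: True if SpaceAfter=No would be guessed."""
    flags = [False] * len(forms)
    for i in range(len(forms) - 1):
        if forms[i] in NOT_AFTER or forms[i + 1] in NOT_BEFORE:
            flags[i] = True
    return flags


flags = guess_no_space_after(["Hello", ",", "(", "world", ")", "!"])
```

In this example the flag is set for "Hello", "(", "world" and ")", which corresponds to the detokenized string "Hello, (world)!".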

static is_goeswith_exception(node)[source]¶

Is this node excepted from SpaceAfter=No because of the goeswith deprel?

Deprel=goeswith means that a space was (incorrectly) present in the original text, so we should not add SpaceAfter=No in these cases. We expect valid annotation of goeswith (no gaps, first token as head).

mark_no_space(node)[source]¶: Mark a node with SpaceAfter=No unless it is a goeswith exception.

process_tree(root)[source]¶: Process a UD tree

udapi.block.ud.setspaceafterfromtext module¶

Block SetSpaceAfterFromText for setting of SpaceAfter=No according to the sentence text.

Usage: udapy -s ud.SetSpaceAfterFromText < in.conllu > fixed.conllu

Author: Martin Popel

class udapi.block.ud.setspaceafterfromtext.SetSpaceAfterFromText(zones='all')[source]¶

Bases: udapi.core.block.Block

Block for setting of the SpaceAfter=No MISC attribute according to the sentence text.
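A minimal sketch of deriving the flags from the sentence text, assuming that the forms occur in the text in order and unchanged. The function name is hypothetical, and leaving the last token unflagged is a choice made here that mirrors the ud.GoeswithFromText example, not a documented rule of this block.

```python
def space_after_from_text(forms, text):
    """Return one flag per form: True if the form is NOT followed by a space."""
    flags = []
    pos = 0
    for form in forms:
        if not text.startswith(form, pos):
            raise ValueError(f"form {form!r} not found at position {pos}")
        pos += len(form)
        at_end = pos >= len(text)
        followed_by_space = not at_end and text[pos].isspace()
        flags.append(not at_end and not followed_by_space)
        # Skip the whitespace separating this form from the next one.
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return flags


forms = ["Never", "the", "less", ",", "I", "agree", "."]
flags = space_after_from_text(forms, "Never the less, I agree.")
```

Only "less" and "agree" are flagged here, since they are directly followed by punctuation in the text.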

process_tree(root)[source]¶: Process a UD tree

udapi.block.ud.splitunderscoretokens module¶

Block ud.SplitUnderscoreTokens splits tokens with underscores and attaches them using flat.

Usage: udapy -s ud.SplitUnderscoreTokens < in.conllu > fixed.conllu

Author: Martin Popel

class udapi.block.ud.splitunderscoretokens.SplitUnderscoreTokens(deprel=None, default_deprel='flat', **kwargs)[source]¶

Bases: udapi.core.block.Block

Block for splitting tokens with underscores and attaching the new nodes using deprel=flat.

E.g.:

    1 Hillary_Rodham_Clinton Hillary_Rodham_Clinton PROPN xpos 0 dep

is transformed into:

    1 Hillary Hillary PROPN xpos 0 dep
    2 Rodham Rodham PROPN xpos 1 flat
    3 Clinton Clinton PROPN xpos 1 flat

Real-world use cases: UD_Irish (default_deprel=fixed) and UD_Czech-CLTT v1.4.
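The transformation can be sketched on plain tuples using the same abbreviated columns as the example (id, form, lemma, upos, xpos, head, deprel). The function name is hypothetical; falling back to the forms when the lemma has a different number of parts is an assumption, and renumbering of the following nodes is not handled.

```python
def split_underscore_token(ord_, form, lemma, upos, xpos, head, deprel, new_deprel="flat"):
    """Split one token at underscores and attach the new words to the first part."""
    forms = form.split("_")
    lemmas = lemma.split("_")
    if len(lemmas) != len(forms):
        lemmas = forms  # assumption: reuse the forms if the lemma cannot be aligned
    # The first part keeps the original head and deprel.
    rows = [(ord_, forms[0], lemmas[0], upos, xpos, head, deprel)]
    for offset in range(1, len(forms)):
        rows.append((ord_ + offset, forms[offset], lemmas[offset],
                     upos, xpos, ord_, new_deprel))
    return rows


rows = split_underscore_token(1, "Hillary_Rodham_Clinton", "Hillary_Rodham_Clinton",
                              "PROPN", "xpos", 0, "dep")
for row in rows:
    print(*row)
```

Passing `new_deprel="fixed"` would correspond to the default_deprel=fixed setting mentioned for UD_Irish.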

deprel_for(node)[source]¶

Return deprel of the newly created nodes: flat, fixed, compound or its subtypes.

See http://universaldependencies.org/u/dep/flat.html, http://universaldependencies.org/u/dep/fixed.html and http://universaldependencies.org/u/dep/compound.html. Note that unlike the first two, deprel=compound does not need to be head-initial.

This method implements coarse heuristic rules to decide between fixed and flat.

process_node(node)[source]¶: Process a UD node
