udapi.block.ud package
Subpackages
- udapi.block.ud.bg package
- udapi.block.ud.cs package
- udapi.block.ud.de package
- udapi.block.ud.el package
- udapi.block.ud.es package
- udapi.block.ud.fr package
- udapi.block.ud.ga package
- udapi.block.ud.gl package
- udapi.block.ud.he package
- udapi.block.ud.pt package
- udapi.block.ud.ro package
- udapi.block.ud.ru package
Submodules
udapi.block.ud.addmwt module
Abstract base class ud.AddMwt for heuristic detection of multi-word tokens.
class udapi.block.ud.addmwt.AddMwt(zones='all')
Bases: udapi.core.block.Block
Detect and mark MWTs (split them into words and add the words to the tree).
multiword_analysis(node)
Return a dict with MWT info or None if node does not represent a multiword token.
An example return value is:
    {
        'form': 'aby bych',
        'lemma': 'aby být',
        'upos': 'SCONJ AUX',
        'xpos': 'J,------------- Vc-S---1-------',
        'feats': '_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin',  # _ means empty FEATS
        'deprel': '* aux',  # * means keep the original deprel
        'main': 0,  # which of the two words will inherit the original children (if any)
        'shape': 'siblings',  # the newly created nodes will be siblings or alternatively
        #'shape': 'subtree',  # the main-indexed node will be the head
    }
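A minimal sketch of how the space-separated values of such a dict could be expanded into one record per word. The helper name expand_mwt_analysis and its orig_deprel argument are hypothetical and not part of udapi; the sketch only follows the two conventions stated in the example (_ for empty FEATS, * for keeping the original deprel).

```python
def expand_mwt_analysis(analysis, orig_deprel):
    """Expand a multiword_analysis-style dict into one dict per word.

    Hypothetical helper for illustration only; not part of udapi.
    """
    forms = analysis["form"].split(" ")
    columns = {
        key: analysis[key].split(" ")
        for key in ("lemma", "upos", "xpos", "feats", "deprel")
    }
    words = []
    for i, form in enumerate(forms):
        feats = columns["feats"][i]
        deprel = columns["deprel"][i]
        words.append({
            "form": form,
            "lemma": columns["lemma"][i],
            "upos": columns["upos"][i],
            "xpos": columns["xpos"][i],
            # "_" means empty FEATS
            "feats": "" if feats == "_" else feats,
            # "*" means keep the deprel of the original node
            "deprel": orig_deprel if deprel == "*" else deprel,
            # the main word inherits the original children (if any)
            "is_main": i == analysis["main"],
        })
    return words


analysis = {
    "form": "aby bych",
    "lemma": "aby být",
    "upos": "SCONJ AUX",
    "xpos": "J,------------- Vc-S---1-------",
    "feats": "_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin",
    "deprel": "* aux",
    "main": 0,
    "shape": "siblings",
}
words = expand_mwt_analysis(analysis, orig_deprel="mark")
```

Here orig_deprel="mark" is just an assumed deprel of the unsplit token; the first word keeps it and the second word gets aux.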
udapi.block.ud.complywithtext module
Block ComplyWithText for adapting the nodes to comply with the text.
Implementation design details:
Usually, most of the inconsistencies between tree tokens and the raw text are simple to solve.
However, there may also be rare cases where it is not clear how to align the tokens
(nodes in the tree) with the raw text (stored in root.text).
This block tries to solve the general case using several heuristics.
It starts by running an LCS-like algorithm (LCS = longest common subsequence),
difflib.SequenceMatcher, on the raw text and the concatenation of the tokens’ forms,
i.e. on sequences of characters (as opposed to running LCS on sequences of tokens).
To prevent mis-alignment problems, we keep the spaces present in the raw text
and we insert spaces into the concatenated forms (tree_chars) according to SpaceAfter=No.
An example of a mis-alignment problem:
text “énfase na necesidade” with 4 nodes “énfase en a necesidade”
should be solved by adding multiword token “na” over the nodes “en” and “a”.
However, running LCS (or difflib) over the character sequences
“énfaseenanecesidade”
“énfasenanecesidade”
may result in énfase -> énfas.
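The character-level comparison can be reproduced with the standard library. The following is only a sketch: the helper names char_diffs and join_forms are hypothetical, and the real block applies further heuristics on top of the raw diff. It compares the same strings with and without the spaces, so the two alignments can be inspected side by side.

```python
import difflib


def char_diffs(tree_chars, text):
    """Return (tag, tree_substring, text_substring) triples from difflib."""
    matcher = difflib.SequenceMatcher(None, tree_chars, text, autojunk=False)
    return [
        (tag, tree_chars[i1:i2], text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def join_forms(forms, no_space_after=()):
    """Concatenate forms, adding a space unless the token index is listed
    in no_space_after (i.e. the token has SpaceAfter=No)."""
    chars = ""
    for index, form in enumerate(forms):
        chars += form
        if index not in no_space_after and index < len(forms) - 1:
            chars += " "
    return chars


forms = ["énfase", "en", "a", "necesidade"]
text = "énfase na necesidade"

# Alignment on bare character sequences (prone to mis-alignment).
without_spaces = char_diffs("".join(forms), text.replace(" ", ""))
# Alignment with spaces kept on both sides.
with_spaces = char_diffs(join_forms(forms), text)

for tag, tree_part, text_part in with_spaces:
    print(tag, repr(tree_part), repr(text_part))
```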
Author: Martin Popel
class udapi.block.ud.complywithtext.ComplyWithText(fix_text=True, prefer_mwt=True, allow_goeswith=True, max_mwt_length=4, **kwargs)
Bases: udapi.core.block.Block
Adapt the nodes to comply with the text.
merge_diffs(orig_diffs, char_nodes)
Make sure each diff starts on original token boundary.
If not, merge the diff with the previous diff. E.g. (equal, “5”, “5”), (replace, “-6”, “–7”) is changed into (replace, “5-6”, “5–7”)
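The merging rule can be sketched as follows. This is a simplification under stated assumptions: the function name merge_diffs_sketch and the token_starts argument (a set of character offsets where tokens begin) are hypothetical, whereas the real method receives char_nodes.

```python
def merge_diffs_sketch(diffs, token_starts):
    """Merge each diff that does not start on a token boundary into the previous one.

    diffs: (edit, tree_string, text_string) triples in left-to-right order.
    token_starts: character offsets in the tree string where a token begins.
    Hypothetical simplification; not the udapi implementation.
    """
    merged = []
    offset = 0  # start position of the current diff in the tree string
    for edit, tree_str, text_str in diffs:
        if merged and offset not in token_starts:
            # The diff starts inside a token: glue it to the previous diff.
            _, prev_tree, prev_text = merged[-1]
            merged[-1] = ("replace", prev_tree + tree_str, prev_text + text_str)
        else:
            merged.append((edit, tree_str, text_str))
        offset += len(tree_str)
    return merged


# The token "5-6" starts at offset 0, so the second diff starts mid-token.
example = merge_diffs_sketch(
    [("equal", "5", "5"), ("replace", "-6", "–7")],
    token_starts={0},
)
```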
udapi.block.ud.convert1to2 module
Block Convert1to2 for converting UD v1 to UD v2.
See http://universaldependencies.org/v2/summary.html for the description of all UD v2 changes. IMPORTANT: this code does only SOME of the changes and the output should be checked.
Note that this block is not idempotent, i.e. you should not apply it twice on the same data. It should be idempotent when skipping the coordination transformations (skip=coord).
Author: Martin Popel, based on https://github.com/UniversalDependencies/tools/tree/master/v2-conversion by Sebastian Schuster.
class udapi.block.ud.convert1to2.Convert1to2(skip='', save_stats=True, **kwargs)
Bases: udapi.core.block.Block
Block for converting UD v1 to UD v2.
HEAD_PROMOTION = {'advcl': 1, 'advmod': 5, 'ccomp': 2, 'csubj': 4, 'iobj': 7, 'nsubj': 9, 'obj': 8, 'obl': 6, 'xcomp': 3}
change_deprel_simple(node)
mwe→fixed, dobj→obj, pass→:pass, name→flat, foreign→flat+Foreign=Yes.
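The listed renamings can be illustrated on plain strings. This is a sketch only: the function convert_deprel_sketch and the table SIMPLE_DEPREL_CHANGES are hypothetical names, and the sketch assumes that "pass→:pass" refers to deprels ending in "pass" (e.g. nsubjpass becoming nsubj:pass).

```python
# Hypothetical lookup table covering the one-to-one renamings listed above.
SIMPLE_DEPREL_CHANGES = {
    "mwe": "fixed",
    "dobj": "obj",
    "name": "flat",
    "foreign": "flat",
}


def convert_deprel_sketch(deprel):
    """Return (new_deprel, extra_feats) for a UD v1 deprel string."""
    extra_feats = {}
    if deprel == "foreign":
        # foreign -> flat, with the information moved to a feature
        extra_feats["Foreign"] = "Yes"
    if deprel in SIMPLE_DEPREL_CHANGES:
        return SIMPLE_DEPREL_CHANGES[deprel], extra_feats
    if deprel.endswith("pass") and deprel != "pass":
        # e.g. nsubjpass -> nsubj:pass (assumed reading of "pass -> :pass")
        return deprel[: -len("pass")] + ":pass", extra_feats
    return deprel, extra_feats


for old in ("mwe", "dobj", "nsubjpass", "name", "foreign", "nmod"):
    print(old, "->", convert_deprel_sketch(old))
```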
change_feats(node)
Negative→Polarity, Aspect=Pro→Prosp, VerbForm=Trans→Conv, Definite=Red→Cons,…
Also Foreign=Foreign→Yes and log if Tense=Nar or NumType=Gen is used.
static change_headfinal(node, deprel)
deprel=goeswith|flat|fixed|appos must be a head-initial flat structure.
change_neg(node)
neg→advmod/det/ToDo + Polarity=Neg.
In addition, if there is a node with deprel=neg and upos=INTJ, it is checked whether it is possibly a real interjection or a negation particle, which should have upos=PART (as documented in http://universaldependencies.org/u/pos/PART.html). This kind of error (INTJ instead of PART for “не”) is common e.g. in Bulgarian v1.4, but I hope the rule is language-independent (enough to be included here).
fix_remnants_in_tree(root)
Change ellipsis with remnant deprels to UDv2 ellipsis with orphans.
Remnant’s parent is always the correlate (same-role) node. Usually, correlate’s parent is the head of the whole ellipsis subtree, i.e. the first conjunct. However, sometimes remnants are deeper, e.g. ‘Over 300 Iraqis are reported dead and 500 wounded.’ with edges:
nsubjpass(reported, Iraqis) nummod(Iraqis, 300) remnant(300, 500)
Let’s expect all remnants in one tree are part of the same ellipsis structure.
TODO: theoretically, there may be more ellipsis structures with remnants in one tree, but I have no idea how to distinguish them from the deeper-remnants cases.
static is_nominal(node)
Returns ‘no’ (for predicates), ‘yes’ (sure nominals) or ‘maybe’.
Used in change_nmod.
static is_verbal(node)
Returns True for verbs and nodes with copula child.
Used in change_neg.
log(node, short_msg, long_msg)
Log node.address() + long_msg and add ToDo=short_msg to node.misc.
process_tree(tree)
Apply all the changes on the current tree.
This method is automatically called on each tree by Udapi. After doing tree-scope changes (remnants), it calls process_node on each node. By overriding this method in subclasses you can reuse just some of the implemented changes.
udapi.block.ud.exgoogle2ud module
Block ud.ExGoogle2ud converts data that were originally annotated in Google style, then converted to UDv2 with an older version of ud.Google2ud, and then manually edited. We don’t want to lose these edits, so we cannot simply rerun the newer version of ud.Google2ud on the original Google data.
class udapi.block.ud.exgoogle2ud.ExGoogle2ud(lang='unk', **kwargs)
Bases: udapi.core.block.Block
Convert former Google Universal Dependency Treebank into UD style.
udapi.block.ud.fixchain module
Block ud.FixChain for making sure deprel=fixed|flat|goeswith|list does not form a chain.
class udapi.block.ud.fixchain.FixChain(deprels='fixed, flat, goeswith, list', **kwargs)
Bases: udapi.core.block.Block
Make sure deprel=fixed etc. does not form a chain, but a flat structure.
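What turning a chain into a flat structure means can be sketched on a toy tree stored as two dicts. The function flatten_chains and the dict-based tree are hypothetical and do not use the udapi node API; the sketch also assumes that a chain consists of edges with the same deprel.

```python
def flatten_chains(parents, deprels,
                   flat_deprels=("fixed", "flat", "goeswith", "list")):
    """Re-attach chained nodes to the first node of the chain.

    parents: dict node_id -> parent_id (0 is the technical root).
    deprels: dict node_id -> deprel of the edge from the parent.
    Hypothetical toy representation; not the udapi implementation.
    """
    new_parents = dict(parents)
    for node in sorted(parents):
        if deprels[node] not in flat_deprels:
            continue
        parent = new_parents[node]
        # Climb while the parent is itself attached by the same deprel.
        while parent != 0 and deprels.get(parent) == deprels[node]:
            parent = new_parents[parent]
        new_parents[node] = parent
    return new_parents


# Chain 1 <-fixed- 2 <-fixed- 3 becomes flat: both 2 and 3 depend on 1.
chain_parents = {1: 0, 2: 1, 3: 2}
chain_deprels = {1: "root", 2: "fixed", 3: "fixed"}
flat_parents = flatten_chains(chain_parents, chain_deprels)
```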
udapi.block.ud.fixpunct module
Block ud.FixPunct for making sure punctuation is attached projectively.
Punctuation in Universal Dependencies has the tag PUNCT, dependency relation punct, and is always attached projectively, usually to the head of a neighboring subtree to its left or right. Punctuation normally does not have children. If it does, we will fix it first.
This block tries to re-attach punctuation projectively and according to the guidelines. It should help in cases where punctuation is attached randomly, always to the root or always to the neighboring word. However, there are limits to what it can do; for example it cannot always recognize whether a comma is introduced to separate the block to its left or to its right. Hence if the punctuation before running this block is almost good, the block may actually do more harm than good.
Since the punctuation should not have children, we should not create a non-projectivity if we check the root edges going to the right. However, it is still possible that we will attach the punctuation non-projectively by joining a non-projectivity that already exists. For example, the left neighbor (node i-1) may have its parent at i-3, and the node i-2 forms a gap (does not depend on i-3).
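The gap situation described in the last paragraph can be checked with a small projectivity test. The sketch uses the standard definition (an edge is projective if every node lying between the parent and the child is a descendant of the parent) on a toy tree stored as a dict; the function names are hypothetical and the udapi node API is not used.

```python
def is_descendant(node, ancestor, parents):
    """True if ancestor lies on the path from node to the technical root 0."""
    while node != 0:
        node = parents[node]
        if node == ancestor:
            return True
    return False


def is_projective_edge(child, parents):
    """True if all nodes between child and its parent descend from the parent."""
    parent = parents[child]
    low, high = sorted((child, parent))
    return all(
        is_descendant(between, parent, parents)
        for between in range(low + 1, high)
    )


# Punctuation is node i=5; node i-1=4 has its parent at i-3=2,
# and node i-2=3 forms a gap because it depends on node 1, not on node 2.
parents_punct_on_neighbor = {1: 0, 2: 1, 3: 1, 4: 2, 5: 4}
parents_punct_on_head = {1: 0, 2: 1, 3: 1, 4: 2, 5: 2}

print(is_projective_edge(5, parents_punct_on_neighbor))  # edge 4 -> 5
print(is_projective_edge(5, parents_punct_on_head))      # edge 2 -> 5
```

In the second tree the punctuation edge spans node 3, so it joins the non-projectivity that the edge from node 2 to node 4 already had.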
class udapi.block.ud.fixpunct.FixPunct(**kwargs)
Bases: udapi.core.block.Block
Make sure punctuation nodes are attached projectively.
udapi.block.ud.fixpunctchild module
Block ud.FixPunctChild for making sure punctuation nodes have no children.
class udapi.block.ud.fixpunctchild.FixPunctChild(zones='all')
Bases: udapi.core.block.Block
Make sure punct nodes have no children by rehanging the children upwards.
udapi.block.ud.fixrightheaded module
Block ud.FixRightheaded for making sure flat,fixed,appos,goeswith,list is head initial.
Note that deprel=conj should also be left-headed, but it is not included in this fix-block by default because coordinations are more difficult to convert and one should use a specialized block instead.
class udapi.block.ud.fixrightheaded.FixRightheaded(deprels='flat, fixed, appos, goeswith, list', **kwargs)
Bases: udapi.core.block.Block
Make sure deprel=flat,fixed,… form a head-initial (i.e. left-headed) structure.
udapi.block.ud.goeswithfromtext module
Block GoeswithFromText for splitting nodes and attaching via goeswith according to the text.
Usage: udapy -s ud.GoeswithFromText < in.conllu > fixed.conllu
Author: Martin Popel
class udapi.block.ud.goeswithfromtext.GoeswithFromText(keep_lemma=False, **kwargs)
Bases: udapi.core.block.Block
Block for splitting nodes and attaching via goeswith according to the sentence text.
For example:
    # text = Never the less, I agree.
    1 Nevertheless nevertheless ADV _ _ 4 advmod _ SpaceAfter=No
    2 , , PUNCT _ _ 4 punct _ _
    3 I I PRON _ _ 4 nsubj _ _
    4 agree agree VERB _ _ 0 root _ SpaceAfter=No
    5 . . PUNCT _ _ 4 punct _ _
is changed to:
    # text = Never the less, I agree.
    1 Never never ADV _ _ 6 advmod _ _
    2 the the ADV _ _ 1 goeswith _ _
    3 less less ADV _ _ 1 goeswith _ SpaceAfter=No
    4 , , PUNCT _ _ 6 punct _ _
    5 I I PRON _ _ 6 nsubj _ _
    6 agree agree VERB _ _ 0 root _ SpaceAfter=No
    7 . . PUNCT _ _ 6 punct _ _
If used with parameter keep_lemma=1, the result is:
    # text = Never the less, I agree.
    1 Never nevertheless ADV _ _ 6 advmod _ _
    2 the _ ADV _ _ 1 goeswith _ _
    3 less _ ADV _ _ 1 goeswith _ SpaceAfter=No
    4 , , PUNCT _ _ 6 punct _ _
    5 I I PRON _ _ 6 nsubj _ _
    6 agree agree VERB _ _ 0 root _ SpaceAfter=No
    7 . . PUNCT _ _ 6 punct _ _
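How a single form might be split according to the sentence text can be sketched as below. The function split_form_by_text is hypothetical, and the case-insensitive matching is an assumption of the sketch; the actual alignment logic of the block is not shown in this documentation. Spaces in the text that have no counterpart in the form are treated as split points.

```python
def split_form_by_text(form, text):
    """Split form at the spaces found in the corresponding prefix of text.

    Returns the list of parts, or None if text does not start with the
    characters of form (ignoring extra spaces and letter case).
    Hypothetical helper; not part of udapi.
    """
    parts, current, pos = [], "", 0
    for char in form:
        # A space in the text with no counterpart in the form is a split point.
        while pos < len(text) and text[pos] == " " and char != " ":
            if current:
                parts.append(current)
                current = ""
            pos += 1
        if pos >= len(text) or text[pos].lower() != char.lower():
            return None
        current += text[pos]
        pos += 1
    parts.append(current)
    return parts


parts = split_form_by_text("Nevertheless", "Never the less, I agree.")
print(parts)
```

In the example above, the first part would stay on the original node and the remaining parts would become new nodes attached via goeswith.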
udapi.block.ud.google2ud module
Block ud.Google2ud for converting Google Universal Dependency Treebank into UD.
Usage: udapy -s ud.Google2ud < google.conllu > ud2.conllu
class udapi.block.ud.google2ud.Google2ud(lang='unk', non_mwt_langs='ar en ja ko zh', **kwargs)
Bases: udapi.block.ud.convert1to2.Convert1to2
Convert Google Universal Dependency Treebank into UD style.
fix_deprel(node)
Convert Google dependency relations to UD deprels.
Change topology where needed.
static fix_feats(node)
Remove language prefixes, capitalize names and values, apply FEATS_CHANGE.
fix_goeswith(node)
Solve deprel=goeswith which is almost always wrong in the Google annotation.
fix_multiword_prep(node)
Solve pobj/pcomp depending on pobj/pcomp.
Only some of these cases are multi-word prepositions (which should get deprel=fixed).
process_tree(root)
Apply all the changes on the current tree.
This method is automatically called on each tree by Udapi. After doing tree-scope changes (remnants), it calls process_node on each node. By overriding this method in subclasses you can reuse just some of the implemented changes.
udapi.block.ud.joinasmwt module
Block ud.JoinAsMwt for creating multi-word tokens
if multiple neighboring words are not separated by a space and the boundaries between the word forms are alphabetical.
class udapi.block.ud.joinasmwt.JoinAsMwt(revert_orig_form=True, **kwargs)
Bases: udapi.core.block.Block
Create MWTs if words are not separated by a space.
udapi.block.ud.markbugs module
Block MarkBugs for checking suspicious/wrong constructions in UD v2.
See http://universaldependencies.org/release_checklist.html#syntax and http://universaldependencies.org/svalidation.html. IMPORTANT: the svalidation.html overview is not generated by this code, but by SETS-search-interface rules, which may give different results than this code.
Usage: udapy -s ud.MarkBugs < in.conllu > marked.conllu 2> log.txt
Errors are both logged to stderr and marked within the nodes’ MISC field, e.g. node.misc['Bug'] = 'aux-chain', so the output conllu file can be searched for “Bug=” occurrences.
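Searching the output for the marks needs nothing beyond the CoNLL-U column layout (ten tab-separated columns, MISC last). A minimal sketch in plain Python, with a hypothetical function name and a made-up two-token sample:

```python
def find_marked_bugs(conllu_text):
    """Return (id, form, bug_name) for every Bug= item in the MISC column."""
    bugs = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a regular CoNLL-U token line
        for item in columns[9].split("|"):
            if item.startswith("Bug="):
                bugs.append((columns[0], columns[1], item[len("Bug="):]))
    return bugs


# Made-up sample for illustration; not real MarkBugs output.
sample = "\n".join([
    "# text = has been",
    "\t".join(["1", "has", "have", "AUX", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "been", "be", "AUX", "_", "_", "1", "aux", "_",
               "Bug=aux-chain|SpaceAfter=No"]),
    "",
])
print(find_marked_bugs(sample))
```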
Author: Martin Popel based on descriptions at http://universaldependencies.org/svalidation.html
class udapi.block.ud.markbugs.MarkBugs(save_stats=True, tests=None, skip=None, max_cop_lemmas=2, **kwargs)
Bases: udapi.core.block.Block
Block for checking suspicious/wrong constructions in UD v2.
udapi.block.ud.removemwt module
Block ud.RemoveMwt for removing multi-word tokens.
class udapi.block.ud.removemwt.RemoveMwt(zones='all')
Bases: udapi.core.block.Block
Substitute MWTs with one word representing the whole MWT.
udapi.block.ud.setspaceafter module
Block SetSpaceAfter for heuristic setting of SpaceAfter=No.
Usage: udapy -s ud.SetSpaceAfter < in.conllu > fixed.conllu
Author: Martin Popel
class udapi.block.ud.setspaceafter.SetSpaceAfter(not_after='¡¿([{„', not_before='., ;:!?}])', fix_text=True, **kwargs)
Bases: udapi.core.block.Block
Block for heuristic setting of the SpaceAfter=No MISC attribute.
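The role of the not_after and not_before parameters can be sketched on a plain list of forms. This is an assumed reading of the two defaults (no space after an opening symbol, no space before a closing symbol); the function name is hypothetical, the restriction to single-character forms is a choice of this sketch, and the real block may handle further cases such as quotation marks.

```python
NOT_AFTER = '¡¿([{„'
NOT_BEFORE = '., ;:!?}])'


def guess_no_space_after(forms, not_after=NOT_AFTER, not_before=NOT_BEFORE):
    """Return one boolean per form: True where SpaceAfter=No is guessed."""
    flags = []
    for index, form in enumerate(forms):
        next_form = forms[index + 1] if index + 1 < len(forms) else None
        # Only single-character forms are looked up (a choice of this sketch).
        opens = len(form) == 1 and form in not_after
        next_closes = (
            next_form is not None
            and len(next_form) == 1
            and next_form in not_before
        )
        flags.append(opens or next_closes)
    return flags


forms = ["Hello", ",", "world", "(", "again", ")", "!"]
flags = guess_no_space_after(forms)
```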
static is_goeswith_exception(node)
Is this node excepted from SpaceAfter=No because of the goeswith deprel?
Deprel=goeswith means that a space was (incorrectly) present in the original text, so we should not add SpaceAfter=No in these cases. We expect valid annotation of goeswith (no gaps, first token as head).
udapi.block.ud.setspaceafterfromtext module
Block SetSpaceAfterFromText for setting of SpaceAfter=No according to the sentence text.
Usage: udapy -s ud.SetSpaceAfterFromText < in.conllu > fixed.conllu
Author: Martin Popel
class udapi.block.ud.setspaceafterfromtext.SetSpaceAfterFromText(zones='all')
Bases: udapi.core.block.Block
Block for setting of the SpaceAfter=No MISC attribute according to the sentence text.
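Deriving the attribute from the sentence text amounts to walking through the text token by token and checking whether a space follows. A minimal sketch, assuming the forms match the text exactly; the function name is hypothetical, and leaving the last token unmarked is a choice of this sketch.

```python
def space_after_no_flags(forms, text):
    """Return one boolean per form: True where the token is directly
    followed by a non-space character (i.e. SpaceAfter=No)."""
    flags = []
    pos = 0
    for form in forms:
        if not text.startswith(form, pos):
            raise ValueError(f"form {form!r} not found at offset {pos}")
        pos += len(form)
        at_end = pos >= len(text)
        followed_by_space = not at_end and text[pos].isspace()
        # The last token of the sentence is left unmarked in this sketch.
        flags.append(not at_end and not followed_by_space)
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return flags


forms = ["Never", "the", "less", ",", "I", "agree", "."]
flags = space_after_no_flags(forms, "Never the less, I agree.")
```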
udapi.block.ud.splitunderscoretokens module
Block ud.SplitUnderscoreTokens splits tokens with underscores and attaches them using flat.
Usage: udapy -s ud.SplitUnderscoreTokens < in.conllu > fixed.conllu
Author: Martin Popel
class udapi.block.ud.splitunderscoretokens.SplitUnderscoreTokens(deprel=None, default_deprel='flat', **kwargs)
Bases: udapi.core.block.Block
Block for splitting tokens with underscores and attaching the new nodes using deprel=flat.
E.g.:
    1 Hillary_Rodham_Clinton Hillary_Rodham_Clinton PROPN xpos 0 dep
is transformed into:
    1 Hillary Hillary PROPN xpos 0 dep
    2 Rodham Rodham PROPN xpos 1 flat
    3 Clinton Clinton PROPN xpos 1 flat
Real-world use cases: UD_Irish (default_deprel=fixed) and UD_Czech-CLTT v1.4.
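The transformation of a single token can be sketched with plain tuples. The function name and the tuple layout (id, form, lemma, upos, head, deprel) are hypothetical, the fallback for lemmas with a different number of parts is an assumption, and renumbering of the following tokens is left out.

```python
def split_underscore_token(token_id, form, lemma, upos, head, deprel,
                           new_deprel="flat"):
    """Split one token on underscores into head-initial rows.

    Returns (id, form, lemma, upos, head, deprel) tuples. Hypothetical
    helper; renumbering of the following tokens is not handled here.
    """
    form_parts = form.split("_")
    lemma_parts = lemma.split("_")
    if len(lemma_parts) != len(form_parts):
        lemma_parts = form_parts  # assumption: fall back to the forms
    rows = []
    for index, (part, lemma_part) in enumerate(zip(form_parts, lemma_parts)):
        if index == 0:
            # The first part keeps the original head and deprel.
            rows.append((token_id, part, lemma_part, upos, head, deprel))
        else:
            # Later parts attach to the first part.
            rows.append((token_id + index, part, lemma_part, upos,
                         token_id, new_deprel))
    return rows


rows = split_underscore_token(
    1, "Hillary_Rodham_Clinton", "Hillary_Rodham_Clinton", "PROPN", 0, "dep"
)
```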
deprel_for(node)
Return deprel of the newly created nodes: flat, fixed, compound or its subtypes.
See http://universaldependencies.org/u/dep/flat.html, http://universaldependencies.org/u/dep/fixed.html and http://universaldependencies.org/u/dep/compound.html. Note that unlike the first two, deprel=compound does not need to be head-initial.
This method implements coarse heuristic rules to decide between fixed and flat.