Welcome to Udapi’s documentation!¶
Udapi is a framework providing an API for processing Universal Dependencies data.
Installation¶
You need Python 3.3 or higher, pip3 and git.
Let’s clone the git repo to ~/udapi-python/, install the dependencies, and set up $PATH and $PYTHONPATH accordingly:
cd
git clone https://github.com/udapi/udapi-python.git
pip3 install --user -r udapi-python/requirements.txt
echo '## Use Udapi from ~/udapi-python/ ##' >> ~/.bashrc
echo 'export PATH="$HOME/udapi-python/bin:$PATH"' >> ~/.bashrc
echo 'export PYTHONPATH="$HOME/udapi-python/:$PYTHONPATH"' >> ~/.bashrc
source ~/.bashrc # or open new bash
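To verify the installation, you can let Python report where the package was loaded from (a quick sanity check; an error here usually means $PYTHONPATH is not set correctly):
python3 -c 'import udapi; print(udapi.__file__)'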
API Documentation¶
udapi package¶
RehangPrepositions demo block.
-
class
udapi.block.demo.rehangprepositions.
RehangPrepositions
(zones='all')[source]¶ Bases:
udapi.core.block.Block
This block takes all prepositions (upos=ADP) and rehangs them above their parent.
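A minimal sketch of what such a block can look like (the class name is hypothetical; the rehanging logic is only an illustration based on the Block API and node properties described later in this documentation, not the block’s exact code):
from udapi.core.block import Block

class RehangPrepositionsSketch(Block):
    """Illustrative sketch: rehang each preposition above its current parent."""
    def process_node(self, node):
        if node.upos == 'ADP' and node.parent.parent is not None:
            old_parent = node.parent
            node.parent = old_parent.parent  # attach the ADP one level higher
            old_parent.parent = node         # the former parent becomes its child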
Block & script eval.Conll17 for evaluating LAS, UAS, etc. as in the CoNLL 2017 UD shared task.
This is a reimplementation of the CoNLL2017 shared task official evaluation script, http://universaldependencies.org/conll17/evaluation.html
The gold trees and predicted (system-output) trees need to be sentence-aligned e.g. using util.ResegmentGold. Unlike in eval.Parsing, the gold and predicted trees can have different tokenization.
An example usage and output:
$ udapy read.Conllu zone=gold files=gold.conllu \
read.Conllu zone=pred files=pred.conllu ignore_sent_id=1 \
eval.Conll17
Metric | Precision | Recall | F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Words | 27.91 | 52.17 | 36.36 | 100.00
UPOS | 27.91 | 52.17 | 36.36 | 100.00
XPOS | 27.91 | 52.17 | 36.36 | 100.00
Feats | 27.91 | 52.17 | 36.36 | 100.00
Lemma | 27.91 | 52.17 | 36.36 | 100.00
UAS | 16.28 | 30.43 | 21.21 | 58.33
LAS | 16.28 | 30.43 | 21.21 | 58.33
CLAS | 10.34 | 16.67 | 12.77 | 37.50
For evaluating multiple systems and testsets (as in CoNLL2017) stored in systems/testset_name/system_name.conllu you can use:
#!/bin/bash
SYSTEMS=`ls systems`
[[ $# -ne 0 ]] && SYSTEMS=$@
set -x
set -e
for sys in $SYSTEMS; do
mkdir -p results/$sys
for testset in `ls systems/$sys`; do
udapy read.Conllu zone=gold files=gold/$testset \
read.Conllu zone=pred files=systems/$sys/$testset ignore_sent_id=1 \
util.ResegmentGold \
eval.Conll17 print_results=0 print_raw=1 \
> results/$sys/${testset%.conllu}
done
done
python3 `python3 -c 'import udapi.block.eval.conll17 as x; print(x.__file__)'` -r 100
The last line executes this block as a script and computes bootstrap resampling with 100 resamples (default=1000, it is recommended to keep the default or higher value unless testing the interface). This prints the ranking and confidence intervals (95% by default) and also p-values for each pair of systems with neighboring ranks. If the difference in LAS is significant (according to a paired bootstrap test, by default if p < 0.05), a line is printed between the two systems.
The output looks like:
1. Stanford 76.17 ± 0.12 (76.06 .. 76.30) p=0.001
------------------------------------------------------------
2. C2L2 74.88 ± 0.12 (74.77 .. 75.01) p=0.001
------------------------------------------------------------
3. IMS 74.29 ± 0.13 (74.16 .. 74.43) p=0.001
------------------------------------------------------------
4. HIT-SCIR 71.99 ± 0.14 (71.84 .. 72.12) p=0.001
------------------------------------------------------------
5. LATTICE 70.81 ± 0.13 (70.67 .. 70.94) p=0.001
------------------------------------------------------------
6. NAIST-SATO 70.02 ± 0.13 (69.89 .. 70.16) p=0.001
------------------------------------------------------------
7. Koc-University 69.66 ± 0.13 (69.52 .. 69.79) p=0.002
------------------------------------------------------------
8. UFAL-UDPipe-1-2 69.36 ± 0.13 (69.22 .. 69.49) p=0.001
------------------------------------------------------------
9. UParse 68.75 ± 0.14 (68.62 .. 68.89) p=0.003
------------------------------------------------------------
10. Orange-Deskin 68.50 ± 0.13 (68.37 .. 68.62) p=0.448
11. TurkuNLP 68.48 ± 0.14 (68.34 .. 68.62) p=0.029
------------------------------------------------------------
12. darc 68.29 ± 0.13 (68.16 .. 68.42) p=0.334
13. conll17-baseline 68.25 ± 0.14 (68.11 .. 68.38) p=0.003
------------------------------------------------------------
14. MQuni 67.93 ± 0.13 (67.80 .. 68.06) p=0.062
15. fbaml 67.78 ± 0.13 (67.65 .. 67.91) p=0.283
16. LyS-FASTPARSE 67.73 ± 0.13 (67.59 .. 67.85) p=0.121
17. LIMSI-LIPN 67.61 ± 0.14 (67.47 .. 67.75) p=0.445
18. RACAI 67.60 ± 0.13 (67.46 .. 67.72) p=0.166
19. IIT-Kharagpur 67.50 ± 0.14 (67.36 .. 67.64) p=0.447
20. naistCL 67.49 ± 0.15 (67.34 .. 67.63)
TODO: Bootstrap currently reports only LAS, but all the other measures could be added as well.
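To illustrate the significance test described above, here is a self-contained sketch of paired bootstrap resampling over per-sentence LAS counts (the count arrays are hypothetical inputs; the actual script keeps its own per-sentence statistics):
import random

def paired_bootstrap_p(correct_a, correct_b, totals, resamples=1000):
    """Estimate the p-value that system A is not better than system B in LAS."""
    n = len(totals)
    not_better = 0
    for _ in range(resamples):
        sample = [random.randrange(n) for _ in range(n)]
        las_a = sum(correct_a[i] for i in sample) / sum(totals[i] for i in sample)
        las_b = sum(correct_b[i] for i in sample) / sum(totals[i] for i in sample)
        if las_a <= las_b:
            not_better += 1
    return not_better / resamples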
-
class
udapi.block.eval.conll17.
Conll17
(gold_zone='gold', print_raw=False, print_results=True, **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
Evaluate labeled and unlabeled attachment score (LAS and UAS).
Block eval.F1 for evaluating differences between sentences with P/R/F1.
eval.F1 zones=en_pred gold_zone=en_gold details=0
prints something like:
predicted = 210
gold = 213
correct = 210
precision = 100.00%
recall = 98.59%
F1 = 99.29%
eval.F1 gold_zone=y attributes=form,upos focus='(?i:an?|the)_DET' details=4
prints something like:
=== Details ===
token pred gold corr prec rec F1
the_DET 711 213 188 26.44% 88.26% 40.69%
The_DET 82 25 19 23.17% 76.00% 35.51%
a_DET 0 62 0 0.00% 0.00% 0.00%
an_DET 0 16 0 0.00% 0.00% 0.00%
=== Totals ===
predicted = 793
gold = 319
correct = 207
precision = 26.10%
recall = 64.89%
F1 = 37.23%
This block finds differences between nodes of trees in two zones
and reports the overall precision, recall and F1.
The two zones are “predicted” (on which this block is applied) and “gold” (which needs to be specified with the parameter gold_zone).
This block also reports the number of total nodes in the predicted zone
and in the gold zone and the number of “correct” nodes,
that is predicted nodes which are also in the gold zone.
By default two nodes are considered “the same” if they have the same form, but it is possible to check also other node attributes (with the parameter attributes).
As usual:
precision = correct / predicted
recall = correct / gold
F1 = 2 * precision * recall / (precision + recall)
The implementation is based on finding the longest common subsequence (LCS) between the nodes in the two trees. This means that the two zones do not need to be explicitly word-aligned.
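A minimal sketch of this evaluation idea (not the block’s actual code), using Python’s difflib, which provides an LCS-like matcher:
from difflib import SequenceMatcher

def prf1(pred_tokens, gold_tokens):
    """Precision/recall/F1 over an LCS alignment (assumes non-empty inputs)."""
    matcher = SequenceMatcher(None, pred_tokens, gold_tokens, autojunk=False)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    precision = correct / len(pred_tokens)
    recall = correct / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

print(prf1('the dog barks'.split(), 'a dog barks'.split()))  # approx. (0.67, 0.67, 0.67)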
-
class
udapi.block.eval.f1.
F1
(gold_zone, attributes='form', focus=None, details=4, **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
Evaluate differences between sentences (in different zones) with P/R/F1.
Args:
- zones: Which zone contains the “predicted” trees? Make sure that you specify just one zone. If you leave the default value “all” and the document contains more zones, the results will be mixed, which is most likely not what you wanted. Exception: if the document contains just two zones (predicted and gold trees), you can keep the default value “all” because this block will skip the comparison of the gold zone with itself.
- gold_zone: Which zone contains the gold-standard trees?
- attributes: comma-separated list of attributes which should be checked when deciding whether two nodes are equivalent in LCS.
- focus: regular expression constraining the tokens we are interested in. If more attributes were specified in the attributes parameter, their values are concatenated with underscore, so focus should reflect that, e.g. attributes=form,upos focus='(a|the)_DET'. For a case-insensitive focus use e.g. focus='(?i)the' (which is equivalent to focus='[Tt][Hh][Ee]').
- details: Print also detailed statistics for each token (matching the focus). The value of this parameter specifies the number of tokens to include. The tokens are sorted according to the sum of their predicted and gold counts.
Block eval.Parsing for evaluating UAS and LAS - gold and pred must have the same tokens.
-
class
udapi.block.eval.parsing.
Parsing
(gold_zone, **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
Evaluate labeled and unlabeled attachment score (LAS and UAS).
AddSentences class is a reader for adding plain-text sentences.
-
class
udapi.block.read.addsentences.
AddSentences
(zone='', into='text', **kwargs)[source]¶ Bases:
udapi.core.basereader.BaseReader
A reader for adding plain-text sentences (one sentence per line) to existing trees.
The sentences are added to existing trees. This is useful e.g. if the original raw texts are stored in a separate file:
cat in.conllu | udapy -s read.Conllu read.AddSentences files=in.txt > merged.conllu
Conllu is a reader block for CoNLL-U files.
-
class
udapi.block.read.conllu.
Conllu
(strict=False, separator='tab', empty_parent='warn', attributes='ord, form, lemma, upos, xpos, feats, head, deprel, deps, misc', **kwargs)[source]¶ Bases:
udapi.core.basereader.BaseReader
A reader of the CoNLL-U files.
Sentences class is a reader for plain-text sentences.
-
class
udapi.block.read.sentences.
Sentences
(files='-', filehandle=None, zone='keep', bundles_per_doc=0, encoding='utf-8', sent_id_filter=None, split_docs=False, ignore_sent_id=False, **kwargs)[source]¶ Bases:
udapi.core.basereader.BaseReader
A reader for plain-text sentences (one sentence per line) files.
Vislcg is a reader block for the VISL-cg format.
-
class
udapi.block.read.vislcg.
Vislcg
(files='-', filehandle=None, zone='keep', bundles_per_doc=0, encoding='utf-8', sent_id_filter=None, split_docs=False, ignore_sent_id=False, **kwargs)[source]¶ Bases:
udapi.core.basereader.BaseReader
A reader of the VISL-cg format, suitable for the VISL Constraint Grammar Parser.
Block tokenize.OnWhitespace
Block tokenize.Simple
-
class
udapi.block.tokenize.simple.
Simple
(zones='all')[source]¶ Bases:
udapi.block.tokenize.onwhitespace.OnWhitespace
Simple tokenizer: splits on whitespace and punctuation, fills SpaceAfter=No.
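A hypothetical end-to-end scenario (assuming raw.txt contains one sentence per line; read.Sentences and the -s shortcut for writing CoNLL-U are described elsewhere in this documentation):
udapy -s read.Sentences tokenize.Simple < raw.txt > tokenized.conllu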
Block Deproj for deprojectivization of pseudo-projective trees à la Nivre & Nilsson (2005).
See ud.transform.Proj for details. TODO: implement also path and head+path strategies.
transform.Flatten block for flattening trees.
-
class
udapi.block.transform.flatten.
Flatten
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Apply node.parent = node.root; node.deprel = ‘root’ on all nodes.
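In other words, the block is essentially equivalent to this minimal sketch (the class name is hypothetical; it relies only on the Block API and node properties documented below):
from udapi.core.block import Block

class FlattenSketch(Block):
    """Attach every node directly to the root and label the edge 'root'."""
    def process_node(self, node):
        node.parent = node.root
        node.deprel = 'root'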
Block Proj for (pseudo-)projectivization of trees à la Nivre & Nilsson (2005).
See http://www.aclweb.org/anthology/P/P05/P05-1013.pdf. This block tries to replicate Malt parser’s projectivization: http://www.maltparser.org/userguide.html#singlemalt_proj http://www.maltparser.org/optiondesc.html#pproj-marking_strategy
TODO: implement also path and head+path strategies.
TODO: Sometimes it would be better (intuitively) to lower the gap-node (if its whole subtree is in the gap and if this does not cause more non-projectivities) rather than to lift several nodes whose parent-edge crosses this gap. We would need another label value (usually the lowering is of depth 1), but the advantage is that reconstruction of lowered edges during deprojectivization is simple and needs no heuristics.
-
class
udapi.block.transform.proj.
Proj
(strategy='head', lifting_order='deepest', label='misc', **kwargs)[source]¶ Bases:
udapi.core.block.Block
Projectivize the trees à la Nivre & Nilsson (2005).
tutorial.AddArticles block template.
-
class
udapi.block.tutorial.addarticles.
AddArticles
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Heuristically insert English articles.
tutorial.AddCommas block template.
tutorial.Adpositions block template.
Example usage:
for a in */sample.conllu; do
printf '%50s ' $a;
udapy tutorial.Adpositions < $a;
done | tee results.txt
# What are the English postpositions?
cat UD_English/sample.conllu | udapy -TM util.Mark node='node.upos == "ADP" and node.parent.precedes(node)' | less -R
tutorial.Parse block template.
Usage: udapy read.Conllu zone=gold files=sample.conllu read.Conllu zone=pred files=sample.conllu transform.Flatten zones=pred tutorial.Parse zones=pred eval.Parsing gold_zone=gold util.MarkDiff gold_zone=gold write.TextModeTreesHtml marked_only=1 files=parse-diff.html
-
class
udapi.block.tutorial.parse.
Parse
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Dependency parsing.
Block ud.bg.RemoveDotAfterAbbr deletes extra PUNCT nodes after abbreviations.
Usage: udapy -s ud.bg.RemoveDotAfterAbbr < in.conllu > fixed.conllu
Author: Martin Popel
-
class
udapi.block.ud.bg.removedotafterabbr.
RemoveDotAfterAbbr
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Block for deleting extra PUNCT nodes after abbreviations.
If an abbreviation is followed by an end-of-sentence period, most languages allow just one period. However, in some treebanks (e.g. UD_Bulgarian v1.4) two periods are annotated:
# text = 1948 г.
1 1948 1948 ADJ
2 г. г. NOUN
3 . . PUNCT
The problem is that the text comment does not match the word forms. In https://github.com/UniversalDependencies/docs/issues/410 it was decided that the least-wrong solution (and the most common in other treebanks) is to delete the end-of-sentence punctuation:
# text = 1948 г.
1 1948 1948 ADJ
2 г. г. NOUN
This block is not specific for Bulgarian, just that UD_Bulgarian is probably the only treebank where this transformation is needed.
Block ud.cs.AddMwt for heuristic detection of multi-word tokens.
-
class
udapi.block.ud.cs.addmwt.
AddMwt
(zones='all')[source]¶ Bases:
udapi.block.ud.addmwt.AddMwt
Detect and mark MWTs (split them into words and add the words to the tree).
Block ud.de.AddMwt for heuristic detection of German contractions.
According to the UD guidelines, contractions such as “am” = “an dem” should be annotated using multi-word tokens.
Notice that this should be used only for converting existing conllu files. Ideally a tokenizer should have already split the MWTs.
-
class
udapi.block.ud.de.addmwt.
AddMwt
(zones='all')[source]¶ Bases:
udapi.block.ud.addmwt.AddMwt
Detect and mark MWTs (split them into words and add the words to the tree).
Block ud.el.AddMwt for heuristic detection of multi-word (σε+DET) tokens.
Notice that this should be used only for converting existing conllu files.
Ideally a tokenizer should have already split the MWTs.
Also notice that this block does not deal with the relatively rare PRON(Person=2)+PRON(Person=3) MWTs (i.e. "σ'το" and "στο").
-
class
udapi.block.ud.el.addmwt.
AddMwt
(zones='all')[source]¶ Bases:
udapi.block.ud.addmwt.AddMwt
Detect and mark MWTs (split them into words and add the words to the tree).
Block ud.es.AddMwt for heuristic detection of Spanish contractions.
According to the UD guidelines, contractions such as “del” = “de el” should be annotated using multi-word tokens.
Note that this block should be used only for converting legacy conllu files. Ideally a tokenizer should have already split the MWTs.
-
class
udapi.block.ud.es.addmwt.
AddMwt
(verbpron=False, **kwargs)[source]¶ Bases:
udapi.block.ud.addmwt.AddMwt
Detect and mark MWTs (split them into words and add the words to the tree).
Block ud.fr.AddMwt for heuristic detection of French contractions.
According to the UD guidelines, contractions such as “des” = “de les” should be annotated using multi-word tokens.
Note that this block should be used only for converting legacy conllu files. Ideally a tokenizer should have already split the MWTs.
-
class
udapi.block.ud.fr.addmwt.
AddMwt
(zones='all')[source]¶ Bases:
udapi.block.ud.addmwt.AddMwt
Detect and mark MWTs (split them into words and add the words to the tree).
Block ud.ga.To2 UD_Irish-specific conversion of UDv1 to UDv2
Author: Martin Popel
-
class
udapi.block.ud.ga.to2.
To2
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Block for fixing the remaining cases (after ud.Convert1to2) in UD_Irish.
Block ud.gl.To2 UD_Galician-specific conversion of UDv1 to UDv2
Author: Martin Popel
-
class
udapi.block.ud.gl.to2.
To2
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Block for fixing the remaining cases (before ud.Convert1to2) in UD_Galician.
Block ud.he.FixNeg fixes the remaining cases of deprel=neg.
Author: Martin Popel
-
class
udapi.block.ud.he.fixneg.
FixNeg
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Block for fixing the remaining cases (after ud.Convert1to2) of deprel=neg in UD_Hebrew.
Block ud.pt.AddMwt for heuristic detection of Portuguese contractions.
According to the UD guidelines, contractions such as “dele” = “de ele” should be annotated using multi-word tokens.
Note that this block should be used only for converting legacy conllu files. Ideally a tokenizer should have already split the MWTs.
-
class
udapi.block.ud.pt.addmwt.
AddMwt
(zones='all')[source]¶ Bases:
udapi.block.ud.addmwt.AddMwt
Detect and mark MWTs (split them into words and add the words to the tree).
Block ud.ro.FixNeg ad-hoc fixes
Author: Martin Popel
-
class
udapi.block.ud.ro.fixneg.
FixNeg
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Block for fixing the remaining cases (after ud.Convert1to2) of deprel=neg in UD_Romanian.
Block ud.ro.SetSpaceAfter for heuristic setting of SpaceAfter=No in Romanian.
Usage:
udapy -s ud.ro.SetSpaceAfter < in.conllu > fixed.conllu
Author: Martin Popel
-
class
udapi.block.ud.ro.setspaceafter.
SetSpaceAfter
(not_after='¡¿([{„', not_before='., ;:!?}])', fix_text=True, **kwargs)[source]¶ Bases:
udapi.block.ud.setspaceafter.SetSpaceAfter
Block for heuristic setting of the SpaceAfter=No MISC attribute in Romanian.
Romanian uses many contractions, e.g.
raw      meaning   tokenized   lemmatized
n-ar     nu ar     n- ar       nu avea
să-i     să îi     să -i       să el
într-o   în o      într- o     întru un
nu-i     nu îi     nu -i       nu el
nu-i     nu e      nu -i       nu fi
Detokenization is quite simple: no space after a word-final hyphen and no space before a word-initial hyphen. There are just two exceptions I have found:
* “-” the hyphen itself (most probably it means a dash separating phrases/clauses)
* negative numbers, e.g. “-3,1”
Block ud.ru.FixRemnant ad-hoc fixes
Author: Martin Popel
-
class
udapi.block.ud.ru.fixremnant.
FixRemnant
(zones='all')[source]¶ Bases:
udapi.core.block.Block
ad-hoc fixing the remaining cases (after ud.Convert1to2) of deprel=remnant in UD_Russian.
Abstract base class ud.AddMwt for heuristic detection of multi-word tokens.
-
class
udapi.block.ud.addmwt.
AddMwt
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Detect and mark MWTs (split them into words and add the words to the tree).
multiword_analysis(node)[source]¶ Return a dict with MWT info or None if node does not represent a multiword token.
An example return value is:
{
    'form': 'aby bych',
    'lemma': 'aby být',
    'upos': 'SCONJ AUX',
    'xpos': 'J,------------- Vc-S---1-------',
    'feats': '_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin',  # _ means empty FEATS
    'deprel': '* aux',     # * means keep the original deprel
    'main': 0,             # which of the two words will inherit the original children (if any)
    'shape': 'siblings',   # the newly created nodes will be siblings, or alternatively
    # 'shape': 'subtree',  # the main-indexed node will be the head
}
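A hypothetical subclass showing how multiword_analysis can be overridden; the class name is invented and the Czech contraction “abych” = “aby bych” is used only as an illustration of the keys shown above:
from udapi.block.ud.addmwt import AddMwt

class MyAddMwt(AddMwt):
    """Sketch: split the Czech contraction 'abych' into 'aby' + 'bych'."""
    def multiword_analysis(self, node):
        if node.form.lower() == 'abych':
            return {
                'form': 'aby bych',
                'lemma': 'aby být',
                'upos': 'SCONJ AUX',
                'feats': '_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin',
                'deprel': '* aux',
                'main': 0,
                'shape': 'siblings',
            }
        return None  # not a multiword token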
-
Block ComplyWithText for adapting the nodes to comply with the text.
Implementation design details:
Usually, most of the inconsistencies between the tree tokens and the raw text are simple to solve. However, there may also be rare cases when it is not clear how to align the tokens (nodes in the tree) with the raw text (stored in root.text).
This block tries to solve the general case using several heuristics. It starts by running an LCS-like algorithm (LCS = longest common subsequence), difflib.SequenceMatcher, on the raw text and the concatenation of the tokens’ forms, i.e. on sequences of characters (as opposed to running LCS on sequences of tokens). To prevent mis-alignment problems, we keep the spaces present in the raw text and we insert spaces into the concatenated forms (tree_chars) according to SpaceAfter=No.
An example of a mis-alignment problem:
text “énfase na necesidade” with 4 nodes “énfase en a necesidade”
should be solved by adding multiword token “na” over the nodes “en” and “a”.
However, running LCS (or difflib) over the character sequences
“énfaseenanecesidade”
“énfasenanecesidade”
may result in énfase -> énfas.
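The mis-alignment can be reproduced with difflib directly (a small demonstration of the problem, not part of the block’s code):
from difflib import SequenceMatcher

forms = 'énfaseenanecesidade'  # concatenated forms without spaces
raw = 'énfasenanecesidade'     # raw text without spaces
matcher = SequenceMatcher(None, forms, raw, autojunk=False)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    print(op, repr(forms[i1:i2]), '->', repr(raw[j1:j2]))
# prints: equal 'énfas' -> 'énfas', delete 'e' -> '', equal 'enanecesidade' -> 'enanecesidade'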
Author: Martin Popel
-
class
udapi.block.ud.complywithtext.
ComplyWithText
(fix_text=True, prefer_mwt=True, allow_goeswith=True, max_mwt_length=4, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Adapt the nodes to comply with the text.
-
merge_diffs
(orig_diffs, char_nodes)[source]¶ Make sure each diff starts on original token boundary.
If not, merge the diff with the previous diff. E.g. (equal, “5”, “5”), (replace, “-6”, “–7”) is changed into (replace, “5-6”, “5–7”)
-
Block Convert1to2 for converting UD v1 to UD v2.
See http://universaldependencies.org/v2/summary.html for the description of all UD v2 changes. IMPORTANT: this code does only SOME of the changes and the output should be checked.
Note that this block is not idempotent, i.e. you should not apply it twice on the same data. It should be idempotent when skipping the coordination transformations (skip=coord).
Author: Martin Popel, based on https://github.com/UniversalDependencies/tools/tree/master/v2-conversion by Sebastian Schuster.
-
class
udapi.block.ud.convert1to2.
Convert1to2
(skip='', save_stats=True, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Block for converting UD v1 to UD v2.
-
HEAD_PROMOTION
= {'advcl': 1, 'advmod': 5, 'ccomp': 2, 'csubj': 4, 'iobj': 7, 'nsubj': 9, 'obj': 8, 'obl': 6, 'xcomp': 3}¶
-
change_deprel_simple
(node)[source]¶ mwe→fixed, dobj→obj, pass→:pass, name→flat, foreign→flat+Foreign=Yes.
-
change_feats
(node)[source]¶ Negative→Polarity, Aspect=Pro→Prosp, VerbForm=Trans→Conv, Definite=Red→Cons,…
Also Foreign=Foreign→Yes and log if Tense=Nar or NumType=Gen is used.
-
static
change_headfinal
(node, deprel)[source]¶ deprel=goeswith|flat|fixed|appos must be a head-initial flat structure.
-
change_neg
(node)[source]¶ neg→advmod/det/ToDo + Polarity=Neg.
In addition, if there is a node with deprel=neg and upos=INTJ, it is checked whether it is possibly a real interjection or a negation particle, which should have upos=PART (as documented in http://universaldependencies.org/u/pos/PART.html) This kind of error (INTJ instead of PART for “не”) is common e.g. in Bulgarian v1.4, but I hope the rule is language independent (enough to be included here).
-
fix_remnants_in_tree
(root)[source]¶ Change ellipsis with remnant deprels to UDv2 ellipsis with orphans.
Remnant’s parent is always the correlate (same-role) node. Usually, correlate’s parent is the head of the whole ellipsis subtree, i.e. the first conjunct. However, sometimes remnants are deeper, e.g. ‘Over 300 Iraqis are reported dead and 500 wounded.’ with edges:
nsubjpass(reported, Iraqis)
nummod(Iraqis, 300)
remnant(300, 500)
Let’s expect all remnants in one tree are part of the same ellipsis structure.
TODO: theoretically, there may be more ellipsis structures with remnants in one tree, but I have no idea how to distinguish them from the deeper-remnants cases.
-
static
is_nominal
(node)[source]¶ Returns ‘no’ (for predicates), ‘yes’ (sure nominals) or ‘maybe’.
Used in change_nmod.
-
static
is_verbal
(node)[source]¶ Returns True for verbs and nodes with copula child.
Used in change_neg.
-
log
(node, short_msg, long_msg)[source]¶ Log node.address() + long_msg and add ToDo=short_msg to node.misc.
-
process_tree
(tree)[source]¶ Apply all the changes on the current tree.
This method is automatically called on each tree by Udapi. After doing tree-scope changes (remnants), it calls process_node on each node. By overriding this method in subclasses you can reuse just some of the implemented changes.
-
Block ud.ExGoogle2ud converts data which were originally annotated in the Google style, then converted to UDv2 with an older version of ud.Google2ud, and then manually edited. We don’t want to lose these manual edits, so we cannot simply rerun the newer version of ud.Google2ud on the original Google data.
-
class
udapi.block.ud.exgoogle2ud.
ExGoogle2ud
(lang='unk', **kwargs)[source]¶ Bases:
udapi.core.block.Block
Convert former Google Universal Dependency Treebank into UD style.
Block ud.FixChain for making sure deprel=fixed|flat|goeswith|list does not form a chain.
-
class
udapi.block.ud.fixchain.
FixChain
(deprels='fixed, flat, goeswith, list', **kwargs)[source]¶ Bases:
udapi.core.block.Block
Make sure deprel=fixed etc. does not form a chain, but a flat structure.
Block ud.FixPunct for making sure punctuation is attached projectively.
Punctuation in Universal Dependencies has the tag PUNCT, dependency relation punct, and is always attached projectively, usually to the head of a neighboring subtree to its left or right. Punctuation normally does not have children. If it does, we will fix it first.
This block tries to re-attach punctuation projectively and according to the guidelines. It should help in cases where punctuation is attached randomly, always to the root or always to the neighboring word. However, there are limits to what it can do; for example it cannot always recognize whether a comma is introduced to separate the block to its left or to its right. Hence if the punctuation before running this block is almost good, the block may actually do more harm than good.
Since the punctuation should not have children, we should not create a non-projectivity if we check the root edges going to the right. However, it is still possible that we will attach the punctuation non-projectively by joining a non-projectivity that already exists. For example, the left neighbor (node i-1) may have its parent at i-3, and the node i-2 forms a gap (does not depend on i-3).
-
class
udapi.block.ud.fixpunct.
FixPunct
(**kwargs)[source]¶ Bases:
udapi.core.block.Block
Make sure punctuation nodes are attached projectively.
Block ud.FixPunctChild for making sure punctuation nodes have no children.
-
class
udapi.block.ud.fixpunctchild.
FixPunctChild
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Make sure punct nodes have no children by rehanging the children upwards.
Block ud.FixRightheaded for making sure flat,fixed,appos,goeswith,list is head initial.
Note that deprel=conj should also be left-headed, but it is not included in this fix-block by default because coordinations are more difficult to convert and one should use a specialized block instead.
-
class
udapi.block.ud.fixrightheaded.
FixRightheaded
(deprels='flat, fixed, appos, goeswith, list', **kwargs)[source]¶ Bases:
udapi.core.block.Block
Make sure deprel=flat,fixed,… form a head-initial (i.e. left-headed) structure.
Block GoeswithFromText for splitting nodes and attaching via goeswith according to the text.
Usage: udapy -s ud.GoeswithFromText < in.conllu > fixed.conllu
Author: Martin Popel
-
class
udapi.block.ud.goeswithfromtext.
GoeswithFromText
(keep_lemma=False, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Block for splitting nodes and attaching them via goeswith according to the sentence text.
For example:
# text = Never the less, I agree.
1 Nevertheless nevertheless ADV _ _ 4 advmod _ SpaceAfter=No
2 , , PUNCT _ _ 4 punct _ _
3 I I PRON _ _ 4 nsubj _ _
4 agree agree VERB _ _ 0 root _ SpaceAfter=No
5 . . PUNCT _ _ 4 punct _ _
is changed to:
# text = Never the less, I agree.
1 Never never ADV _ _ 6 advmod _ _
2 the the ADV _ _ 1 goeswith _ _
3 less less ADV _ _ 1 goeswith _ SpaceAfter=No
4 , , PUNCT _ _ 6 punct _ _
5 I I PRON _ _ 6 nsubj _ _
6 agree agree VERB _ _ 0 root _ SpaceAfter=No
7 . . PUNCT _ _ 6 punct _ _
If used with parameter keep_lemma=1, the result is:
# text = Never the less, I agree.
1 Never nevertheless ADV _ _ 6 advmod _ _
2 the _ ADV _ _ 1 goeswith _ _
3 less _ ADV _ _ 1 goeswith _ SpaceAfter=No
4 , , PUNCT _ _ 6 punct _ _
5 I I PRON _ _ 6 nsubj _ _
6 agree agree VERB _ _ 0 root _ SpaceAfter=No
7 . . PUNCT _ _ 6 punct _ _
Block ud.Google2ud for converting Google Universal Dependency Treebank into UD.
Usage: udapy -s ud.Google2ud < google.conllu > ud2.conllu
-
class
udapi.block.ud.google2ud.
Google2ud
(lang='unk', non_mwt_langs='ar en ja ko zh', **kwargs)[source]¶ Bases:
udapi.block.ud.convert1to2.Convert1to2
Convert Google Universal Dependency Treebank into UD style.
-
fix_deprel
(node)[source]¶ Convert Google dependency relations to UD deprels.
Change topology where needed.
-
static
fix_feats
(node)[source]¶ Remove language prefixes, capitalize names and values, apply FEATS_CHANGE.
-
fix_goeswith
(node)[source]¶ Solve deprel=goeswith which is almost always wrong in the Google annotation.
-
fix_multiword_prep
(node)[source]¶ Solve pobj/pcomp depending on pobj/pcomp.
Only some of these cases are multi-word prepositions (which should get deprel=fixed).
-
process_tree
(root)[source]¶ Apply all the changes on the current tree.
This method is automatically called on each tree by Udapi. After doing tree-scope changes (remnants), it calls process_node on each node. By overriding this method in subclasses you can reuse just some of the implemented changes.
-
Block ud.JoinAsMwt for creating multi-word tokens
if multiple neighboring words are not separated by a space and the boundaries between the word forms are alphabetical.
-
class
udapi.block.ud.joinasmwt.
JoinAsMwt
(revert_orig_form=True, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Create MWTs if words are not separated by a space.
Block MarkBugs for checking suspicious/wrong constructions in UD v2.
See http://universaldependencies.org/release_checklist.html#syntax and http://universaldependencies.org/svalidation.html IMPORTANT: the svalidation.html overview is not generated by this code, but by SETS-search-interface rules, which may give different results than this code.
Usage: udapy -s ud.MarkBugs < in.conllu > marked.conllu 2> log.txt
Errors are both logged to stderr and marked within the nodes’ MISC field, e.g. node.misc['Bug'] = 'aux-chain', so the output conllu file can be searched for “Bug=” occurrences.
Author: Martin Popel based on descriptions at http://universaldependencies.org/svalidation.html
-
class
udapi.block.ud.markbugs.
MarkBugs
(save_stats=True, tests=None, skip=None, max_cop_lemmas=2, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Block for checking suspicious/wrong constructions in UD v2.
Block ud.RemoveMwt for removing multi-word tokens.
-
class
udapi.block.ud.removemwt.
RemoveMwt
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Substitute MWTs with one word representing the whole MWT.
Block SetSpaceAfter for heuristic setting of SpaceAfter=No.
Usage: udapy -s ud.SetSpaceAfter < in.conllu > fixed.conllu
Author: Martin Popel
-
class
udapi.block.ud.setspaceafter.
SetSpaceAfter
(not_after='¡¿([{„', not_before='., ;:!?}])', fix_text=True, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Block for heuristic setting of the SpaceAfter=No MISC attribute.
-
static
is_goeswith_exception
(node)[source]¶ Is this node excepted from SpaceAfter=No because of the goeswith deprel?
Deprel=goeswith means that a space was (incorrectly) present in the original text, so we should not add SpaceAfter=No in these cases. We expect valid annotation of goeswith (no gaps, first token as head).
-
Block SetSpaceAfterFromText for setting of SpaceAfter=No according to the sentence text.
Usage: udapy -s ud.SetSpaceAfterFromText < in.conllu > fixed.conllu
Author: Martin Popel
-
class
udapi.block.ud.setspaceafterfromtext.
SetSpaceAfterFromText
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Block for setting of the SpaceAfter=No MISC attribute according to the sentence text.
Block ud.SplitUnderscoreTokens splits tokens with underscores and attaches the new nodes using deprel=flat.
Usage: udapy -s ud.SplitUnderscoreTokens < in.conllu > fixed.conllu
Author: Martin Popel
-
class
udapi.block.ud.splitunderscoretokens.
SplitUnderscoreTokens
(deprel=None, default_deprel='flat', **kwargs)[source]¶ Bases:
udapi.core.block.Block
Block for splitting tokens with underscores and attaching the new nodes using deprel=flat.
E.g.:
1 Hillary_Rodham_Clinton Hillary_Rodham_Clinton PROPN xpos 0 dep
is transformed into:
1 Hillary Hillary PROPN xpos 0 dep
2 Rodham Rodham PROPN xpos 1 flat
3 Clinton Clinton PROPN xpos 1 flat
Real-world use cases: UD_Irish (default_deprel=fixed) and UD_Czech-CLTT v1.4.
-
deprel_for
(node)[source]¶ Return deprel of the newly created nodes: flat, fixed, compound or its subtypes.
See http://universaldependencies.org/u/dep/flat.html http://universaldependencies.org/u/dep/fixed.html http://universaldependencies.org/u/dep/compound.html Note that unlike the first two, deprel=compound does not need to be head-initial.
This method implements coarse heuristic rules to decide between fixed and flat.
-
Eval is a special block for evaluating code given by parameters.
-
class
udapi.block.util.eval.
Eval
(doc=None, bundle=None, tree=None, node=None, start=None, end=None, before_doc=None, after_doc=None, before_bundle=None, after_bundle=None, expand_code=True, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Special block for evaluating code given by parameters.
Tricks: pp is a shortcut for pprint.pprint. $. is a shortcut for this., which is a shortcut for node., tree., etc., depending on the context. count_X is a shortcut for self.count[X], where X is any string (\S+) and self.count is a collections.Counter() instance. Thus you can use code like:
util.Eval node='count_$.upos +=1; count_"TOTAL" +=1' end="pp(self.count)"
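Roughly, the scenario above does the following (written here as plain Python for clarity; the upos list is a stand-in for node.upos of the processed nodes):
from collections import Counter
from pprint import pprint as pp

count = Counter()
for upos in ['NOUN', 'VERB', 'NOUN', 'ADP']:
    count[upos] += 1       # count_$.upos += 1
    count['TOTAL'] += 1    # count_"TOTAL" += 1
pp(count)                  # end="pp(self.count)"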
Filter is a special block for keeping/deleting subtrees specified by parameters.
-
class
udapi.block.util.filter.
Filter
(delete_tree=None, delete_tree_if_node=None, delete_subtree=None, keep_tree=None, keep_tree_if_node=None, keep_subtree=None, mark=None, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Special block for keeping/deleting subtrees specified by parameters.
Example usage from the command line:
# extract subtrees governed by nouns (noun phrases)
udapy -s util.Filter keep_subtree='node.upos == "NOUN"' < in.conllu > filtered.conllu
# keep only trees which contain ToDo|Bug nodes
udapy -s util.Filter keep_tree_if_node='re.match("ToDo|Bug", str(node.misc))' < in > filtered
# keep only non-projective trees, annotate non-projective edges with Mark=nonproj and show
udapy -T util.Filter keep_tree_if_node='node.is_nonprojective()' mark=nonproj < in | less -R
# delete trees which contain deprel=remnant
udapy -s util.Filter delete_tree_if_node='node.deprel == "remnant"' < in > filtered
# delete subtrees headed by a node with deprel=remnant
udapy -s util.Filter delete_subtree='node.deprel == "remnant"' < in > filtered
Block util.FindBug for debugging.
Usage:
If block xy.Z fails with a Python exception, insert “util.FindBug block=” into the scenario. E.g. to debug second.Block, use:
udapy first.Block util.FindBug block=second.Block > bug.conllu
This will create the file bug.conllu with the bundle which caused the bug.
-
class
udapi.block.util.findbug.
FindBug
(block, first_error_only=True, **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
Debug another block by finding a minimal testcase conllu file.
util.Mark is a special block for marking nodes specified by parameters.
-
class
udapi.block.util.mark.
Mark
(node, mark=1, add=True, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Mark nodes specified by parameters.
Example usage from the command line:
# see non-projective trees with non-projective edges highlighted
udapy -TM util.Mark node='node.is_nonprojective()' < in | less -R
util.MarkDiff is a special block for marking differences between parallel trees.
-
class
udapi.block.util.markdiff.
MarkDiff
(gold_zone, attributes='form, lemma, upos, xpos, deprel, feats, misc', mark=1, add=False, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Mark differences between parallel trees.
util.ResegmentGold is a block for sentence alignment and re-segmentation of two zones.
-
class
udapi.block.util.resegmentgold.
ResegmentGold
(gold_zone='gold', **kwargs)[source]¶ Bases:
udapi.core.block.Block
Sentence-align two zones (gold and pred) and resegment the pred zone.
The two zones must contain the same sequence of characters.
-
static
choose_root
(p_tree, g_tree)[source]¶ Prevent multiple roots, which are forbidden in the evaluation script.
-
Block util.See prints statistics about the nodes matching a given condition.
Example usage from the command line:
udapy util.See node='node.is_nonprojective()' n=3 stats=dir,children,c_upos,p_lemma,deprel,feats_split < in.conllu
Example output:
node.is_nonprojective() matches 245 out of 35766 nodes (0.7%) in 174 out of 1478 trees (11.8%)
=== dir (2 values) ===
right        193 78% delta=+37%
left          52 21% delta=-33%
=== children (9 values) ===
0             64 26% delta=-38%
2             58 23% delta=+14%
3             38 15% delta= +7%
=== c_upos (15 values) ===
NOUN         118 23% delta= +4%
DET           61 12% delta= -3%
PROPN         47  9% delta= +1%
=== p_lemma (187 values) ===
il             5  2% delta= +1%
fonction       4  1% delta= +1%
écrire         4  1% delta= +1%
=== deprel (22 values) ===
appos         41 16% delta=+15%
conj          41 16% delta=+13%
punct         36 14% delta= +4%
=== feats_split (20 values) ===
Number=Sing  114 21% delta= +2%
Gender=Masc   81 15% delta= +3%
_             76 14% delta= -6%
In addition to absolute counts for each value, the percentage within matching nodes is printed and a delta relative to percentage within all nodes. This helps to highlight what is special about the matching nodes.
-
class
udapi.block.util.see.
See
(node, n=5, stats='dir, edge, depth, children, siblings, p_upos, p_lemma, c_upos, form, lemma, upos, deprel, feats_split', **kwargs)[source]¶ Bases:
udapi.core.block.Block
Print statistics about the nodes specified by the parameter node.
util.Split is a special block for splitting documents.
-
class
udapi.block.util.split.
Split
(parts=None, bundles_per_doc=None, **kwargs)[source]¶ Bases:
udapi.core.basereader.BaseReader
Split Udapi document (with sentence-aligned trees in bundles) into several parts.
-
static
is_multizone_reader
()[source]¶ Can this reader read bundles which contain more zones?.
This implementation always returns True. If a subclass supports just one zone per file (e.g. read.Sentences), this method should be overridden to return False, so process_document can take advantage of this knowledge and optimize the reading (no buffer needed even if bundles_per_doc is specified).
-
Wc is a special block for printing statistics (word count etc).
Conllu class is a writer of files in the CoNLL-U format.
-
class
udapi.block.write.conllu.
Conllu
(print_sent_id=True, print_text=True, print_empty_trees=True, **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
A writer of files in the CoNLL-U format.
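For example, a plain-text file with one sentence per line can be turned into a (token-less) CoNLL-U file with a scenario like this (a sketch combining the reader and writer blocks documented here; file names are illustrative):
udapy read.Sentences write.Conllu < input.txt > output.conllu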
Html class is a writer for HTML+JavaScript+SVG visualization of dependency trees.
-
class
udapi.block.write.html.
Html
(path_to_js='web', **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
A writer for HTML+JavaScript+SVG visualization of dependency trees.
# from the command line
udapy write.Html < file.conllu > file.html
firefox file.html
For offline use, we first need to download three JavaScript libraries:
wget https://code.jquery.com/jquery-2.1.4.min.js
wget https://cdn.rawgit.com/eligrey/FileSaver.js/master/FileSaver.min.js
wget https://cdn.rawgit.com/ufal/js-treex-view/gh-pages/js-treex-view.js
udapy write.Html path_to_js=. < file.conllu > file.html
firefox file.html
This writer produces an html file with drawings of the dependency trees in the document (there are buttons for selecting which bundle will be shown). Under each node its form, upos and deprel are shown. In the tooltip its lemma and (morphological) features are shown. After clicking the node, all other attributes are shown. When hovering over a node, the respective word in the (plain text) sentence is highlighted. There is a button for downloading trees as SVG files.
Three JavaScript libraries are required (jquery, FileSaver and js-treex-view). By default they are linked online (so Internet access is needed when viewing), but they can be also downloaded locally (so offline browsing is possible and the loading is faster): see the Usage example above.
This block is based on Treex::View but takes a different approach. Treex::View depends on (older version of) Valence (Perl interface to Electron) and comes with a script view-treex, which takes a treex file, converts it to json behind the scenes (which is quite slow) and displays the json in a Valence window.
This block generates the json code directly to the html file, so it can be viewed with any browser or even published online. (Most of the html file is actually the json.)
When viewing the html file, the JavaScript library js-treex-view generates an svg on the fly from the json.
Sdparse class is a writer for Stanford dependencies format.
-
class
udapi.block.write.sdparse.
Sdparse
(print_upos=True, print_feats=False, always_ord=False, **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
A writer of files in the Stanford dependencies format, suitable for Brat visualization.
Usage:
udapy write.Sdparse print_upos=0 < in.conllu
Example output:
~~~ sdparse
Corriere Sport da pagina 23 a pagina 26
name(Corriere, Sport)
case(pagina-4, da)
nmod(Corriere, pagina-4)
nummod(pagina-4, 23)
case(pagina-7, a)
nmod(Corriere, pagina-7)
nummod(pagina-7, 26)
~~~
To visualize it, use embedded Brat, e.g. go to http://universaldependencies.org/visualization.html#editing. Click the edit button and paste the output of this writer excluding the ~~~ marks.
Notes: The original Stanford dependencies format allows explicit specification of the root dependency, e.g. root(ROOT-0, makes-8). However, this is not allowed by Brat, so this writer does not print it.
UD v2.0 allows tokens with spaces, but I am not aware of any Brat support.
Alternatives:
- write.Conllu: Brat recently supports also the CoNLL-U input
- write.TextModeTrees: may be more readable/useful in some use cases
- write.Html: ditto, press the “Save as SVG” button and convert to pdf
Sentences class is a writer for plain-text sentences.
-
class
udapi.block.write.sentences.
Sentences
(if_missing='detokenize', **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
A writer of plain-text sentences (one per line).
Usage: udapy write.Sentences if_missing=empty < my.conllu > my.txt
An ASCII pretty printer of dependency trees.
-
class
udapi.block.write.textmodetrees.
TextModeTrees
(print_sent_id=True, print_text=True, add_empty_line=True, indent=1, minimize_cross=True, color='auto', attributes='form, upos, deprel', print_undef_as='_', print_doc_meta=True, print_comments=False, mark='ToDo|ToDoOrigText|Bug|Mark', marked_only=False, hints=True, **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
An ASCII pretty printer of dependency trees.
# from the command line (visualize CoNLL-U files)
udapy write.TextModeTrees color=1 < file.conllu | less -R
In scenario (examples of other parameters):
write.TextModeTrees indent=1 print_sent_id=1 print_sentence=1
write.TextModeTrees zones=en,cs attributes=form,lemma,upos minimize_cross=0
This block prints dependency trees in plain-text format. For example the following CoNLL-U file (with tabs instead of spaces):
1 I I PRON PRP Number=Sing|Person=1 2 nsubj _ _
2 saw see VERB VBD Tense=Past 0 root _ _
3 a a DET DT Definite=Ind 4 det _ _
4 dog dog NOUN NN Number=Sing 2 dobj _ _
5 today today NOUN NN Number=Sing 2 nmod:tmod _ SpaceAfter=No
6 , , PUNCT , _ 2 punct _ _
7 which which DET WDT PronType=Rel 10 nsubj _ _
8 was be VERB VBD Person=3|Tense=Past 10 cop _ _
9 a a DET DT Definite=Ind 10 det _ _
10 boxer boxer NOUN NN Number=Sing 4 acl:relcl _ SpaceAfter=No
11 . . PUNCT . _ 2 punct _ _
will be printed (with the default parameters) as:
─┮ │ ╭─╼ I PRON nsubj ╰─┾ saw VERB root │ ╭─╼ a DET det ├────────────────────────┾ dog NOUN dobj ├─╼ today NOUN nmod:tmod │ ├─╼ , PUNCT punct │ │ │ ╭─╼ which DET nsubj │ │ ├─╼ was VERB cop │ │ ├─╼ a DET det │ ╰─┶ boxer NOUN acl:relcl ╰─╼ . PUNCT punct
Some non-projective trees cannot be printed without crossing edges. TextModeTrees uses a special “bridge” symbol ─╪─ to mark this:
─┮ │ ╭─╼ 1 ├─╪───┮ 2 ╰─┶ 3 │ ╰─╼ 4
By default the parameter color=auto, so if the output is printed to the console (not to a file or pipe), each node attribute is printed in a different color. If a given node’s MISC contains any of the ToDo, Bug or Mark attributes (or any other specified in the parameter mark), the node will be highlighted (by reversing the background and foreground colors).
This block’s method process_tree can be called on any node (not only the root), which is useful for printing subtrees using node.print_subtree(), which is internally implemented using this block.
SEE ALSO: TextModeTreesHtml
-
before_process_document
(document)[source]¶ Initialize ANSI colors if color is True or ‘auto’.
If color==’auto’, detect if sys.stdout is interactive (terminal, not redirected to a file).
-
An ASCII pretty printer of colored dependency trees in HTML.
-
class
udapi.block.write.textmodetreeshtml.
TextModeTreesHtml
(color=True, title='Udapi visualization', **kwargs)[source]¶ Bases:
udapi.block.write.textmodetrees.TextModeTrees
An ASCII pretty printer of colored dependency trees in HTML.
SYNOPSIS
# from the command line (visualize CoNLL-U files)
udapy write.TextModeTreesHtml < file.conllu > file.html
This block is a subclass of TextModeTrees, see its documentation for more info.
-
before_process_document
(document)[source]¶ Initialize ANSI colors if color is True or ‘auto’.
If color==’auto’, detect if sys.stdout is interactive (terminal, not redirected to a file).
-
Tikz class is a writer for LaTeX with tikz-dependency.
-
class
udapi.block.write.tikz.
Tikz
(print_sent_id=True, print_text=True, print_preambule=True, attributes='form, upos', **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
A writer of files in the LaTeX with tikz-dependency format.
Usage:
udapy write.Tikz < my.conllu > my.tex
pdflatex my.tex
xdg-open my.pdf
Long sentences may result in too large pictures. You can tune the width (in addition to changing the font size or using minipage and rescaling) with
\begin{deptext}[column sep=0.2cm]
or individually for each word: My \&[.5cm] dog \& etc.
By default, the height of the horizontal segment of a dependency edge is proportional to the distance between the linked words. You can tune the height with
\depedge[edge unit distance=1.5ex]{9}{1}{deprel}
See tikz-dependency documentation for details.
Alternatives:
* use write.TextModeTrees and include it in a verbatim environment in LaTeX
* use write.Html, press the “Save as SVG” button, convert to pdf and include it in LaTeX
write.Treex is a writer block for Treex XML (e.g. for TrEd editing).
-
class
udapi.block.write.treex.
Treex
(files='-', filehandle=None, docname_as_file=False, encoding='utf-8', newline='n', **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
A writer of files in the Treex format.
Vislcg class is a writer for the VISL-cg format.
-
class
udapi.block.write.vislcg.
Vislcg
(files='-', filehandle=None, docname_as_file=False, encoding='utf-8', newline='n', **kwargs)[source]¶ Bases:
udapi.core.basewriter.BaseWriter
A writer of files in the VISL-cg format, suitable for the VISL Constraint Grammar Parser.
See https://visl.sdu.dk/visl/vislcg-doc.html
Usage:
udapy write.Vislcg < in.conllu > out.vislcg
Example output:
"<Қыз>" "қыз" n nom @nsubj #1->3 "<оның>" "ол" prn pers p3 sg gen @nmod:poss #2->3 "<қарындасы>" "қарындас" n px3sp nom @parataxis #3->8 "е" cop aor p3 sg @cop #4->3 "<,>" "," cm @punct #5->8 "<ол>" "ол" prn pers p3 sg nom @nsubj #6->8 "<бес>" "бес" num @nummod #7->8 "<жаста>" "жас" n loc @root #8->0 "е" cop aor p3 sg @cop #9->8 "<.>" "." sent @punct #10->8
Example input:
# text = Қыз оның қарындасы, ол бес жаста.
1 Қыз қыз _ n nom 3 nsubj _ _
2 оның ол _ prn pers|p3|sg|gen 3 nmod:poss _ _
3-4 қарындасы _ _ _ _ _ _ _ _
3 қарындасы қарындас _ n px3sp|nom 8 parataxis _ _
4 _ е _ cop aor|p3|sg 3 cop _ _
5 , , _ cm _ 8 punct _ _
6 ол ол _ prn pers|p3|sg|nom 8 nsubj _ _
7 бес бес _ num _ 8 nummod _ _
8-9 жаста _ _ _ _ _ _ _ _
8 жаста жас _ n loc 0 root _ _
9 _ е _ cop aor|p3|sg 8 cop _ _
10 . . _ sent _ 8 punct _ _
-
class
udapi.block.zellig_harris.baseline.
Baseline
(args=None)[source]¶ Bases:
udapi.core.block.Block
A block for extracting context configurations for training verb representations using word2vecf.
-
get_word
(node)[source]¶ Format the correct string representation of the given node according to the block settings.
Parameters: node – An input node. Returns: The node’s string representation.
-
-
udapi.block.zellig_harris.common.
get_node_representation
(node, print_lemma=False)[source]¶ Transform the node into the proper textual representation, as will appear in the extracted contexts.
Parameters: - node – An input Node.
- print_lemma – If true, the node lemma is used, otherwise the node form.
Returns: A proper node textual representation for the contexts data.
-
class
udapi.block.zellig_harris.configurations.
Configurations
(args=None)[source]¶ Bases:
udapi.core.block.Block
An abstract class for four extracting scenarios.
-
apply_query
(query_id, node)[source]¶ A generic method for applying a specified query on a specified node.
Parameters: - query_id – A name of the query method to be called.
- node – An input node.
-
-
class
udapi.block.zellig_harris.csnouns.
CsNouns
(args=None)[source]¶ Bases:
udapi.block.zellig_harris.configurations.Configurations
A block for extracting context configurations for Czech nouns. The configurations will be used as the training data for obtaining the word representations using word2vecf.
-
class
udapi.block.zellig_harris.csverbs.
CsVerbs
(args=None)[source]¶ Bases:
udapi.block.zellig_harris.configurations.Configurations
A block for extracting context configurations for Czech verbs. The configurations will be used as the training data for obtaining the word representations using word2vecf.
-
class
udapi.block.zellig_harris.enhancedeps.
EnhanceDeps
(zones='all')[source]¶ Bases:
udapi.core.block.Block
Identify new relations between nodes in the dependency tree (an analogy of effective parents/children from PML). Add these new relations into secondary dependencies slot.
-
process_node
(node)[source]¶ Enhance secondary dependencies by applying the following rules:
1. When the current node A has deprel ‘conj’ to its parent B, create a new secondary dependence (B.parent, B.deprel) to A.
2. When the current node A has deprel ‘conj’ to its parent B, look at B’s children C; when C.deprel is in {subj, subjpass, iobj, dobj, compl} and there is no child D of A such that C.deprel == D.deprel, add a new secondary dependence (A, C.deprel) to C.
Parameters: node – A node to be processed.
-
-
udapi.block.zellig_harris.enhancedeps.
echildren
(node)[source]¶ Return a list with node’s effective children.
Parameters: node – An input node. Returns: A list with node’s effective children. Return type: list
-
udapi.block.zellig_harris.enhancedeps.
enhance_deps
(node, new_dependence)[source]¶ Add a new dependence to node.deps, but first check whether such a dependence is already present.
Parameters: - node – A node to be enhanced.
- new_dependence – A new dependence to be added into node.deps.
-
udapi.block.zellig_harris.enhancedeps.
eparent
(node)[source]¶ Return an effective parent for the given node.
The rule for the effective parent - when the current node A has a deprel ‘conj’ to its parent B, return B.parent, otherwise return A.parent.
Parameters: node – An input node. Returns: An effective parent. Return type: udapi.core.node.Node
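The rule can be sketched in a few lines (a hypothetical helper, not necessarily identical to the module’s implementation):
def eparent_sketch(node):
    """Effective parent: for a 'conj' node return its grandparent, otherwise its parent."""
    if node.deprel == 'conj':
        return node.parent.parent
    return node.parent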
-
class
udapi.block.zellig_harris.ennouns.
EnNouns
(args=None)[source]¶ Bases:
udapi.block.zellig_harris.configurations.Configurations
A block for extracting context configurations for English nouns.
The configurations will be used as the train data for obtaining the word representations using word2vecf.
-
class
udapi.block.zellig_harris.enverbs.
EnVerbs
(args=None)[source]¶ Bases:
udapi.block.zellig_harris.configurations.Configurations
A block for extracting context configurations for English verbs.
The configurations will be used as the train data for obtaining the word representations using word2vecf.
BaseReader is the base class for all reader blocks.
-
class
udapi.core.basereader.
BaseReader
(files='-', filehandle=None, zone='keep', bundles_per_doc=0, encoding='utf-8', sent_id_filter=None, split_docs=False, ignore_sent_id=False, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Base class for all reader blocks.
-
file_number
¶ Property with the current file number (1-based).
-
filehandle
¶ Property with the current file handle.
-
filename
¶ Property with the current filename.
-
filtered_read_tree
()[source]¶ Load and return one more tree matching the sent_id_filter.
This method uses read_tree() internally. This is the method called by process_document.
-
static
is_multizone_reader
()[source]¶ Can this reader read bundles which contain more zones?.
This implementation always returns True. If a subclass supports just one zone per file (e.g. read.Sentences), this method should be overridden to return False, so process_document can take advantage of this knowledge and optimize the reading (no buffer needed even if bundles_per_doc is specified).
-
BaseWriter is the base class for all writer blocks.
-
class
udapi.core.basewriter.
BaseWriter
(files='-', filehandle=None, docname_as_file=False, encoding='utf-8', newline='n', **kwargs)[source]¶ Bases:
udapi.core.block.Block
Base class for all writer blocks.
-
file_number
¶ Property with the current file number (1-based).
-
filename
¶ Property with the current filename.
-
Block class represents the basic Udapi processing unit.
Bundle class represents one sentence.
-
class
udapi.core.bundle.
Bundle
(bundle_id=None, document=None)[source]¶ Bases:
object
Bundle represents one sentence in a UD document.
A bundle contains one or more trees. More trees are needed e.g. in the case of parallel treebanks, where each tree represents a translation of the sentence into a different language. Trees in one bundle are distinguished by a zone label.
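A small sketch of iterating over bundles and zones (the attribute names follow this documentation; the load_conllu call and the input file name are assumptions used only for illustration):
from udapi.core.document import Document

doc = Document()
doc.load_conllu('parallel.conllu')   # assumed loader method and file name
for bundle in doc.bundles:
    for tree in bundle.trees:
        print(bundle.bundle_id, tree.zone, tree.text)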
-
bundle_id
¶ ID of this bundle.
-
number
¶
-
trees
¶
-
Document class is a container for UD trees.
DualDict is a dict with lazily synchronized string representation.
-
class
udapi.core.dualdict.
DualDict
(value=None, **kwargs)[source]¶ Bases:
collections.abc.MutableMapping
DualDict class serves as dict with lazily synchronized string representation.
>>> ddict = DualDict('Number=Sing|Person=1')
>>> ddict['Case'] = 'Nom'
>>> str(ddict)
'Case=Nom|Number=Sing|Person=1'
>>> ddict['NonExistent']
''
This class provides access to both
* a structured (dict-based, deserialized) representation, e.g. {'Number': 'Sing', 'Person': '1'}, and
* a string (serialized) representation of the mapping, e.g. Number=Sing|Person=1.
There is a clever mechanism that makes sure that users can read and write both of the representations which are always kept synchronized. Moreover, the synchronization is lazy, so the serialization and deserialization is done only when needed. This speeds up scenarios where access to dict is not needed.
A value can be deleted in any of the following three ways:
>>> del ddict['Case']
>>> ddict['Case'] = None
>>> ddict['Case'] = ''
and it works even if the value was already missing.
-
set_mapping
(value)[source]¶ Set the mapping from a dict or string.
If the value is None or an empty string, it is stored as the string _ (which is the CoNLL-U way of representing an empty value). If the value is a string, it is stored as is. If the value is a dict (or any instance of collections.abc.Mapping), its copy is stored. Other types of value raise a ValueError exception.
Feats class for storing morphological features of nodes in UD trees.
-
class
udapi.core.feats.
Feats
(value=None, **kwargs)[source]¶ Bases:
udapi.core.dualdict.DualDict
Feats class for storing morphological features of nodes in UD trees.
See http://universaldependencies.org/u/feat/index.html for the specification of possible feature names and values.
Files is a helper class for iterating over filenames.
-
class
udapi.core.files.
Files
(filenames=None, filehandle=None, encoding='utf-8')[source]¶ Bases:
object
Helper class for iterating over filenames.
It is used e.g. in udapi.core.basereader (as self.files = Files(filenames=pattern)).
The constructor takes various arguments:
>>> files = Files(['file1.txt', 'file2.txt'])       # list of filenames, or
>>> files = Files('file1.txt,file2.txt')            # comma- or space-separated filenames in a string
>>> files = Files('file1.txt,file2.txt.gz')         # supports automatic decompression of gz, xz, bz2
>>> files = Files('@my.filelist !dir??/file*.txt')  # @ marks a filelist, ! marks a wildcard pattern
The @filelist and !wildcard conventions are used in several other tools, e.g. 7z or javac.
Usage:
>>> while (True):
>>>     filename = files.next_filename()
>>>     if filename is None:
>>>         break
>>>     ...
or
>>> filehandle = files.next_filehandle()
-
filename
¶ Property with the current file name.
-
next_filehandle
()[source]¶ Go to the next file and return its filehandle or None (meaning no more files).
-
next_filename
()[source]¶ Go to the next file and return its filename or None (meaning no more files).
-
number_of_files
¶ Property with the total number of files.
-
string_to_filenames
(string)[source]¶ Parse a pattern string (e.g. ‘!dir??/file*.txt’) and return a list of matching filenames.
If the string starts with ! it is interpreted as a shell wildcard pattern. If it starts with @ it is interpreted as a filelist with one file per line. The string can contain multiple filenames (or '!' and '@' patterns) separated by spaces or commas. For specifying files with spaces or commas in their names, you need to use wildcard patterns or a '@' filelist. (But preferably don't use such filenames.)
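For illustration (the file and directory names below are hypothetical):
>>> files = Files()
>>> files.string_to_filenames('a.conllu,b.conllu')             # ['a.conllu', 'b.conllu']
>>> files.string_to_filenames('!dir??/file*.txt @my.filelist') # wildcard matches plus the filelist contents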
MWT class represents a multi-word token.
Node class and related classes and functions.
In addition to class Node, this module contains class ListOfNodes and function find_minimal_common_treelet.
-
class
udapi.core.node.
ListOfNodes
(iterable, origin)[source]¶ Bases:
list
Helper class for results of node.children and node.descendants.
Python distinguishes properties, e.g. node.form … no brackets, and methods, e.g. node.remove() … brackets necessary. It is useful (and expected by Udapi users) to use properties, so one can do e.g. node.form += “suffix”. It is questionable whether node.parent, node.root, node.children etc. should be properties or methods. The problem of methods is that if users forget the brackets, the error may remain unnoticed because the result is interpreted as a method reference. The problem of properties is that they cannot have any parameters. However, we would like to allow e.g. node.children(add_self=True).
This class solves the problem: node.children and node.descendants are properties which return instances of this class ListOfNodes. The class implements the method __call__, so one can use e.g.:
>>> nodes = node.children
>>> nodes = node.children()
>>> nodes = node.children(add_self=True, following_only=True)
-
class
udapi.core.node.
Node
(form=None, lemma=None, upos=None, xpos=None, feats=None, deprel=None, misc=None)[source]¶ Bases:
object
Class for representing nodes in Universal Dependency trees.
Attributes form, lemma, upos, xpos and deprel are public attributes of type str, so you can use e.g. node.lemma = node.form.
node.ord is an int-type public attribute storing the node's word-order index, but assigning to it should be done with care, so that the non-root nodes have ords 1, 2, 3, ... It is recommended to use one of the node.shift_* methods for reordering nodes.
For changing the dependency structure (topology) of the tree, there is the parent property, e.g. node.parent = node.parent.parent, and the node.create_child() method. Properties node.children and node.descendants return an object of type ListOfNodes, so it is possible to do e.g.:
>>> all_children = node.children
>>> left_children = node.children(preceding_only=True)
>>> right_descendants = node.descendants(following_only=True, add_self=True)
Properties node.feats and node.misc return objects of type DualDict, so one can do e.g.:
>>> node = Node()
>>> str(node.feats)
'_'
>>> node.feats = {'Case': 'Nom', 'Person': '1'}
>>> node.feats = 'Case=Nom|Person=1'    # equivalent to the above
>>> node.feats['Case']
'Nom'
>>> node.feats['NonExistent']
''
>>> node.feats['Case'] = 'Gen'
>>> str(node.feats)
'Case=Gen|Person=1'
>>> dict(node.feats)
{'Case': 'Gen', 'Person': '1'}
Handling of enhanced dependencies, multi-word tokens and the node's other methods are described below.
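A minimal sketch putting these pieces together (it assumes root.create_child() accepts the same keyword arguments as the Node constructor):
>>> from udapi.core.root import Root
>>> root = Root()
>>> dog = root.create_child(form='dog', upos='NOUN', deprel='root')
>>> the = root.create_child(form='the', upos='DET', deprel='det')
>>> the.parent = dog                # re-attach 'the' under 'dog' via the parent property
>>> the.shift_before_subtree(dog)   # and move it before 'dog' in the word order
>>> root.compute_text()
'the dog'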
-
address
()[source]¶ Return full (document-wide) id of the node.
For non-root nodes, the general address format is: node.bundle.bundle_id + ‘/’ + node.root.zone + ‘#’ + node.ord, e.g. s123/en_udpipe#4. If zone is empty, the slash is excluded as well, e.g. s123#4.
-
children
¶ Return a list of dependency children (direct dependants) nodes.
The returned nodes are sorted by their ord. Note that node.children is a property, not a method, so if you want all the children of a node (excluding the node itself), you should not use node.children(), but just node.children. However, the returned result is a callable list, so you can use
>>> nodes1 = node.children(add_self=True)
>>> nodes2 = node.children(following_only=True)
>>> nodes3 = node.children(preceding_only=True)
>>> nodes4 = node.children(preceding_only=True, add_self=True)
as a shortcut for
>>> nodes1 = sorted([node] + node.children, key=lambda n: n.ord)
>>> nodes2 = [n for n in node.children if n.ord > node.ord]
>>> nodes3 = [n for n in node.children if n.ord < node.ord]
>>> nodes4 = [n for n in node.children if n.ord < node.ord] + [node]
See documentation of ListOfNodes for details.
-
compute_text
(use_mwt=True)[source]¶ Return a string representing this subtree’s text (detokenized).
Compute the string by concatenating forms of nodes (words and multi-word tokens) and joining them with a single space, unless the node has SpaceAfter=No in its misc. If called on root this method returns a string suitable for storing in root.text (but it is not stored there automatically).
Technical details: If called on root, the root’s form (<ROOT>) is not included in the string. If called on non-root nodeA, nodeA’s form is included in the string, i.e. internally descendants(add_self=True) is used. Note that if the subtree is non-projective, the resulting string may be misleading.
Args: use_mwt: consider multi-word tokens? (default=True)
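For example (a sketch reusing the Root API documented below; create_child keyword arguments are assumed to mirror the Node constructor):
>>> from udapi.core.root import Root
>>> root = Root()
>>> hi = root.create_child(form='Hi', misc='SpaceAfter=No')
>>> excl = root.create_child(form='!')
>>> root.compute_text()
'Hi!'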
-
deprel
¶
-
deps
¶ Return enhanced dependencies as a Python list of dicts.
On the first access to the enhanced dependencies, the raw data are deserialized and the resulting list is cached in deps.
-
descendants
¶ Return a list of all descendants of the current node.
The returned nodes are sorted by their ord. Note that node.descendants is a property, not a method, so if you want all the descendants of a node (excluding the node itself), you should not use node.descendants(), but just node.descendants. However, the returned result is a callable list, so you can use
>>> nodes1 = node.descendants(add_self=True)
>>> nodes2 = node.descendants(following_only=True)
>>> nodes3 = node.descendants(preceding_only=True)
>>> nodes4 = node.descendants(preceding_only=True, add_self=True)
as a shortcut for
>>> nodes1 = sorted([node] + node.descendants, key=lambda n: n.ord)
>>> nodes2 = [n for n in node.descendants if n.ord > node.ord]
>>> nodes3 = [n for n in node.descendants if n.ord < node.ord]
>>> nodes4 = [n for n in node.descendants if n.ord < node.ord] + [node]
See documentation of ListOfNodes for details.
-
feats
¶ Property for morphological features stored as a Feats object.
Reading: You can access node.feats as a dict, e.g. if node.feats['Case'] == 'Nom'. Features which are not set return an empty string (not None, not KeyError), so you can safely use e.g. if node.feats['MyExtra'].find('substring') != -1. You can also obtain the string representation of the whole FEATS (suitable for CoNLL-U), e.g. if node.feats == 'Case=Nom|Person=1'.
Writing: All the following assignment types are supported:
>>> node.feats['Case'] = 'Nom'
>>> node.feats = {'Case': 'Nom', 'Person': '1'}
>>> node.feats = 'Case=Nom|Person=1'
>>> node.feats = '_'
The last line has the same result as assigning None or an empty string to node.feats.
For details about the implementation and other methods (e.g. node.feats.is_plural()), see
udapi.core.feats.Feats
which is a subclass of DualDict.
-
form
¶
-
get_attrs
(attrs, undefs=None, stringify=True)[source]¶ Return multiple attributes or pseudo-attributes, possibly substituting empty ones.
Pseudo-attributes:
p_xy is the (pseudo) attribute xy of the parent node.
c_xy is a list of the (pseudo) attributes xy of the children nodes.
l_xy is the (pseudo) attribute xy of the previous (left in LTR languages) node.
r_xy is the (pseudo) attribute xy of the following (right in LTR languages) node.
dir: 'left' = the node is a left child of its parent, 'right' = the node is a right child of its parent, 'root' = the node's parent is the technical root.
edge: length of the edge to the parent (node.ord - node.parent.ord), or 0 if the parent is the root.
children: number of children nodes.
siblings: number of sibling nodes.
depth: depth in the dependency tree (the technical root has depth=0, the highest word has depth=1).
feats_split: list of name=value formatted strings of the FEATS.
Args:
attrs: A list of attribute names, e.g. ['form', 'lemma', 'p_upos'].
undefs: A value to be used instead of None for empty (undefined) values.
stringify: Apply str() on each value (except for None).
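For illustration, a hedged sketch (the node is assumed to be attached somewhere in a tree; the returned values depend on that tree):
>>> values = node.get_attrs(['form', 'lemma', 'p_upos'], undefs='_')   # three strings, '_' substituted for empty values
>>> form, parent_upos, direction = node.get_attrs(['form', 'p_upos', 'dir'])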
-
is_nonprojective
()[source]¶ Is the node attached to its parent non-projectively?
Is there at least one node between (word-order-wise) this node and its parent that is not dominated by the parent? For higher speed, the actual implementation does not find the node(s) which cause(s) the gap. It only checks the number of parent’s descendants in the span and the total number of nodes in the span.
-
is_nonprojective_gap
()[source]¶ Is the node causing a non-projective gap within another node’s subtree?
Is there at least one node X such that
- this node is not a descendant of X, but
- this node is within the span of X, i.e. it is between (word-order-wise) X's leftmost descendant (or X itself) and X's rightmost descendant (or X itself)?
-
static
is_root
()[source]¶ Is the current node a (technical) root?
Returns False for all Node instances, irrespective of whether they have a parent or not. True is returned only by instances of udapi.core.root.Root.
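In other words:
>>> node.is_root()        # False for any Node instance
False
>>> node.root.is_root()   # True only for the technical root (a udapi.core.root.Root instance)
True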
-
lemma
¶
-
misc
¶ Property for MISC attributes stored as a DualDict object.
Reading: You can access node.misc as a dict, e.g. if node.misc['SpaceAfter'] == 'No'. Attributes which are not set return an empty string (not None, not KeyError), so you can safely use e.g. if node.misc['MyExtra'].find('substring') != -1. You can also obtain the string representation of the whole MISC (suitable for CoNLL-U), e.g. if node.misc == 'SpaceAfter=No|X=Y'.
Writing: All the following assignment types are supported:
>>> node.misc['SpaceAfter'] = 'No'
>>> node.misc = {'SpaceAfter': 'No', 'X': 'Y'}
>>> node.misc = 'SpaceAfter=No|X=Y'
>>> node.misc = '_'
The last line has the same result as assigning None or an empty string to node.misc.
For details about the implementation, see
udapi.core.dualdict.DualDict
.
-
multiword_token
¶ Return the multi-word token which includes this node, or None.
If this node represents a (syntactic) word which is part of a multi-word token, this method returns the corresponding instance of udapi.core.mwt.MWT. If this node is not part of any multi-word token, this method returns None.
-
next_node
¶ Return the following node according to word order.
-
no_space_after
¶ Boolean property as a shortcut for node.misc[“SpaceAfter”] == “No”.
-
ord
¶
-
parent
¶ Return dependency parent (head) node.
-
prev_node
¶ Return the previous node according to word order.
-
print_subtree
(**kwargs)[source]¶ Print ASCII visualization of the dependency structure of this subtree.
This method is useful for debugging. Internally, udapi.block.write.textmodetrees.TextModeTrees is used for the printing. All keyword arguments of this method are passed to its constructor, so you can use e.g.:
files: to redirect sys.stdout to a file
indent: to have wider trees
attributes: to override the default list 'form,upos,deprel'
See TextModeTrees for details and other parameters.
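For example, a typical debugging call might look like this (the attribute list is just one possible choice):
>>> node.print_subtree(attributes='form,upos,deprel,feats', indent=2)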
-
raw_deps
¶ String serialization of enhanced dependencies as stored in CoNLL-U files.
When the raw enhanced dependencies are accessed, they are re-serialized from the deps list if it has already been deserialized.
-
remove
(children=None)[source]¶ Delete this node and all its descendants.
Args:
children: a string specifying what to do if the node has any children.
- None (default): delete them (and all their descendants).
- rehang: re-attach those children to the parent of the removed node.
- warn: issue a warning if any children are present and delete them.
- rehang_warn: rehang and warn :-).
-
root
¶ Return the (technical) root node of the whole tree.
-
sdeprel
¶ Return the language-specific part of dependency relation.
E.g. if deprel = acl:relcl then sdeprel = relcl. If deprel = acl then sdeprel is an empty string. If deprel is None then node.sdeprel will return None as well.
-
shift
(reference_node, after=0, move_subtree=0, reference_subtree=0)[source]¶ Internal method for changing word order.
-
shift_after_subtree
(reference_node, without_children=0)[source]¶ Shift this node (and its subtree) after the subtree rooted by reference_node.
Args: without_children: shift just this node without its subtree?
-
shift_before_subtree
(reference_node, without_children=0)[source]¶ Shift this node (and its subtree) before the subtree rooted by reference_node.
Args: without_children: shift just this node without its subtree?
-
udeprel
¶ Return the universal part of dependency relation, e.g. acl instead of acl:relcl.
So you can write node.udeprel instead of node.deprel.split(‘:’)[0].
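Together with sdeprel documented above:
>>> node.deprel = 'acl:relcl'
>>> node.udeprel
'acl'
>>> node.sdeprel
'relcl'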
-
upos
¶
-
xpos
¶
-
-
udapi.core.node.
find_minimal_common_treelet
(*args)[source]¶ Find the smallest tree subgraph containing all nodes provided in args.
>>> from udapi.core.node import find_minimal_common_treelet
>>> (nearest_common_ancestor, _) = find_minimal_common_treelet(nodeA, nodeB)
>>> nodes = [nodeA, nodeB, nodeC]
>>> (nca, added_nodes) = find_minimal_common_treelet(*nodes)
There always exists exactly one such tree subgraph (aka treelet). This function returns a tuple (root, added_nodes), where root is the root of the minimal treelet and added_nodes is an iterator of the nodes that had to be added to the input nodes to form the treelet. The input nodes should not contain any node twice.
Utilities for downloading models and other resources.
Root class represents the technical root node in each tree.
-
class
udapi.core.root.
Root
(zone=None, comment='', text=None, newpar=None, newdoc=None)[source]¶ Bases:
udapi.core.node.Node
Class for representing root nodes (technical roots) in UD trees.
-
address
()[source]¶ Full (document-wide) id of the root.
The general address format of root nodes is: root.bundle.bundle_id + '/' + root.zone, e.g. s123/en_udpipe. If the zone is empty, the slash is excluded as well, e.g. s123. If the bundle is missing (which could occur during loading), '?' is used instead. The root's address is stored in CoNLL-U files as sent_id (in a special comment).
-
bundle
¶ Return the bundle which this tree belongs to.
-
comment
¶
-
create_multiword_token
(words=None, form=None, misc=None)[source]¶ Create and return a new multi-word token (MWT) in this tree.
The new MWT can be optionally initialized using the following args.
Args:
words: a list of nodes which are part of the new MWT
form: a string representing the surface form of the new MWT
misc: the misc attribute of the new MWT
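A sketch creating the Spanish contraction del = de + el (given a tree root, and assuming create_child keyword arguments mirror the Node constructor):
>>> de = root.create_child(form='de', upos='ADP')
>>> el = root.create_child(form='el', upos='DET')
>>> mwt = root.create_multiword_token(words=[de, el], form='del')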
-
descendants
¶ Return a list of all descendants of the current node.
The nodes are sorted by their ord. This root-specific implementation returns all the nodes in the tree except the root itself.
-
empty_nodes
¶
-
get_sentence
(if_missing='detokenize')[source]¶ Return either the stored root.text or (if None) root.compute_text().
Args: if_missing: What to do if root.text is None? (default=detokenize)
- detokenize: use root.compute_text() to compute the sentence.
- empty: return an empty string
- warn_detokenize, warn_empty: in addition emit a warning via logging.warning()
- fatal: raise an exception
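For example:
>>> sentence = root.get_sentence()                    # root.text, or compute_text() if it is None
>>> sentence = root.get_sentence(if_missing='fatal')  # raise an exception instead of detokenizing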
-
is_descendant_of
(node)[source]¶ Is the current node a descendant of the node given as argument?
This root-specific implementation always returns False.
-
json
¶
-
multiword_tokens
¶ Return a list of all multi-word tokens in this tree.
-
newdoc
¶
-
newpar
¶
-
parent
¶ Return dependency parent (head) node.
This root-specific implementation always returns None.
-
remove
(children=None)[source]¶ Remove the whole tree from its bundle.
Args:
children: a string specifying what to do if the root has any children.
- None (default): delete them (and all their descendants).
- warn: issue a warning.
-
sent_id
¶ ID of this tree, stored in the sent_id comment in CoNLL-U.
-
shift
(reference_node, after=0, move_subtree=0, reference_subtree=0)[source]¶ Attempts at changing the word order of the root result in an exception.
-
text
¶
-
token_descendants
¶ Return all tokens (one-word or multi-word) in the tree.
I.e. return a list of core.Node and core.MWT instances whose forms create the raw sentence, skipping nodes which are part of multi-word tokens.
For example with:
1-2  vámonos  _
1    vamos    ir
2    nos      nosotros
3-4  al       _
3    a        a
4    el       el
5    mar      mar
[n.form for n in root.token_descendants] will return ['vámonos', 'al', 'mar'].
-
zone
¶ Return zone (string label) of this tree.
-
Class Run parses a scenario and executes it.