Welcome to Udapi’s documentation!

Udapi is a framework providing an API for processing Universal Dependencies data.

Installation

You need Python 3.3 or higher, pip3 and git.

Let’s clone the git repo to ~/udapi-python/, install the dependencies and set up $PATH and $PYTHONPATH accordingly:

cd
git clone https://github.com/udapi/udapi-python.git
pip3 install --user -r udapi-python/requirements.txt
echo '## Use Udapi from ~/udapi-python/ ##'                >> ~/.bashrc
echo 'export PATH="$HOME/udapi-python/bin:$PATH"'          >> ~/.bashrc
echo 'export PYTHONPATH="$HOME/udapi-python/:$PYTHONPATH"' >> ~/.bashrc
source ~/.bashrc # or open new bash
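
To check that the installation works, you can load a CoNLL-U file via the Python API. This is just a quick sanity check (my.conllu is a placeholder for any CoNLL-U file you have at hand):

# sanity_check.py -- print the id and length of each sentence in my.conllu
from udapi.core.document import Document

doc = Document()
doc.load_conllu('my.conllu')
for bundle in doc.bundles:
    for tree in bundle.trees:
        print(tree.sent_id, len(tree.descendants), 'words')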

API Documentation

udapi package


Sub-modules

udapi

udapi package
Subpackages
udapi.block package
Subpackages
udapi.block.demo package
Submodules
udapi.block.demo.rehangprepositions module

RehangPrepositions demo block.

class udapi.block.demo.rehangprepositions.RehangPrepositions(zones='all')[source]

Bases: udapi.core.block.Block

This block takes all prepositions (upos=ADP) and rehangs them above their parent.

process_node(node)[source]

Process a UD node
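
For illustration, the core of such a block can be sketched as follows (a simplified re-implementation written for this documentation, not necessarily the exact code shipped in the demo):

from udapi.core.block import Block

class RehangPrepositionsSketch(Block):
    """Rehang each preposition (upos=ADP) above its original parent."""

    def process_node(self, node):
        if node.upos == 'ADP' and not node.parent.is_root():
            orig_parent = node.parent
            # first attach the ADP to its grandparent, so no cycle is created
            node.parent = orig_parent.parent
            # then hang the original parent under the ADP
            orig_parent.parent = node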

Module contents
udapi.block.eval package
Submodules
udapi.block.eval.conll17 module

Block and script eval.Conll17 for evaluating LAS, UAS, etc. as in the CoNLL 2017 UD shared task.

This is a reimplementation of the CoNLL2017 shared task official evaluation script, http://universaldependencies.org/conll17/evaluation.html

The gold trees and predicted (system-output) trees need to be sentence-aligned e.g. using util.ResegmentGold. Unlike in eval.Parsing, the gold and predicted trees can have different tokenization.

An example usage and output:

$ udapy read.Conllu zone=gold files=gold.conllu \
        read.Conllu zone=pred files=pred.conllu ignore_sent_id=1 \
        eval.Conll17
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Words      |     27.91 |     52.17 |     36.36 |    100.00
UPOS       |     27.91 |     52.17 |     36.36 |    100.00
XPOS       |     27.91 |     52.17 |     36.36 |    100.00
Feats      |     27.91 |     52.17 |     36.36 |    100.00
Lemma      |     27.91 |     52.17 |     36.36 |    100.00
UAS        |     16.28 |     30.43 |     21.21 |     58.33
LAS        |     16.28 |     30.43 |     21.21 |     58.33
CLAS       |     10.34 |     16.67 |     12.77 |     37.50

For evaluating multiple systems and testsets (as in CoNLL 2017) stored in systems/system_name/testset_name.conllu you can use:

#!/bin/bash
SYSTEMS=`ls systems`
[[ $# -ne 0 ]] && SYSTEMS=$@
set -x
set -e
for sys in $SYSTEMS; do
    mkdir -p results/$sys
    for testset in `ls systems/$sys`; do
        udapy read.Conllu zone=gold files=gold/$testset \
              read.Conllu zone=pred files=systems/$sys/$testset ignore_sent_id=1 \
              util.ResegmentGold \
              eval.Conll17 print_results=0 print_raw=1 \
              > results/$sys/${testset%.conllu}
    done
done
python3 `python3 -c 'import udapi.block.eval.conll17 as x; print(x.__file__)'` -r 100

The last line executes this block as a script and computes bootstrap resampling with 100 resamples (the default is 1000; keep the default or a higher value unless you are just testing the interface). This prints the ranking, confidence intervals (95% by default) and p-values for each pair of systems with neighboring ranks. If the difference in LAS between two such systems is significant (according to a paired bootstrap test, by default if p < 0.05), a separator line is printed between them.

The output looks like:

 1.          Stanford 76.17 ± 0.12 (76.06 .. 76.30) p=0.001
------------------------------------------------------------
 2.              C2L2 74.88 ± 0.12 (74.77 .. 75.01) p=0.001
------------------------------------------------------------
 3.               IMS 74.29 ± 0.13 (74.16 .. 74.43) p=0.001
------------------------------------------------------------
 4.          HIT-SCIR 71.99 ± 0.14 (71.84 .. 72.12) p=0.001
------------------------------------------------------------
 5.           LATTICE 70.81 ± 0.13 (70.67 .. 70.94) p=0.001
------------------------------------------------------------
 6.        NAIST-SATO 70.02 ± 0.13 (69.89 .. 70.16) p=0.001
------------------------------------------------------------
 7.    Koc-University 69.66 ± 0.13 (69.52 .. 69.79) p=0.002
------------------------------------------------------------
 8.   UFAL-UDPipe-1-2 69.36 ± 0.13 (69.22 .. 69.49) p=0.001
------------------------------------------------------------
 9.            UParse 68.75 ± 0.14 (68.62 .. 68.89) p=0.003
------------------------------------------------------------
10.     Orange-Deskin 68.50 ± 0.13 (68.37 .. 68.62) p=0.448
11.          TurkuNLP 68.48 ± 0.14 (68.34 .. 68.62) p=0.029
------------------------------------------------------------
12.              darc 68.29 ± 0.13 (68.16 .. 68.42) p=0.334
13.  conll17-baseline 68.25 ± 0.14 (68.11 .. 68.38) p=0.003
------------------------------------------------------------
14.             MQuni 67.93 ± 0.13 (67.80 .. 68.06) p=0.062
15.             fbaml 67.78 ± 0.13 (67.65 .. 67.91) p=0.283
16.     LyS-FASTPARSE 67.73 ± 0.13 (67.59 .. 67.85) p=0.121
17.        LIMSI-LIPN 67.61 ± 0.14 (67.47 .. 67.75) p=0.445
18.             RACAI 67.60 ± 0.13 (67.46 .. 67.72) p=0.166
19.     IIT-Kharagpur 67.50 ± 0.14 (67.36 .. 67.64) p=0.447
20.           naistCL 67.49 ± 0.15 (67.34 .. 67.63)

TODO: Bootstrap currently reports only LAS, but all the other measures could be added as well.

class udapi.block.eval.conll17.Conll17(gold_zone='gold', print_raw=False, print_results=True, **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

Evaluate labeled and unlabeled attachment score (LAS and UAS).

process_end()[source]

A hook method that is executed after processing all UD data

process_tree(tree)[source]

Process a UD tree

udapi.block.eval.conll17.main()[source]
udapi.block.eval.conll17.prec_rec_f1(correct, pred, gold, alig=0)[source]
udapi.block.eval.f1 module

Block eval.F1 for evaluating differences between sentences with P/R/F1.

eval.F1 zones=en_pred gold_zone=en_gold details=0 prints something like:

predicted =     210
gold      =     213
correct   =     210
precision = 100.00%
recall    =  98.59%
F1        =  99.29%

eval.F1 gold_zone=y attributes=form,upos focus='(?i:an?|the)_DET' details=4 prints something like:

=== Details ===
token       pred  gold  corr   prec     rec      F1
the_DET      711   213   188  26.44%  88.26%  40.69%
The_DET       82    25    19  23.17%  76.00%  35.51%
a_DET          0    62     0   0.00%   0.00%   0.00%
an_DET         0    16     0   0.00%   0.00%   0.00%
=== Totals ===
predicted =     793
gold      =     319
correct   =     207
precision =  26.10%
recall    =  64.89%
F1        =  37.23%

This block finds differences between nodes of trees in two zones and reports the overall precision, recall and F1. The two zones are “predicted” (the zone this block is applied to) and “gold” (which needs to be specified with the parameter gold_zone).

This block also reports the total number of nodes in the predicted zone and in the gold zone, and the number of “correct” nodes, i.e. predicted nodes which are also present in the gold zone. By default, two nodes are considered “the same” if they have the same form, but other node attributes can be checked as well (with the parameter attributes).

As usual:

precision = correct / predicted
recall = correct / gold
F1 = 2 * precision * recall / (precision + recall)

The implementation is based on finding the longest common subsequence (LCS) between the nodes in the two trees. This means that the two zones do not need to be explicitly word-aligned.
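
To illustrate the idea outside of Udapi, the following standalone sketch computes P/R/F1 from the LCS of two token sequences (an illustration of the formulas above, not the block’s actual implementation):

def lcs_length(pred, gold):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(gold) + 1)
    for p in pred:
        curr = [0]
        for j, g in enumerate(gold, 1):
            curr.append(prev[j - 1] + 1 if p == g else max(prev[j], curr[-1]))
        prev = curr
    return prev[-1]

def prf1(pred_tokens, gold_tokens):
    """Precision, recall and F1 of pred_tokens against gold_tokens."""
    correct = lcs_length(pred_tokens, gold_tokens)
    precision = correct / len(pred_tokens) if pred_tokens else 0.0
    recall = correct / len(gold_tokens) if gold_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# prf1("the dog barks".split(), "a dog barks".split()) returns approximately (0.67, 0.67, 0.67)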

class udapi.block.eval.f1.F1(gold_zone, attributes='form', focus=None, details=4, **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

Evaluate differences between sentences (in different zones) with P/R/F1.

Args:

zones: Which zone contains the “predicted” trees?
    Make sure you specify just one zone. If you keep the default value “all” and the document contains more zones, the results will be mixed, which is most likely not what you wanted. Exception: if the document contains just two zones (predicted and gold trees), you can keep the default value “all”, because this block skips the comparison of the gold zone with itself.

gold_zone: Which zone contains the gold-standard trees?

attributes: Comma-separated list of attributes which should be checked when deciding whether two nodes are equivalent in the LCS.

focus: Regular expression constraining the tokens we are interested in.
    If more attributes were specified in the attributes parameter, their values are concatenated with underscores, so focus should reflect that, e.g. attributes=form,upos focus='(a|the)_DET'. For a case-insensitive focus use e.g. focus='(?i)the' (which is equivalent to focus='[Tt][Hh][Ee]').

details: Print also detailed statistics for each token (matching the focus).
    The value of this parameter specifies the number of tokens to include. The tokens are sorted according to the sum of their predicted and gold counts.
process_end()[source]

A hook method that is executed after processing all UD data

process_tree(tree)[source]

Process a UD tree

udapi.block.eval.f1.find_lcs(x, y)[source]

Find longest common subsequence.

udapi.block.eval.parsing module

Block eval.Parsing for evaluating UAS and LAS - gold and pred must have the same tokens.

class udapi.block.eval.parsing.Parsing(gold_zone, **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

Evaluate labeled and unlabeled attachment score (LAS and UAS).

process_end()[source]

A hook method that is executed after processing all UD data

process_tree(tree)[source]

Process a UD tree
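
Because the tokens are required to be identical, UAS and LAS boil down to a node-by-node comparison. A minimal standalone sketch (not the block’s actual code; it assumes two udapi trees with the same tokenization and compares the universal part of the deprel via node.udeprel):

def uas_las(gold_tree, pred_tree):
    """Return (UAS, LAS) for two trees with identical tokenization."""
    gold_nodes = gold_tree.descendants
    pred_nodes = pred_tree.descendants
    assert len(gold_nodes) == len(pred_nodes)
    same_head, same_head_and_deprel = 0, 0
    for gold, pred in zip(gold_nodes, pred_nodes):
        if gold.parent.ord == pred.parent.ord:
            same_head += 1
            if gold.udeprel == pred.udeprel:
                same_head_and_deprel += 1
    total = len(gold_nodes)
    return same_head / total, same_head_and_deprel / total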

Module contents
udapi.block.newspeak package
Submodules
udapi.block.newspeak.prevele module
Module contents
udapi.block.read package
Submodules
udapi.block.read.addsentences module

AddSentences class is a reader for adding plain-text sentences.

class udapi.block.read.addsentences.AddSentences(zone='', into='text', **kwargs)[source]

Bases: udapi.core.basereader.BaseReader

A reader for adding plain-text sentences (one sentence per line) files.

The sentences are added to existing trees. This is useful, e.g., when the original raw texts are stored in a separate file:

cat in.conllu | udapy -s read.Conllu read.AddSentences files=in.txt > merged.conllu

static is_multizone_reader()[source]

Can this reader read bundles which contain more zones?

This implementation always returns False.

process_document(document)[source]

Process a UD document

udapi.block.read.conllu module

Conllu is a reader block for the CoNLL-U files.

class udapi.block.read.conllu.Conllu(strict=False, separator='tab', empty_parent='warn', attributes='ord, form, lemma, upos, xpos, feats, head, deprel, deps, misc', **kwargs)[source]

Bases: udapi.core.basereader.BaseReader

A reader of the CoNLL-U files.

static parse_comment_line(line, root)[source]

Parse one line of CoNLL-U and fill sent_id, text, newpar, newdoc in root.

read_tree()[source]

Load one (more) tree from self.files and return its root.

This method must be overridden in all readers. Usually it is the only method that needs to be implemented. The implementation in this base class raises NotImplementedError.

udapi.block.read.sentences module

Sentences class is a reader for plain-text sentences.

class udapi.block.read.sentences.Sentences(files='-', filehandle=None, zone='keep', bundles_per_doc=0, encoding='utf-8', sent_id_filter=None, split_docs=False, ignore_sent_id=False, **kwargs)[source]

Bases: udapi.core.basereader.BaseReader

A reader for plain-text sentences (one sentence per line) files.

static is_multizone_reader()[source]

Can this reader read bundles which contain more zones?

This implementation always returns False.

read_tree(document=None)[source]

Load one (more) tree from self.files and return its root.

This method must be overridden in all readers. Usually it is the only method that needs to be implemented. The implementation in this base class raises NotImplementedError.
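
As an illustration of the reader API, a minimal plain-text reader could be sketched like this (a simplified version written for this documentation, not the actual read.Sentences code):

from udapi.core.basereader import BaseReader
from udapi.core.root import Root

class MySentences(BaseReader):
    """Read one sentence per line and store it into root.text."""

    def read_tree(self, document=None):
        line = self.filehandle.readline()
        if line == '':
            return None        # no more trees in this file
        root = Root()
        root.text = line.rstrip('\n')
        return root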

udapi.block.read.vislcg module

Vislcg is a reader block for the VISL-cg format.

class udapi.block.read.vislcg.Vislcg(files='-', filehandle=None, zone='keep', bundles_per_doc=0, encoding='utf-8', sent_id_filter=None, split_docs=False, ignore_sent_id=False, **kwargs)[source]

Bases: udapi.core.basereader.BaseReader

A reader of the VISL-cg format, suitable for the VISL Constraint Grammar Parser.

read_tree()[source]

Load one (more) tree from self.files and return its root.

This method must be overridden in all readers. Usually it is the only method that needs to be implemented. The implementation in this base class raises NotImplementedError.

Module contents
udapi.block.tokenize package
Submodules
udapi.block.tokenize.onwhitespace module

Block tokenize.OnWhitespace

class udapi.block.tokenize.onwhitespace.OnWhitespace(zones='all')[source]

Bases: udapi.core.block.Block

Base tokenizer: splits on whitespace and fills SpaceAfter=No.

process_tree(root)[source]

Process a UD tree

static tokenize_sentence(string)[source]

A method to be overridden in subclasses.
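
A subclass only needs to return a list of token strings; the base class’s process_tree then builds the nodes and fills SpaceAfter=No. For example (a hypothetical tokenizer, not part of Udapi):

from udapi.block.tokenize.onwhitespace import OnWhitespace

class OnWhitespaceAndColons(OnWhitespace):
    """Hypothetical tokenizer which also splits off word-final colons."""

    @staticmethod
    def tokenize_sentence(string):
        tokens = []
        for token in string.split():
            if len(token) > 1 and token.endswith(':'):
                # detach the trailing colon as a separate token
                tokens.extend([token[:-1], ':'])
            else:
                tokens.append(token)
        return tokens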

udapi.block.tokenize.simple module

Block tokenize.Simple

class udapi.block.tokenize.simple.Simple(zones='all')[source]

Bases: udapi.block.tokenize.onwhitespace.OnWhitespace

Simple tokenizer: splits on whitespace and punctuation, fills SpaceAfter=No.

static tokenize_sentence(string)[source]

A method to be overridden in subclasses.

Module contents
udapi.block.transform package
Submodules
udapi.block.transform.deproj module

Block Deproj for deprojectivization of pseudo-projective trees à la Nivre & Nilsson (2005).

See ud.transform.Proj for details. TODO: implement also path and head+path strategies.

class udapi.block.transform.deproj.Deproj(strategy='head', label='misc', **kwargs)[source]

Bases: udapi.core.block.Block

De-projectivize the trees à la Nivre & Nilsson (2005).

head_strategy(node, label)[source]
process_node(node)[source]

Process a UD node

udapi.block.transform.flatten module

transform.Flatten block for flattening trees.

class udapi.block.transform.flatten.Flatten(zones='all')[source]

Bases: udapi.core.block.Block

Apply node.parent = node.root; node.deprel = 'root' on all nodes.

process_node(node)[source]

Process a UD node
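
The class docstring above essentially is the implementation; written out as a block, it amounts to (a sketch, not necessarily the verbatim source):

from udapi.core.block import Block

class FlattenSketch(Block):
    """Attach every node directly to the technical root with deprel=root."""

    def process_node(self, node):
        node.parent = node.root
        node.deprel = 'root'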

udapi.block.transform.proj module

Block Proj for (pseudo-)projectivization of trees à la Nivre & Nilsson (2005).

See http://www.aclweb.org/anthology/P/P05/P05-1013.pdf. This block tries to replicate Malt parser’s projectivization: http://www.maltparser.org/userguide.html#singlemalt_proj http://www.maltparser.org/optiondesc.html#pproj-marking_strategy

TODO: implement also path and head+path strategies.

TODO: Sometimes it would be better (intuitively) to lower the gap-node (if its whole subtree is in the gap and if this does not cause more non-projectivities) rather than to lift several nodes whose parent-edge crosses this gap. We would need another label value (usually the lowering is of depth 1), but the advantage is that reconstruction of lowered edges during deprojectivization is simple and needs no heuristics.

class udapi.block.transform.proj.Proj(strategy='head', lifting_order='deepest', label='misc', **kwargs)[source]

Bases: udapi.core.block.Block

Projectivize the trees à la Nivre & Nilsson (2005).

lift(node)[source]
mark(node, label)[source]
nonproj_info(node)[source]
process_tree(tree)[source]

Process a UD tree

Module contents
udapi.block.tutorial package
Submodules
udapi.block.tutorial.addarticles module

tutorial.AddArticles block template.

class udapi.block.tutorial.addarticles.AddArticles(zones='all')[source]

Bases: udapi.core.block.Block

Heuristically insert English articles.

process_node(node)[source]

Process a UD node
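
This is a tutorial template, so the heuristics are left for the reader to implement. One possible (very rough) starting point, shown here only as a hypothetical sketch:

from udapi.core.block import Block

class AddArticlesSketch(Block):
    """Hypothetical heuristic: insert "the" before common nouns without a determiner."""

    def process_node(self, node):
        if node.upos == 'NOUN' and not any(child.udeprel == 'det' for child in node.children):
            article = node.create_child(form='the', lemma='the', upos='DET', deprel='det')
            article.shift_before_subtree(node)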

udapi.block.tutorial.addcommas module

tutorial.AddCommas block template.

class udapi.block.tutorial.addcommas.AddCommas(zones='all')[source]

Bases: udapi.core.block.Block

Heuristically insert nodes for missing commas.

process_node(node)[source]

Process a UD node

should_add_comma_before(node)[source]
udapi.block.tutorial.adpositions module

tutorial.Adpositions block template.

Example usage:

for a in */sample.conllu; do
   printf '%50s ' $a;
   udapy tutorial.Adpositions < $a;
done | tee results.txt

# What are the English postpositions?
cat UD_English/sample.conllu | udapy -TM util.Mark    node='node.upos == "ADP" and node.parent.precedes(node)' | less -R
class udapi.block.tutorial.adpositions.Adpositions(**kwargs)[source]

Bases: udapi.core.block.Block

Compute the number of prepositions and postpositions.

process_end()[source]

A hook method that is executed after processing all UD data

process_node(node)[source]

Process a UD node
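
A possible solution sketch (only an illustration; the tutorial expects you to write your own version):

from udapi.core.block import Block

class AdpositionsSketch(Block):
    """Count prepositions vs. postpositions (ADP before vs. after its parent)."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.prepositions = 0
        self.postpositions = 0

    def process_node(self, node):
        if node.upos == 'ADP':
            if node.precedes(node.parent):
                self.prepositions += 1
            else:
                self.postpositions += 1

    def process_end(self):
        total = self.prepositions + self.postpositions or 1
        print('prepositions:  %d (%.1f%%)' % (self.prepositions, 100 * self.prepositions / total))
        print('postpositions: %d (%.1f%%)' % (self.postpositions, 100 * self.postpositions / total))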

udapi.block.tutorial.parse module

tutorial.Parse block template.

Usage:

udapy read.Conllu zone=gold files=sample.conllu \
      read.Conllu zone=pred files=sample.conllu \
      transform.Flatten zones=pred \
      tutorial.Parse zones=pred \
      eval.Parsing gold_zone=gold \
      util.MarkDiff gold_zone=gold \
      write.TextModeTreesHtml marked_only=1 files=parse-diff.html

class udapi.block.tutorial.parse.Parse(zones='all')[source]

Bases: udapi.core.block.Block

Dependency parsing.

process_tree(root)[source]

Process a UD tree

Module contents
udapi.block.ud package
Subpackages
udapi.block.ud.bg package
Submodules
udapi.block.ud.bg.removedotafterabbr module

Block ud.bg.RemoveDotAfterAbbr deletes extra PUNCT nodes after abbreviations.

Usage: udapy -s ud.bg.RemoveDotAfterAbbr < in.conllu > fixed.conllu

Author: Martin Popel

class udapi.block.ud.bg.removedotafterabbr.RemoveDotAfterAbbr(zones='all')[source]

Bases: udapi.core.block.Block

Block for deleting extra PUNCT nodes after abbreviations.

If an abbreviation is followed by an end-of-sentence period, most languages allow just one period. However, in some treebanks (e.g. UD_Bulgarian v1.4) two periods are annotated:

# text = 1948 г.
1  1948  1948  ADJ
2  г.    г.    NOUN
3  .     .     PUNCT

The problem is that the text comment does not match the word forms. In https://github.com/UniversalDependencies/docs/issues/410 it was decided that the least-wrong solution (and the most common in other treebanks) is to delete the end-of-sentence punctuation:

# text = 1948 г.
1  1948  1948  ADJ
2  г.    г.    NOUN

This block is not specific to Bulgarian; UD_Bulgarian just happens to be (probably) the only treebank where this transformation is needed.

process_tree(root)[source]

Process a UD tree

Module contents
udapi.block.ud.cs package
Submodules
udapi.block.ud.cs.addmwt module

Block ud.cs.AddMwt for heuristic detection of multi-word tokens.

class udapi.block.ud.cs.addmwt.AddMwt(zones='all')[source]

Bases: udapi.block.ud.addmwt.AddMwt

Detect and mark MWTs (split them into words and add the words to the tree).

multiword_analysis(node)[source]

Return a dict with MWT info or None if node does not represent a multiword token.

postprocess_mwt(mwt)[source]

Optional postprocessing of newly created MWTs.

Module contents
udapi.block.ud.de package
Submodules
udapi.block.ud.de.addmwt module

Block ud.de.AddMwt for heuristic detection of German contractions.

According to the UD guidelines, contractions such as “am” = “an dem” should be annotated using multi-word tokens.

Notice that this should be used only for converting existing conllu files. Ideally a tokenizer should have already split the MWTs.

class udapi.block.ud.de.addmwt.AddMwt(zones='all')[source]

Bases: udapi.block.ud.addmwt.AddMwt

Detect and mark MWTs (split them into words and add the words to the tree).

multiword_analysis(node)[source]

Return a dict with MWT info or None if node does not represent a multiword token.

Module contents
udapi.block.ud.el package
Submodules
udapi.block.ud.el.addmwt module

Block ud.el.AddMwt for heuristic detection of multi-word (σε+DET) tokens.

Notice that this should be used only for converting existing conllu files. Ideally a tokenizer should have already split the MWTs. Also notice that this block does not deal with the relatively rare PRON(Person=2)+'*+PRON(Person=3, i.e. "σ'το" and "στο") MWTs.

class udapi.block.ud.el.addmwt.AddMwt(zones='all')[source]

Bases: udapi.block.ud.addmwt.AddMwt

Detect and mark MWTs (split them into words and add the words to the tree).

multiword_analysis(node)[source]

Return a dict with MWT info or None if node does not represent a multiword token.

Module contents
udapi.block.ud.es package
Submodules
udapi.block.ud.es.addmwt module

Block ud.es.AddMwt for heuristic detection of Spanish contractions.

According to the UD guidelines, contractions such as “del” = “de el” should be annotated using multi-word tokens.

Note that this block should be used only for converting legacy conllu files. Ideally a tokenizer should have already split the MWTs.

class udapi.block.ud.es.addmwt.AddMwt(verbpron=False, **kwargs)[source]

Bases: udapi.block.ud.addmwt.AddMwt

Detect and mark MWTs (split them into words and add the words to the tree).

multiword_analysis(node)[source]

Return a dict with MWT info or None if node does not represent a multiword token.

postprocess_mwt(mwt)[source]

Optional postprocessing of newly created MWTs.

Module contents
udapi.block.ud.fr package
Submodules
udapi.block.ud.fr.addmwt module

Block ud.fr.AddMwt for heuristic detection of French contractions.

According to the UD guidelines, contractions such as “des” = “de les” should be annotated using multi-word tokens.

Note that this block should be used only for converting legacy conllu files. Ideally a tokenizer should have already split the MWTs.

class udapi.block.ud.fr.addmwt.AddMwt(zones='all')[source]

Bases: udapi.block.ud.addmwt.AddMwt

Detect and mark MWTs (split them into words and add the words to the tree).

multiword_analysis(node)[source]

Return a dict with MWT info or None if node does not represent a multiword token.

postprocess_mwt(mwt)[source]

Optional postprocessing of newly created MWTs.

Module contents
udapi.block.ud.ga package
Submodules
udapi.block.ud.ga.to2 module

Block ud.ga.To2 for UD_Irish-specific conversion of UDv1 to UDv2.

Author: Martin Popel

class udapi.block.ud.ga.to2.To2(zones='all')[source]

Bases: udapi.core.block.Block

Block for fixing the remaining cases (after ud.Convert1to2) in UD_Irish.

process_node(node)[source]

Process a UD node

Module contents
udapi.block.ud.gl package
Submodules
udapi.block.ud.gl.to2 module

Block ud.gl.To2 for UD_Galician-specific conversion of UDv1 to UDv2.

Author: Martin Popel

class udapi.block.ud.gl.to2.To2(zones='all')[source]

Bases: udapi.core.block.Block

Block for fixing the remaining cases (before ud.Convert1to2) in UD_Galician.

process_node(node)[source]

Process a UD node

Module contents
udapi.block.ud.he package
Submodules
udapi.block.ud.he.fixneg module

Block ud.he.FixNeg fixes the remaining cases of deprel=neg.

Author: Martin Popel

class udapi.block.ud.he.fixneg.FixNeg(zones='all')[source]

Bases: udapi.core.block.Block

Block for fixing the remaining cases (after ud.Convert1to2) of deprel=neg in UD_Hebrew.

process_node(node)[source]

Process a UD node

Module contents
udapi.block.ud.pt package
Submodules
udapi.block.ud.pt.addmwt module

Block ud.pt.AddMwt for heuristic detection of Portuguese contractions.

According to the UD guidelines, contractions such as “dele” = “de ele” should be annotated using multi-word tokens.

Note that this block should be used only for converting legacy conllu files. Ideally a tokenizer should have already split the MWTs.

class udapi.block.ud.pt.addmwt.AddMwt(zones='all')[source]

Bases: udapi.block.ud.addmwt.AddMwt

Detect and mark MWTs (split them into words and add the words to the tree).

multiword_analysis(node)[source]

Return a dict with MWT info or None if node does not represent a multiword token.

Module contents
udapi.block.ud.ro package
Submodules
udapi.block.ud.ro.fixneg module

Block ud.ro.FixNeg for ad-hoc fixes.

Author: Martin Popel

class udapi.block.ud.ro.fixneg.FixNeg(zones='all')[source]

Bases: udapi.core.block.Block

Block for fixing the remaining cases (after ud.Convert1to2) of deprel=neg in UD_Romanian.

process_node(node)[source]

Process a UD node

udapi.block.ud.ro.setspaceafter module

Block ud.ro.SetSpaceAfter for heuristic setting of SpaceAfter=No in Romanian.

Usage:

udapy -s ud.ro.SetSpaceAfter < in.conllu > fixed.conllu

Author: Martin Popel

class udapi.block.ud.ro.setspaceafter.SetSpaceAfter(not_after='¡¿([{„', not_before='., ;:!?}])', fix_text=True, **kwargs)[source]

Bases: udapi.block.ud.setspaceafter.SetSpaceAfter

Block for heuristic setting of the SpaceAfter=No MISC attribute in Romanian.

Romanian uses many contractions, e.g.:

raw     meaning   tokenized   lemmatized
n-ar    nu ar     n- ar       nu avea
să-i    să îi     să -i       să el
într-o  în o      într- o     întru un
nu-i    nu îi     nu -i       nu el
nu-i    nu e      nu -i       nu fi

Detokenization is quite simple: no space after a word-final hyphen and no space before a word-initial hyphen. There are just two exceptions I have found:

* “-” the hyphen itself (most probably a dash separating phrases/clauses)
* negative numbers, e.g. “-3,1”

process_tree(root)[source]

Process a UD tree

Module contents
udapi.block.ud.ru package
Submodules
udapi.block.ud.ru.fixremnant module

Block ud.ru.FixRemnant for ad-hoc fixes.

Author: Martin Popel

class udapi.block.ud.ru.fixremnant.FixRemnant(zones='all')[source]

Bases: udapi.core.block.Block

Ad-hoc fixes for the remaining cases (after ud.Convert1to2) of deprel=remnant in UD_Russian.

process_node(node)[source]

Process a UD node

Module contents
Submodules
udapi.block.ud.addmwt module

Abstract base class ud.AddMwt for heuristic detection of multi-word tokens.

class udapi.block.ud.addmwt.AddMwt(zones='all')[source]

Bases: udapi.core.block.Block

Detect and mark MWTs (split them into words and add the words to the tree).

multiword_analysis(node)[source]

Return a dict with MWT info or None if node does not represent a multiword token.

An example return value is:

{
    'form':   'aby bych',
    'lemma':  'aby být',
    'upos':   'SCONJ AUX',
    'xpos':   'J,------------- Vc-S---1-------',
    'feats':  '_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin',  # _ means empty FEATS
    'deprel': '* aux',        # * means keep the original deprel
    'main': 0,                # which of the two words will inherit the original children (if any)
    'shape': 'siblings',      # the newly created nodes will be siblings, or alternatively
    # 'shape': 'subtree',     # the main-indexed node will be the head
}
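
A subclass handling just this one contraction (Czech “abych” = “aby bych”) could thus look like this (an illustrative fragment written for this documentation, not the actual ud.cs.AddMwt):

from udapi.block.ud.addmwt import AddMwt

class AddMwtAbych(AddMwt):
    """Illustrative subclass: split the Czech contraction "abych" into "aby bych"."""

    def multiword_analysis(self, node):
        if node.form.lower() != 'abych':
            return None
        return {
            'form': 'aby bych',
            'lemma': 'aby být',
            'upos': 'SCONJ AUX',
            'feats': '_ Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin',
            'deprel': '* aux',
            'main': 0,
            'shape': 'siblings',
        }  # 'xpos' left out for brevity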

postprocess_mwt(mwt)[source]

Optional postprocessing of newly created MWTs.

process_node(node)[source]

Process a UD node

udapi.block.ud.complywithtext module

Block ComplyWithText for adapting the nodes to comply with the text.

Implementation design details: Usually, most of the inconsistencies between tree tokens and the raw text are simple to solve. However, there may be also rare cases when it is not clear how to align the tokens (nodes in the tree) with the raw text (stored in root.text). This block tries to solve the general case using several heuristics.

It starts with running a LCS-like algorithm (LCS = longest common subsequence) difflib.SequenceMatcher on the raw text and concatenation of tokens’ forms, i.e. on sequences of characters (as opposed to running LCS on sequences of tokens).

To prevent mis-alignment problems, we keep the spaces present in the raw text and we insert spaces into the concatenated forms (tree_chars) according to SpaceAfter=No. An example of a mis-alignment problem: text “énfase na necesidade” with 4 nodes “énfase en a necesidade” should be solved by adding multiword token “na” over the nodes “en” and “a”. However, running LCS (or difflib) over the character sequences “énfaseenanecesidade” “énfasenanecesidade” may result in énfase -> énfas.

Author: Martin Popel
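
The alignment primitive itself is the standard library’s difflib; the “énfase na necesidade” example can be reproduced as follows (an illustration of the alignment step only, not the block’s actual code):

import difflib

text = 'énfase na necesidade'            # raw text (root.text)
tree_chars = 'énfase en a necesidade'    # concatenated forms, with spaces kept/inserted
matcher = difflib.SequenceMatcher(None, tree_chars, text, autojunk=False)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    print(op, repr(tree_chars[i1:i2]), '->', repr(text[j1:j2]))
# the non-equal opcodes pinpoint the characters where forms and text diverge (around the MWT "na")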

class udapi.block.ud.complywithtext.ComplyWithText(fix_text=True, prefer_mwt=True, allow_goeswith=True, max_mwt_length=4, **kwargs)[source]

Bases: udapi.core.block.Block

Adapt the nodes to comply with the text.

static allow_space(form)[source]

Is space allowed within this token form?

merge_diffs(orig_diffs, char_nodes)[source]

Make sure each diff starts on original token boundary.

If not, merge the diff with the previous diff. E.g. (equal, "5", "5"), (replace, "-6", "–7") is changed into (replace, "5-6", "5–7").

process_tree(root)[source]

Process a UD tree

solve_diff(nodes, form)[source]

Fix a given (minimal) tokens-vs-text inconsistency.

solve_diffs(diffs, tree_chars, char_nodes, text)[source]
static store_orig_form(node, new_form)[source]

Store the original form of this node into MISC, unless the change is common&expected.

unspace_diffs(orig_diffs, tree_chars, text)[source]
udapi.block.ud.convert1to2 module

Block Convert1to2 for converting UD v1 to UD v2.

See http://universaldependencies.org/v2/summary.html for the description of all UD v2 changes. IMPORTANT: this code does only SOME of the changes and the output should be checked.

Note that this block is not idempotent, i.e. you should not apply it twice on the same data. It should be idempotent when skipping the coordination transformations (skip=coord).

Author: Martin Popel, based on https://github.com/UniversalDependencies/tools/tree/master/v2-conversion by Sebastian Schuster.

class udapi.block.ud.convert1to2.Convert1to2(skip='', save_stats=True, **kwargs)[source]

Bases: udapi.core.block.Block

Block for converting UD v1 to UD v2.

HEAD_PROMOTION = {'advcl': 1, 'advmod': 5, 'ccomp': 2, 'csubj': 4, 'iobj': 7, 'nsubj': 9, 'obj': 8, 'obl': 6, 'xcomp': 3}
after_process_document(document)[source]

Print overall statistics of ToDo counts.

change_deprel_simple(node)[source]

mwe→fixed, dobj→obj, pass→:pass, name→flat, foreign→flat+Foreign=Yes.

change_feats(node)[source]

Negative→Polarity, Aspect=Pro→Prosp, VerbForm=Trans→Conv, Definite=Red→Cons,…

Also Foreign=Foreign→Yes and log if Tense=Nar or NumType=Gen is used.

static change_headfinal(node, deprel)[source]

deprel=goeswith|flat|fixed|appos must be a head-initial flat structure.

change_neg(node)[source]

neg→advmod/det/ToDo + Polarity=Neg.

In addition, if there is a node with deprel=neg and upos=INTJ, it is checked whether it is possibly a real interjection or a negation particle, which should have upos=PART (as documented in http://universaldependencies.org/u/pos/PART.html) This kind of error (INTJ instead of PART for “не”) is common e.g. in Bulgarian v1.4, but I hope the rule is language independent (enough to be included here).

change_nmod(node)[source]

nmod→obl if parent is not nominal, but predicate.

static change_upos(node)[source]

CONJ→CCONJ.

static change_upos_copula(node)[source]

deprel=cop needs upos=AUX (or PRON).

fix_remnants_in_tree(root)[source]

Change ellipsis with remnant deprels to UDv2 ellipsis with orphans.

Remnant’s parent is always the correlate (same-role) node. Usually, correlate’s parent is the head of the whole ellipsis subtree, i.e. the first conjunct. However, sometimes remnants are deeper, e.g. ‘Over 300 Iraqis are reported dead and 500 wounded.’ with edges:

nsubjpass(reported, Iraqis)
nummod(Iraqis, 300)
remnant(300, 500)

Let’s assume all remnants in one tree are part of the same ellipsis structure.

TODO: theoretically, there may be more ellipsis structures with remnants in one tree, but I have no idea how to distinguish them from the deeper-remnants cases.

fix_text(root)[source]

Make sure root.text is filled and matching the forms+SpaceAfter=No.

static is_nominal(node)[source]

Returns ‘no’ (for predicates), ‘yes’ (sure nominals) or ‘maybe’.

Used in change_nmod.

static is_verbal(node)[source]

Returns True for verbs and nodes with copula child.

Used in change_neg.

log(node, short_msg, long_msg)[source]

Log node.address() + long_msg and add ToDo=short_msg to node.misc.

process_tree(tree)[source]

Apply all the changes on the current tree.

This method is automatically called on each tree by Udapi. After doing tree-scope changes (remnants), it calls process_node on each node. By overriding this method in subclasses you can reuse just some of the implemented changes.

reattach_coordinations(node)[source]

cc and punct in coordinations should depend on the immediately following conjunct.

udapi.block.ud.exgoogle2ud module

Block ud.ExGoogle2ud converts data which were originally annotated in the Google style, then converted to UDv2 with an older version of ud.Google2ud, and then manually edited. We do not want to lose these manual edits, so we cannot simply rerun the newer version of ud.Google2ud on the original Google data.

class udapi.block.ud.exgoogle2ud.ExGoogle2ud(lang='unk', **kwargs)[source]

Bases: udapi.core.block.Block

Convert former Google Universal Dependency Treebank into UD style.

fix_node(node)[source]

Various fixed taken from ud.Google2ud.

static is_nominal(node)[source]

Returns ‘no’ (for predicates), ‘yes’ (sure nominals) or ‘maybe’.

Used in change_nmod.

process_tree(root)[source]

Process a UD tree

udapi.block.ud.fixchain module

Block ud.FixChain for making sure deprel=fixed|flat|goeswith|list does not form a chain.

class udapi.block.ud.fixchain.FixChain(deprels='fixed, flat, goeswith, list', **kwargs)[source]

Bases: udapi.core.block.Block

Make sure deprel=fixed etc. does not form a chain, but a flat structure.

process_node(node)[source]

Process a UD node

udapi.block.ud.fixpunct module

Block ud.FixPunct for making sure punctuation is attached projectively.

Punctuation in Universal Dependencies has the tag PUNCT, dependency relation punct, and is always attached projectively, usually to the head of a neighboring subtree to its left or right. Punctuation normally does not have children. If it does, we will fix it first.

This block tries to re-attach punctuation projectively and according to the guidelines. It should help in cases where punctuation is attached randomly, always to the root or always to the neighboring word. However, there are limits to what it can do; for example it cannot always recognize whether a comma is introduced to separate the block to its left or to its right. Hence if the punctuation before running this block is almost good, the block may actually do more harm than good.

Since the punctuation should not have children, we should not create a non-projectivity if we check the root edges going to the right. However, it is still possible that we will attach the punctuation non-projectively by joining a non-projectivity that already exists. For example, the left neighbor (node i-1) may have its parent at i-3, and the node i-2 forms a gap (does not depend on i-3).

class udapi.block.ud.fixpunct.FixPunct(**kwargs)[source]

Bases: udapi.core.block.Block

Make sure punctuation nodes are attached projectively.

process_tree(root)[source]

Process a UD tree

udapi.block.ud.fixpunctchild module

Block ud.FixPunctChild for making sure punctuation nodes have no children.

class udapi.block.ud.fixpunctchild.FixPunctChild(zones='all')[source]

Bases: udapi.core.block.Block

Make sure punct nodes have no children by rehanging the children upwards.

process_node(node)[source]

Process a UD node

udapi.block.ud.fixrightheaded module

Block ud.FixRightheaded for making sure flat,fixed,appos,goeswith,list is head initial.

Note that deprel=conj should also be left-headed, but it is not included in this fix-block by default because coordinations are more difficult to convert and one should use a specialized block instead.

class udapi.block.ud.fixrightheaded.FixRightheaded(deprels='flat, fixed, appos, goeswith, list', **kwargs)[source]

Bases: udapi.core.block.Block

Make sure deprel=flat,fixed,… form a head-initial (i.e. left-headed) structure.

process_node(node)[source]

Process a UD node

udapi.block.ud.goeswithfromtext module

Block GoeswithFromText for splitting nodes and attaching via goeswith according to the text.

Usage: udapy -s ud.GoeswithFromText < in.conllu > fixed.conllu

Author: Martin Popel

class udapi.block.ud.goeswithfromtext.GoeswithFromText(keep_lemma=False, **kwargs)[source]

Bases: udapi.core.block.Block

Block for splitting nodes and attaching via goeswith according to the sentence text.

For example:

# text = Never the less, I agree.
1  Nevertheless  nevertheless  ADV    _  _  4  advmod  _  SpaceAfter=No
2  ,             ,             PUNCT  _  _  4  punct   _  _
3  I             I             PRON   _  _  4  nsubj   _  _
4  agree         agree         VERB   _  _  0  root    _  SpaceAfter=No
5  .             .             PUNCT  _  _  4  punct   _  _

is changed to:

# text = Never the less, I agree.
1  Never  never  ADV    _  _  6  advmod    _  _
2  the    the    ADV    _  _  1  goeswith  _  _
3  less   less   ADV    _  _  1  goeswith  _  SpaceAfter=No
4  ,      ,      PUNCT  _  _  6  punct     _  _
5  I      I      PRON   _  _  6  nsubj     _  _
6  agree  agree  VERB   _  _  0  root      _  SpaceAfter=No
7  .      .      PUNCT  _  _  6  punct     _  _

If used with the parameter keep_lemma=1, the result is:

# text = Never the less, I agree.
1  Never  nevertheless  ADV    _  _  6  advmod    _  _
2  the    _             ADV    _  _  1  goeswith  _  _
3  less   _             ADV    _  _  1  goeswith  _  SpaceAfter=No
4  ,      ,             PUNCT  _  _  6  punct     _  _
5  I      I             PRON   _  _  6  nsubj     _  _
6  agree  agree         VERB   _  _  0  root      _  SpaceAfter=No
7  .      .             PUNCT  _  _  6  punct     _  _

process_tree(root)[source]

Process a UD tree

udapi.block.ud.google2ud module

Block ud.Google2ud for converting Google Universal Dependency Treebank into UD.

Usage: udapy -s ud.Google2ud < google.conllu > ud2.conllu

class udapi.block.ud.google2ud.Google2ud(lang='unk', non_mwt_langs='ar en ja ko zh', **kwargs)[source]

Bases: udapi.block.ud.convert1to2.Convert1to2

Convert Google Universal Dependency Treebank into UD style.

fix_deprel(node)[source]

Convert Google dependency relations to UD deprels.

Change topology where needed.

static fix_feats(node)[source]

Remove language prefixes, capitalize names and values, apply FEATS_CHANGE.

fix_goeswith(node)[source]

Solve deprel=goeswith which is almost always wrong in the Google annotation.

fix_multiword_prep(node)[source]

Solve pobj/pcomp depending on pobj/pcomp.

Only some of these cases are multi-word prepositions (which should get deprel=fixed).

fix_upos(node)[source]

PRT→PART, .→PUNCT, NOUN+Proper→PROPN, VERB+neg→AUX etc.

process_tree(root)[source]

Apply all the changes on the current tree.

This method is automatically called on each tree by Udapi. After doing tree-scope changes (remnants), it calls process_node on each node. By overriding this method in subclasses you can reuse just some of the implemented changes.

udapi.block.ud.joinasmwt module

Block ud.JoinAsMwt for creating multi-word tokens if multiple neighboring words are not separated by a space and the boundaries between the word forms are alphabetical.

class udapi.block.ud.joinasmwt.JoinAsMwt(revert_orig_form=True, **kwargs)[source]

Bases: udapi.core.block.Block

Create MWTs if words are not separated by a space.

process_node(node)[source]

Process a UD node

udapi.block.ud.markbugs module

Block MarkBugs for checking suspicious/wrong constructions in UD v2.

See http://universaldependencies.org/release_checklist.html#syntax and http://universaldependencies.org/svalidation.html. IMPORTANT: the svalidation.html overview is not generated by this code, but by SETS-search-interface rules, which may give different results than this code.

Usage: udapy -s ud.MarkBugs < in.conllu > marked.conllu 2> log.txt

Errors are both logged to stderr and marked within the nodes’ MISC field, e.g. node.misc['Bug'] = 'aux-chain', so the output conllu file can be searched for “Bug=” occurrences.

Author: Martin Popel based on descriptions at http://universaldependencies.org/svalidation.html

class udapi.block.ud.markbugs.MarkBugs(save_stats=True, tests=None, skip=None, max_cop_lemmas=2, **kwargs)[source]

Bases: udapi.core.block.Block

Block for checking suspicious/wrong constructions in UD v2.

after_process_document(document)[source]

This method is called after each process_document.

log(node, short_msg, long_msg)[source]

Log node.address() + long_msg and add ToDo=short_msg to node.misc.

process_node(node)[source]

Process a UD node

udapi.block.ud.removemwt module

Block ud.RemoveMwt for removing multi-word tokens.

class udapi.block.ud.removemwt.RemoveMwt(zones='all')[source]

Bases: udapi.core.block.Block

Substitute MWTs with one word representing the whole MWT.

static guess_deprel(words)[source]

DEPREL of the whole MWT

static guess_feats(words)[source]

FEATS of the whole MWT

static guess_upos(words)[source]

UPOS of the whole MWT

process_tree(root)[source]

Process a UD tree

udapi.block.ud.setspaceafter module

Block SetSpaceAfter for heuristic setting of SpaceAfter=No.

Usage: udapy -s ud.SetSpaceAfter < in.conllu > fixed.conllu

Author: Martin Popel

class udapi.block.ud.setspaceafter.SetSpaceAfter(not_after='¡¿([{„', not_before='., ;:!?}])', fix_text=True, **kwargs)[source]

Bases: udapi.core.block.Block

Block for heuristic setting of the SpaceAfter=No MISC attribute.

static is_goeswith_exception(node)[source]

Is this node excepted from SpaceAfter=No because of the goeswith deprel?

Deprel=goeswith means that a space was (incorrectly) present in the original text, so we should not add SpaceAfter=No in these cases. We expect valid annotation of goeswith (no gaps, first token as head).

mark_no_space(node)[source]

Mark a node with SpaceAfter=No unless it is a goeswith exception.

process_tree(root)[source]

Process a UD tree

udapi.block.ud.setspaceafterfromtext module

Block SetSpaceAfterFromText for setting of SpaceAfter=No according to the sentence text.

Usage: udapy -s ud.SetSpaceAfterFromText < in.conllu > fixed.conllu

Author: Martin Popel

class udapi.block.ud.setspaceafterfromtext.SetSpaceAfterFromText(zones='all')[source]

Bases: udapi.core.block.Block

Block for setting of the SpaceAfter=No MISC attribute according to the sentence text.

process_tree(root)[source]

Process a UD tree

udapi.block.ud.splitunderscoretokens module

Block ud.SplitUnderscoreTokens splits tokens with underscores and attaches the new nodes using deprel=flat.

Usage: udapy -s ud.SplitUnderscoreTokens < in.conllu > fixed.conllu

Author: Martin Popel

class udapi.block.ud.splitunderscoretokens.SplitUnderscoreTokens(deprel=None, default_deprel='flat', **kwargs)[source]

Bases: udapi.core.block.Block

Block for spliting tokens with underscores and attaching the new nodes using deprel=flat.

E.g.:

1  Hillary_Rodham_Clinton  Hillary_Rodham_Clinton  PROPN  xpos  0  dep

is transformed into:

1  Hillary  Hillary  PROPN  xpos  0  dep
2  Rodham   Rodham   PROPN  xpos  1  flat
3  Clinton  Clinton  PROPN  xpos  1  flat

Real-world use cases: UD_Irish (default_deprel=fixed) and UD_Czech-CLTT v1.4.

deprel_for(node)[source]

Return deprel of the newly created nodes: flat, fixed, compound or its subtypes.

See http://universaldependencies.org/u/dep/flat.html http://universaldependencies.org/u/dep/fixed.html http://universaldependencies.org/u/dep/compound.html Note that unlike the first two, deprel=compound does not need to be head-initial.

This method implements coarse heuristic rules to decide between fixed and flat.

process_node(node)[source]

Process a UD node

Module contents
udapi.block.udpipe package
Submodules
udapi.block.udpipe.base module
udapi.block.udpipe.cs module
udapi.block.udpipe.en module
Module contents
udapi.block.util package
Submodules
udapi.block.util.eval module

Eval is a special block for evaluating code given by parameters.

class udapi.block.util.eval.Eval(doc=None, bundle=None, tree=None, node=None, start=None, end=None, before_doc=None, after_doc=None, before_bundle=None, after_bundle=None, expand_code=True, **kwargs)[source]

Bases: udapi.core.block.Block

Special block for evaluating code given by parameters.

Tricks:

* pp is a shortcut for pprint.pprint.
* $. is a shortcut for this. which is a shortcut for node., tree. etc., depending on the context.
* count_X is a shortcut for self.count[X], where X is any string (\S+) and self.count is a collections.Counter() instance.

Thus you can use code like:

util.Eval node='count_$.upos +=1; count_"TOTAL" +=1' end="pp(self.count)"

after_process_document(document)[source]

This method is called after each process_document.

before_process_document(document)[source]

This method is called before each process_document.

expand_eval_code(to_eval)[source]

Expand '$.' to 'this.', useful for one-liners.

process_bundle(bundle)[source]

Process a UD bundle

process_document(document)[source]

Process a UD document

process_end()[source]

A hook method that is executed after processing all UD data

process_start()[source]

A hook method that is executed before processing UD data

process_tree(tree)[source]

Process a UD tree

udapi.block.util.filter module

Filter is a special block for keeping/deleting subtrees specified by parameters.

class udapi.block.util.filter.Filter(delete_tree=None, delete_tree_if_node=None, delete_subtree=None, keep_tree=None, keep_tree_if_node=None, keep_subtree=None, mark=None, **kwargs)[source]

Bases: udapi.core.block.Block

Special block for keeping/deleting subtrees specified by parameters.

Example usage from the command line:

# extract subtrees governed by nouns (noun phrases)
udapy -s util.Filter keep_subtree='node.upos == "NOUN"' < in.conllu > filtered.conllu

# keep only trees which contain ToDo|Bug nodes
udapy -s util.Filter keep_tree_if_node='re.match("ToDo|Bug", str(node.misc))' < in > filtered

# keep only non-projective trees, annotate non-projective edges with Mark=nonproj and show them
udapy -T util.Filter keep_tree_if_node='node.is_nonprojective()' mark=nonproj < in | less -R

# delete trees which contain deprel=remnant
udapy -s util.Filter delete_tree_if_node='node.deprel == "remnant"' < in > filtered

# delete subtrees headed by a node with deprel=remnant
udapy -s util.Filter delete_subtree='node.deprel == "remnant"' < in > filtered

process_tree(tree)[source]

Process a UD tree

udapi.block.util.findbug module

Block util.FindBug for debugging.

Usage: If block xy.Z fails with a Python exception, insert “util.FindBug block=” into the scenario, e.g. to debug second.Block, use

udapy first.Block util.FindBug block=second.Block > bug.conllu

This will create the file bug.conllu with the bundle, which caused the bug.

class udapi.block.util.findbug.FindBug(block, first_error_only=True, **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

Debug another block by finding a minimal testcase conllu file.

process_document(document)[source]

Process a UD document

udapi.block.util.mark module

util.Mark is a special block for marking nodes specified by parameters.

class udapi.block.util.mark.Mark(node, mark=1, add=True, **kwargs)[source]

Bases: udapi.core.block.Block

Mark nodes specified by parameters.

Example usage from the command line:

# see non-projective trees with non-projective edges highlighted
udapy -TM util.Mark node='node.is_nonprojective()' < in | less -R

process_node(node)[source]

Process a UD node

udapi.block.util.markdiff module

util.MarkDiff is a special block for marking differences between parallel trees.

class udapi.block.util.markdiff.MarkDiff(gold_zone, attributes='form, lemma, upos, xpos, deprel, feats, misc', mark=1, add=False, **kwargs)[source]

Bases: udapi.core.block.Block

Mark differences between parallel trees.

process_tree(tree)[source]

Process a UD tree

udapi.block.util.resegmentgold module

util.ResegmentGold is a block for sentence alignment and re-segmentation of two zones.

class udapi.block.util.resegmentgold.ResegmentGold(gold_zone='gold', **kwargs)[source]

Bases: udapi.core.block.Block

Sentence-align two zones (gold and pred) and resegment the pred zone.

The two zones must contain the same sequence of characters.

static choose_root(p_tree, g_tree)[source]

Prevent multiple roots, which are forbidden in the evaluation script.

extract_pred_trees(document)[source]

Delete all trees with zone!=gold_zone from the document and return them.

process_document(document)[source]

Process a UD document

udapi.block.util.see module

Block util.See prints statistics about the nodes matching a given condition.

Example usage from the command line:

udapy util.See node='node.is_nonprojective()' n=3 stats=dir,children,c_upos,p_lemma,deprel,feats_split < in.conllu

Example output:

node.is_nonprojective() matches 245 out of 35766 nodes (0.7%) in 174 out of 1478 trees (11.8%)

=== dir (2 values) ===
right            193 78% delta=+37%
left              52 21% delta=-33%

=== children (9 values) ===
0                 64 26% delta=-38%
2                 58 23% delta=+14%
3                 38 15% delta= +7%

=== c_upos (15 values) ===
NOUN             118 23% delta= +4%
DET               61 12% delta= -3%
PROPN             47  9% delta= +1%

=== p_lemma (187 values) ===
il                 5  2% delta= +1%
fonction           4  1% delta= +1%
écrire             4  1% delta= +1%

=== deprel (22 values) ===
appos             41 16% delta=+15%
conj              41 16% delta=+13%
punct             36 14% delta= +4%

=== feats_split (20 values) ===
Number=Sing      114 21% delta= +2%
Gender=Masc       81 15% delta= +3%
_                 76 14% delta= -6%

In addition to the absolute counts for each value, the percentage within the matching nodes is printed, together with a delta relative to the percentage within all nodes. This helps to highlight what is special about the matching nodes.

class udapi.block.util.see.See(node, n=5, stats='dir, edge, depth, children, siblings, p_upos, p_lemma, c_upos, form, lemma, upos, deprel, feats_split', **kwargs)[source]

Bases: udapi.core.block.Block

Print statistics about the nodes specified by the parameter node.

process_end()[source]

A hook method that is executed after processing all UD data

process_node(node)[source]

Process a UD node

process_tree(root)[source]

Process a UD tree

udapi.block.util.split module

util.Split is a special block for splitting documents.

class udapi.block.util.split.Split(parts=None, bundles_per_doc=None, **kwargs)[source]

Bases: udapi.core.basereader.BaseReader

Split Udapi document (with sentence-aligned trees in bundles) into several parts.

static is_multizone_reader()[source]

Can this reader read bundles which contain more zones?

This implementation always returns True. If a subclass supports just one zone per file (e.g. read.Sentences), this method should be overridden to return False, so process_document can take advantage of this knowledge and optimize the reading (no buffer is needed even if bundles_per_doc is specified).

process_document(document)[source]

Process a UD document

udapi.block.util.wc module

Wc is a special block for printing statistics (word count etc).

class udapi.block.util.wc.Wc(**kwargs)[source]

Bases: udapi.core.block.Block

Special block for printing statistics (word count etc).

process_end()[source]

A hook method that is executed after processing all UD data

process_tree(tree)[source]

Process a UD tree

Module contents
udapi.block.write package
Submodules
udapi.block.write.conllu module

Conllu class is a writer of files in the CoNLL-U format.

class udapi.block.write.conllu.Conllu(print_sent_id=True, print_text=True, print_empty_trees=True, **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

A writer of files in the CoNLL-U format.

before_process_document(document)[source]

Print doc_json_* headers.

process_tree(tree)[source]

Process a UD tree

udapi.block.write.html module

Html class is a writer for HTML+JavaScript+SVG visualization of dependency trees.

class udapi.block.write.html.Html(path_to_js='web', **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

A writer for HTML+JavaScript+SVG visualization of dependency trees.

# from the command line
udapy write.Html < file.conllu > file.html
firefox file.html

For offline use, we first need to download three JavaScript libraries:

wget https://code.jquery.com/jquery-2.1.4.min.js
wget https://cdn.rawgit.com/eligrey/FileSaver.js/master/FileSaver.min.js
wget https://cdn.rawgit.com/ufal/js-treex-view/gh-pages/js-treex-view.js
udapy write.Html path_to_js=. < file.conllu > file.html
firefox file.html

This writer produces an html file with drawings of the dependency trees in the document (there are buttons for selecting which bundle will be shown). Under each node its form, upos and deprel are shown. In the tooltip its lemma and (morphological) features are shown. After clicking the node, all other attributes are shown. When hovering over a node, the respective word in the (plain text) sentence is highlighted. There is a button for downloading trees as SVG files.

Three JavaScript libraries are required (jquery, FileSaver and js-treex-view). By default they are linked online (so Internet access is needed when viewing), but they can be also downloaded locally (so offline browsing is possible and the loading is faster): see the Usage example above.

This block is based on Treex::View but takes a different approach. Treex::View depends on (older version of) Valence (Perl interface to Electron) and comes with a script view-treex, which takes a treex file, converts it to json behind the scenes (which is quite slow) and displays the json in a Valence window.

This block generates the json code directly to the html file, so it can be viewed with any browser or even published online. (Most of the html file is actually the json.)

When viewing the html file, the JavaScript library js-treex-view generates an svg on the fly from the json.

static print_node(node)[source]

JSON representation of a given node.

process_document(doc)[source]

Process a UD document

udapi.block.write.sdparse module

Sdparse class is a writer for Stanford dependencies format.

class udapi.block.write.sdparse.Sdparse(print_upos=True, print_feats=False, always_ord=False, **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

A writer of files in the Stanford dependencies format, suitable for Brat visualization.

Usage: udapy write.Sdparse print_upos=0 < in.conllu

Example output:

~~~ sdparse
Corriere Sport da pagina 23 a pagina 26
name(Corriere, Sport)
case(pagina-4, da)
nmod(Corriere, pagina-4)
nummod(pagina-4, 23)
case(pagina-7, a)
nmod(Corriere, pagina-7)
nummod(pagina-7, 26)
~~~

To visualize it, use embedded Brat, e.g. go to http://universaldependencies.org/visualization.html#editing. Click the edit button and paste the output of this writer excluding the ~~~ marks.

Notes: The original Stanford dependencies format allows explicit specification of the root dependency, e.g. root(ROOT-0, makes-8). However, this is not allowed by Brat, so this writer does not print it.

UD v2.0 allows tokens with spaces, but I am not aware of any Brat support.

Alternatives:

  • write.Conllu: Brat recently added support for the CoNLL-U input as well
  • write.TextModeTrees: may be more readable/useful in some use cases
  • write.Html: ditto; press the “Save as SVG” button and convert to PDF
process_tree(tree)[source]

Process a UD tree

udapi.block.write.sentences module

Sentences class is a writer for plain-text sentences.

class udapi.block.write.sentences.Sentences(if_missing='detokenize', **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

A writer of plain-text sentences (one per line).

Usage: udapy write.Sentences if_missing=empty < my.conllu > my.txt

process_tree(tree)[source]

Process a UD tree

udapi.block.write.textmodetrees module

An ASCII pretty printer of dependency trees.

class udapi.block.write.textmodetrees.TextModeTrees(print_sent_id=True, print_text=True, add_empty_line=True, indent=1, minimize_cross=True, color='auto', attributes='form, upos, deprel', print_undef_as='_', print_doc_meta=True, print_comments=False, mark='ToDo|ToDoOrigText|Bug|Mark', marked_only=False, hints=True, **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

An ASCII pretty printer of dependency trees.

# from the command line (visualize CoNLL-U files)
udapy write.TextModeTrees color=1 < file.conllu | less -R

In scenario (examples of other parameters):

write.TextModeTrees indent=1 print_sent_id=1 print_text=1
write.TextModeTrees zones=en,cs attributes=form,lemma,upos minimize_cross=0

This block prints dependency trees in plain-text format. For example the following CoNLL-U file (with tabs instead of spaces):

1  I     I     PRON  PRP Number=Sing|Person=1 2  nsubj     _ _
2  saw   see   VERB  VBD Tense=Past           0  root      _ _
3  a     a     DET   DT  Definite=Ind         4  det       _ _
4  dog   dog   NOUN  NN  Number=Sing          2  dobj      _ _
5  today today NOUN  NN  Number=Sing          2  nmod:tmod _ SpaceAfter=No
6  ,     ,     PUNCT ,   _                    2  punct     _ _
7  which which DET   WDT PronType=Rel         10 nsubj     _ _
8  was   be    VERB  VBD Person=3|Tense=Past  10 cop       _ _
9  a     a     DET   DT  Definite=Ind         10 det       _ _
10 boxer boxer NOUN  NN  Number=Sing          4  acl:relcl _ SpaceAfter=No
11 .     .     PUNCT .   _                    2  punct     _ _

will be printed (with the default parameters) as:

─┮
 │ ╭─╼ I PRON nsubj
 ╰─┾ saw VERB root
   │                        ╭─╼ a DET det
   ├────────────────────────┾ dog NOUN dobj
   ├─╼ today NOUN nmod:tmod │
   ├─╼ , PUNCT punct        │
   │                        │ ╭─╼ which DET nsubj
   │                        │ ├─╼ was VERB cop
   │                        │ ├─╼ a DET det
   │                        ╰─┶ boxer NOUN acl:relcl
   ╰─╼ . PUNCT punct

Some non-projective trees cannot be printed without crossing edges. TextModeTrees uses a special “bridge” symbol ─╪─ to mark this:

─┮
 │ ╭─╼ 1
 ├─╪───┮ 2
 ╰─┶ 3 │
       ╰─╼ 4

The default is color=auto, so if the output is printed to the console (not to a file or pipe), each node attribute is printed in a different color. If a given node’s MISC contains any of the ToDo, Bug or Mark attributes (or any other specified in the parameter mark), the node will be highlighted (by reversing the background and foreground colors).

This block’s method process_tree can be called on any node (not only root), which is useful for printing subtrees using node.print_subtree(), which is internally implemented using this block.

SEE ALSO TextModeTreesHtml
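A minimal sketch of printing a subtree from Python (the file name is an illustrative assumption; doc.bundles is the document’s list of bundles):

from udapi.core.document import Document

doc = Document()
doc.load_conllu('file.conllu')             # assumed input file
tree = doc.bundles[0].trees[0]             # first sentence, first zone
node = tree.descendants[0]                 # any non-root node works
node.print_subtree(attributes='form,upos,deprel', color=0)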

add_node(idx, node)[source]

Render a node with its attributes.

before_process_document(document)[source]

Initialize ANSI colors if color is True or ‘auto’.

If color==’auto’, detect if sys.stdout is interactive (terminal, not redirected to a file).

static colorize_attr(attr, value, marked)[source]

Return a string with color markup for a given attr and its value.

colorize_comment(comment)[source]

Return a string with color markup for a given comment.

is_marked(node)[source]

Should a given node be highlighted?

print_headers(root)[source]

Print sent_id, text and other comments related to the tree.

process_tree(root)[source]

Print the tree to (possibly redirected) sys.stdout.

should_print_tree(root)[source]

Should this tree be printed?

udapi.block.write.textmodetreeshtml module

An ASCII pretty printer of colored dependency trees in HTML.

class udapi.block.write.textmodetreeshtml.TextModeTreesHtml(color=True, title='Udapi visualization', **kwargs)[source]

Bases: udapi.block.write.textmodetrees.TextModeTrees

An ASCII pretty printer of colored dependency trees in HTML.

SYNOPSIS

# from the command line (visualize CoNLL-U files)
udapy write.TextModeTreesHtml < file.conllu > file.html

This block is a subclass of TextModeTrees, see its documentation for more info.

add_node(idx, node)[source]

Render a node with its attributes.

after_process_document(document)[source]

This method is called after each process_document.

before_process_document(document)[source]

Initialize ANSI colors if color is True or ‘auto’.

If color==’auto’, detect if sys.stdout is interactive (terminal, not redirected to a file).

static colorize_attr(attr, value, marked)[source]

Return a string with color markup for a given attr and its value.

colorize_comment(comment)[source]

Return a string with color markup for a given comment.

print_headers(root)[source]

Print sent_id, text and other comments related to the tree.

udapi.block.write.tikz module

Tikz class is a writer for LaTeX with tikz-dependency.

class udapi.block.write.tikz.Tikz(print_sent_id=True, print_text=True, print_preambule=True, attributes='form, upos', **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

A writer of files in the LaTeX with tikz-dependency format.

Usage:

udapy write.Tikz < my.conllu > my.tex
pdflatex my.tex
xdg-open my.pdf

Long sentences may result in pictures that are too wide. You can tune the width (in addition to changing the font size or using minipage and rescaling) with \begin{deptext}[column sep=0.2cm], or individually for each word: My \&[.5cm] dog \& etc. By default, the height of the horizontal segment of a dependency edge is proportional to the distance between the linked words. You can tune the height with \depedge[edge unit distance=1.5ex]{9}{1}{deprel}.

See tikz-dependency documentation for details.

Alternatives:

  • use write.TextModeTrees and include its output in a verbatim environment in LaTeX
  • use write.Html, press the “Save as SVG” button, convert to pdf and include it in LaTeX

after_process_document(doc)[source]

This method is called after each process_document.

before_process_document(doc)[source]

This method is called before each process_document.

process_tree(tree)[source]

Process a UD tree

udapi.block.write.treex module

write.Treex is a writer block for Treex XML (e.g. for TrEd editing).

class udapi.block.write.treex.Treex(files='-', filehandle=None, docname_as_file=False, encoding='utf-8', newline='\n', **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

A writer of files in the Treex format.

after_process_document(doc)[source]

This method is called after each process_document.

before_process_document(doc)[source]

This method is called before each process_document.

print_subtree(node, tree_id, indent)[source]

Recursively print trees in Treex format.

process_bundle(bundle)[source]

Process a UD bundle

process_tree(tree)[source]

Process a UD tree

udapi.block.write.vislcg module

Vislcg class is a writer for the VISL-cg format.

class udapi.block.write.vislcg.Vislcg(files='-', filehandle=None, docname_as_file=False, encoding='utf-8', newline='\n', **kwargs)[source]

Bases: udapi.core.basewriter.BaseWriter

A writer of files in the VISL-cg format, suitable for the VISL Constraint Grammar Parser.

See https://visl.sdu.dk/visl/vislcg-doc.html

Usage: udapy write.Vislcg < in.conllu > out.vislcg

Example output:

"<Қыз>"
        "қыз" n nom @nsubj #1->3
"<оның>"
        "ол" prn pers p3 sg gen @nmod:poss #2->3
"<қарындасы>"
        "қарындас" n px3sp nom @parataxis #3->8
            "е" cop aor p3 sg @cop #4->3
"<,>"
        "," cm @punct #5->8
"<ол>"
        "ол" prn pers p3 sg nom @nsubj #6->8
"<бес>"
        "бес" num @nummod #7->8
"<жаста>"
        "жас" n loc @root #8->0
            "е" cop aor p3 sg @cop #9->8
"<.>"
        "." sent @punct #10->8

Example input:

# text = Қыз оның қарындасы, ол бес жаста.
1    Қыз        қыз       _  n     nom             3  nsubj      _  _
2    оның       ол        _  prn   pers|p3|sg|gen  3  nmod:poss  _  _
3-4  қарындасы  _         _  _     _               _  _          _  _
3    қарындасы  қарындас  _  n     px3sp|nom       8  parataxis  _  _
4    _          е         _  cop   aor|p3|sg       3  cop        _  _
5    ,          ,         _  cm    _               8  punct      _  _
6    ол         ол        _  prn   pers|p3|sg|nom  8  nsubj      _  _
7    бес        бес       _  num   _               8  nummod     _  _
8-9  жаста      _         _  _     _               _  _          _  _
8    жаста      жас       _  n     loc             0  root       _  _
9    _          е         _  cop   aor|p3|sg       8  cop        _  _
10   .          .         _  sent  _               8  punct      _  _
process_tree(tree)[source]

Process a UD tree

Module contents
udapi.block.zellig_harris package
Submodules
udapi.block.zellig_harris.baseline module
class udapi.block.zellig_harris.baseline.Baseline(args=None)[source]

Bases: udapi.core.block.Block

A block for extracting context configurations for training verb representations using word2vecf.

get_word(node)[source]

Format the correct string representation of the given node according to the block settings.

Parameters:node – An input node.
Returns:A node’s string representation.
print_triple(target_node, context_node, relation_name)[source]

Print to the standard output the context triple according to the block settings.

Parameters:
  • target_node – A target word.
  • context_node – A context word.
  • relation_name – A relation name.
process_node(node)[source]

Extract context configuration for verbs according to (Vulic et al., 2016).

Parameters:node – A node to be processed.
udapi.block.zellig_harris.common module
udapi.block.zellig_harris.common.get_node_representation(node, print_lemma=False)[source]

Transform the node into the proper textual representation, as it will appear in the extracted contexts.

Parameters:
  • node – An input Node.
  • print_lemma – If true, the node lemma is used, otherwise the node form.
Returns:

A proper node textual representation for the contexts data.

udapi.block.zellig_harris.common.print_triple(node_a, relation_name, node_b, print_lemma=False)[source]

Print to the standard output the context.

udapi.block.zellig_harris.configurations module
class udapi.block.zellig_harris.configurations.Configurations(args=None)[source]

Bases: udapi.core.block.Block

An abstract class for four extraction scenarios.

apply_query(query_id, node)[source]

A generic method for applying a specified query on a specified node.

Parameters:
  • query_id – A name of the query method to be called.
  • node – An input node.
process_node(node)[source]

Extract context configuration for verbs according to (Vulic et al., 2016).

Parameters:node – A node to be processed.
process_tree(tree)[source]

If required, print detailed info about the processed sentence.

Parameters:tree – A sentence to be processed.
udapi.block.zellig_harris.csnouns module
class udapi.block.zellig_harris.csnouns.CsNouns(args=None)[source]

Bases: udapi.block.zellig_harris.configurations.Configurations

A block for extracting context configurations for Czech nouns. The configurations will be used as training data for obtaining word representations using word2vecf.

process_node(node)[source]

Extract context configurations for Czech nouns.

Parameters:node – A node to be processed.
udapi.block.zellig_harris.csverbs module
class udapi.block.zellig_harris.csverbs.CsVerbs(args=None)[source]

Bases: udapi.block.zellig_harris.configurations.Configurations

A block for extracting context configurations for Czech verbs. The configurations will be used as training data for obtaining word representations using word2vecf.

process_node(node)[source]

Extract context configurations for Czech verbs.

Parameters:node – A node to be processed.
udapi.block.zellig_harris.enhancedeps module
class udapi.block.zellig_harris.enhancedeps.EnhanceDeps(zones='all')[source]

Bases: udapi.core.block.Block

Identify new relations between nodes in the dependency tree (an analogy of effective parents/children from PML). Add these new relations into the secondary-dependencies slot.

process_node(node)[source]

Enhance secondary dependencies by applying the following rules:

  1. When the current node A has deprel ‘conj’ to its parent B, create a new secondary dependence (B.parent, B.deprel) to A.
  2. When the current node A has deprel ‘conj’ to its parent B, look at B’s children C; when C.deprel is in {subj, subjpass, iobj, dobj, compl} and there is no child D of A such that C.deprel == D.deprel, add a new secondary dependence (A, C.deprel) to C.

Parameters:node – A node to be processed.
udapi.block.zellig_harris.enhancedeps.echildren(node)[source]

Return a list with node’s effective children.

Parameters:node – An input node.
Returns:A list with node’s effective children.
Return type:list
udapi.block.zellig_harris.enhancedeps.enhance_deps(node, new_dependence)[source]

Add a new dependence to node.deps, but first check that no such dependence is already present.

Parameters:
  • node – A node to be enhanced.
  • new_dependence – A new dependence to be added into node.deps.
udapi.block.zellig_harris.enhancedeps.eparent(node)[source]

Return an effective parent for the given node.

The rule for the effective parent: when the current node A has deprel ‘conj’ to its parent B, return B.parent, otherwise return A.parent.

Parameters:node – An input node.
Returns:An effective parent.
Return type:udapi.core.node.Node
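A sketch (not the block’s own code) of the effective-parent rule described above, expressed with the public Node API:

def effective_parent(node):
    """If node is attached via deprel 'conj' to B, return B.parent, else node.parent."""
    if node.udeprel == 'conj':
        return node.parent.parent
    return node.parent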
udapi.block.zellig_harris.ennouns module
class udapi.block.zellig_harris.ennouns.EnNouns(args=None)[source]

Bases: udapi.block.zellig_harris.configurations.Configurations

A block for extracting context configurations for English nouns.

The configurations will be used as training data for obtaining word representations using word2vecf.

process_node(node)[source]

Extract context configurations for English nouns.

Parameters:node – A node to be processed.
udapi.block.zellig_harris.enverbs module
class udapi.block.zellig_harris.enverbs.EnVerbs(args=None)[source]

Bases: udapi.block.zellig_harris.configurations.Configurations

A block for extracting context configurations for English verbs.

The configurations will be used as training data for obtaining word representations using word2vecf.

process_node(node)[source]

Extract context configurations for English verbs.

Parameters:node – A node to be processed.
udapi.block.zellig_harris.queries module
udapi.block.zellig_harris.queries.en_verb_mydobj(node)[source]

Extract the ‘myobj’ relation.

Module contents
Module contents
udapi.core package
Submodules
udapi.core.basereader module

BaseReader is the base class for all reader blocks.

class udapi.core.basereader.BaseReader(files='-', filehandle=None, zone='keep', bundles_per_doc=0, encoding='utf-8', sent_id_filter=None, split_docs=False, ignore_sent_id=False, **kwargs)[source]

Bases: udapi.core.block.Block

Base class for all reader blocks.

file_number

Property with the current file number (1-based).

filehandle

Property with the current file handle.

filename

Property with the current filename.

filtered_read_tree()[source]

Load and return one more tree matching the sent_id_filter.

This method uses read_tree() internally. This is the method called by process_document.

static is_multizone_reader()[source]

Can this reader read bundles which contain more than one zone?

This implementation always returns True. If a subclass supports just one zone per file (e.g. read.Sentences), this method should be overridden to return False, so process_document can take advantage of this knowledge and optimize the reading (no buffer is needed even if bundles_per_doc is specified).

next_filehandle()[source]

Go to the next file and return its filehandle.

process_document(document)[source]

Process a UD document

read_tree()[source]

Load one (more) tree from self.files and return its root.

This method must be overridden in all readers. Usually it is the only method that needs to be implemented. The implementation in this base class raises NotImplementedError.
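For illustration, a minimal custom reader could look like the following sketch (not part of Udapi; it assumes whitespace-tokenized plain text with one sentence per line, and relies on the base class to open files and to move on when read_tree returns None, as read.Sentences does):

from udapi.core.basereader import BaseReader
from udapi.core.root import Root

class ReadTokenized(BaseReader):
    """Create one flat tree per input line, all words attached to the root."""

    @staticmethod
    def is_multizone_reader():
        return False               # one zone per file

    def read_tree(self):
        if self.filehandle is None:
            return None
        line = self.filehandle.readline()
        if line == '':
            return None            # end of the current file
        root = Root()
        for form in line.split():
            root.create_child(form=form)
        return root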

udapi.core.basewriter module

BaseWriter is the base class for all writer blocks.

class udapi.core.basewriter.BaseWriter(files='-', filehandle=None, docname_as_file=False, encoding='utf-8', newline='\n', **kwargs)[source]

Bases: udapi.core.block.Block

Base class for all writer blocks.

after_process_document(document)[source]

This method is called after each process_document.

before_process_document(document)[source]

This method is called before each process_document.

file_number

Property with the current file number (1-based).

filename

Property with the current filename.

next_filename()[source]

Go to the next file and return its filename.

udapi.core.block module

Block class represents the basic Udapi processing unit.

class udapi.core.block.Block(zones='all')[source]

Bases: object

The smallest processing unit for processing Universal Dependencies data.

after_process_document(document)[source]

This method is called after each process_document.

apply_on_document(document)[source]
before_process_document(document)[source]

This method is called before each process_document.

process_bundle(bundle)[source]

Process a UD bundle

process_document(document)[source]

Process a UD document

process_end()[source]

A hook method that is executed after processing all UD data

process_node(_)[source]

Process a UD node

process_start()[source]

A hook method that is executed before processing UD data

process_tree(tree)[source]

Process a UD tree

udapi.core.bundle module

Bundle class represents one sentence.

class udapi.core.bundle.Bundle(bundle_id=None, document=None)[source]

Bases: object

Bundle represents one sentence in a UD document.

A bundle contains one or more trees. More trees are needed e.g. in the case of parallel treebanks, where each tree represents a translation of the sentence into a different language. Trees in one bundle are distinguished by a zone label.

add_tree(root)[source]

Add an existing tree to the bundle.

address()[source]

Return bundle_id or ‘?’ if missing.

bundle_id

ID of this bundle.

check_zone(new_zone)[source]

Raise an exception if the zone is invalid or already exists.

create_tree(zone=None)[source]

Return the root of a newly added tree with a given zone.

document()[source]

Returns the document in which the bundle is contained.

get_tree(zone='')[source]

Returns the tree root whose zone is equal to zone.

has_tree(zone='')[source]

Does this bundle contain a tree with a given zone?

number
remove()[source]

Remove a bundle from the document.

trees
udapi.core.document module

Document class is a container for UD trees.

class udapi.core.document.Document[source]

Bases: object

Document is a container for Universal Dependency trees.

create_bundle()[source]

Create a new bundle and add it at the end of the document.

from_conllu_string(string)[source]

Load a document from a conllu-formatted string.

load_conllu(filename=None)[source]

Load a document from a conllu-formatted file.

store_conllu(filename)[source]

Store a document into a conllu-formatted file.

to_conllu_string()[source]

Return the document as a conllu-formatted string.
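A small round-trip sketch of the methods above (the CoNLL-U content and the output file name are illustrative assumptions):

from udapi.core.document import Document

doc = Document()
doc.from_conllu_string(
    '# sent_id = s1\n'
    '1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n'
    '\n')
print(doc.to_conllu_string())      # serialize back to a CoNLL-U string
doc.store_conllu('copy.conllu')    # or store it into a file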

udapi.core.dualdict module

DualDict is a dict with lazily synchronized string representation.

class udapi.core.dualdict.DualDict(value=None, **kwargs)[source]

Bases: collections.abc.MutableMapping

DualDict class serves as dict with lazily synchronized string representation.

>>> ddict = DualDict('Number=Sing|Person=1')
>>> ddict['Case'] = 'Nom'
>>> str(ddict)
'Case=Nom|Number=Sing|Person=1'
>>> ddict['NonExistent']
''

This class provides access to both * a structured (dict-based, deserialized) representation,

e.g. {‘Number’: ‘Sing’, ‘Person’: ‘1’}, and
  • a string (serialized) representation of the mapping, e.g. Number=Sing|Person=1.

There is a clever mechanism that makes sure that users can read and write both of the representations which are always kept synchronized. Moreover, the synchronization is lazy, so the serialization and deserialization is done only when needed. This speeds up scenarios where access to dict is not needed.

A value can be deleted in any of the following three ways:

>>> del ddict['Case']
>>> ddict['Case'] = None
>>> ddict['Case'] = ''

and it works even if the value was already missing.

clear() → None. Remove all items from D.[source]
copy()[source]

Return a deep copy of this instance.

set_mapping(value)[source]

Set the mapping from a dict or string.

If the value is None or an empty string, it is converted to the string _ (which is the CoNLL-U way of representing an empty value). If the value is a string, it is stored as is. If the value is a dict (or any instance of collections.abc.Mapping), its copy is stored. Other value types raise a ValueError exception.
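A short usage sketch of the behaviour described above:

from udapi.core.dualdict import DualDict

ddict = DualDict('Number=Sing|Person=1')
ddict.set_mapping({'Case': 'Nom'})   # replaces the whole mapping
print(str(ddict))                    # Case=Nom
ddict.set_mapping('')                # an empty value is stored as '_'
print(str(ddict))                    # _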

udapi.core.feats module

Feats class for storing morphological features of nodes in UD trees.

class udapi.core.feats.Feats(value=None, **kwargs)[source]

Bases: udapi.core.dualdict.DualDict

Feats class for storing morphological features of nodes in UD trees.

See http://universaldependencies.org/u/feat/index.html for the specification of possible feature names and values.

is_plural()[source]

Is the grammatical number plural (feats[‘Number’] contains ‘Plur’)?

is_singular()[source]

Is the grammatical number singular (feats[‘Number’] contains ‘Sing’)?
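An illustrative sketch of the helpers above:

from udapi.core.feats import Feats

feats = Feats('Case=Nom|Number=Plur')
print(feats.is_plural())      # True
print(feats.is_singular())    # False
print(feats['Case'])          # Nom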

udapi.core.files module

Files is a helper class for iterating over filenames.

class udapi.core.files.Files(filenames=None, filehandle=None, encoding='utf-8')[source]

Bases: object

Helper class for iterating over filenames.

It is used e.g. in udapi.core.basereader (as self.files = Files(filenames=pattern)). The constructor takes various arguments:

>>> files = Files(['file1.txt', 'file2.txt'])       # a list of filenames, or
>>> files = Files('file1.txt,file2.txt')            # comma- or space-separated filenames in a string
>>> files = Files('file1.txt,file2.txt.gz')         # automatic decompression of gz, xz and bz2 is supported
>>> files = Files('@my.filelist !dir??/file*.txt')  # @ marks a filelist, ! marks a wildcard pattern

The @filelist and !wildcard conventions are used in several other tools, e.g. 7z or javac.

Usage:

>>> while True:
...     filename = files.next_filename()
...     if filename is None:
...         break

or

>>> filehandle = files.next_filehandle()

filename

Property with the current file name.

has_next_file()[source]

Is there any other file in the queue after the current one?

next_filehandle()[source]

Go to the next file and retrun its filehandle or None (meaning no more files).

next_filename()[source]

Go to the next file and retrun its filename or None (meaning no more files).

number_of_files

Property with the total number of files.

string_to_filenames(string)[source]

Parse a pattern string (e.g. ‘!dir??/file*.txt’) and return a list of matching filenames.

If the string starts with ! it is interpreted as shell wildcard pattern. If it starts with @ it is interpreted as a filelist with one file per line. The string can contain more filenames (or ‘!’ and ‘@’ patterns) separated by spaces or commas. For specifying files with spaces or commas in filenames, you need to use wildcard patterns or ‘@’ filelist. (But preferably don’t use such filenames.)
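A sketch of the conventions described above (the wildcard pattern and the filelist are illustrative assumptions; the filelist must exist):

from udapi.core.files import Files

files = Files('!data/*.conllu @my.filelist')
print(files.number_of_files)
for filename in iter(files.next_filename, None):   # None means no more files
    print(filename)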

udapi.core.mwt module

MWT class represents a multi-word token.

class udapi.core.mwt.MWT(words=None, form=None, misc=None, root=None)[source]

Bases: object

Class for representing multi-word tokens in UD trees.

address()[source]

Full (document-wide) id of the multi-word token.

form
misc

Property for MISC attributes stored as a DualDict object.

See udapi.core.node.Node for details.

ord_range()[source]

Return a string suitable for the first column of CoNLL-U.

remove()[source]

Delete this multi-word token (but keep its words).

root
words
udapi.core.node module

Node class and related classes and functions.

In addition to class Node, this module contains class ListOfNodes and function find_minimal_common_treelet.

class udapi.core.node.ListOfNodes(iterable, origin)[source]

Bases: list

Helper class for results of node.children and node.descendants.

Python distinguishes properties, e.g. node.form … no brackets, and methods, e.g. node.remove() … brackets necessary. It is useful (and expected by Udapi users) to use properties, so one can do e.g. node.form += “suffix”. It is questionable whether node.parent, node.root, node.children etc. should be properties or methods. The problem of methods is that if users forget the brackets, the error may remain unnoticed because the result is interpreted as a method reference. The problem of properties is that they cannot have any parameters. However, we would like to allow e.g. node.children(add_self=True).

This class solves the problem: node.children and node.descendants are properties which return instances of this class ListOfNodes. The class implements the method __call__, so one can use e.g.:

>>> nodes = node.children
>>> nodes = node.children()
>>> nodes = node.children(add_self=True, following_only=True)

class udapi.core.node.Node(form=None, lemma=None, upos=None, xpos=None, feats=None, deprel=None, misc=None)[source]

Bases: object

Class for representing nodes in Universal Dependency trees.

Attributes form, lemma, upos, xpos and deprel are public attributes of type str, so you can use e.g. node.lemma = node.form.

node.ord is an int-type public attribute storing the node’s word-order index. Assigning to it should be done with care, so that the non-root nodes have ords 1, 2, 3, … It is recommended to use one of the node.shift_* methods for reordering nodes.

For changing the dependency structure (topology) of the tree, there is the parent property, e.g. node.parent = node.parent.parent, and the node.create_child() method. Properties node.children and node.descendants return objects of type ListOfNodes, so it is possible to do e.g.:

>>> all_children = node.children
>>> left_children = node.children(preceding_only=True)
>>> right_descendants = node.descendants(following_only=True, add_self=True)

Properties node.feats and node.misc return objects of type DualDict, so one can do e.g.:

>>> node = Node()
>>> str(node.feats)
'_'
>>> node.feats = {'Case': 'Nom', 'Person': '1'}
>>> node.feats = 'Case=Nom|Person=1'  # equivalent to the above
>>> node.feats['Case']
'Nom'
>>> node.feats['NonExistent']
''
>>> node.feats['Case'] = 'Gen'
>>> str(node.feats)
'Case=Gen|Person=1'
>>> dict(node.feats)
{'Case': 'Gen', 'Person': '1'}

Handling of enhanced dependencies, multi-word tokens and the node’s other methods is described below.
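A small sketch of building and rearranging a tree with the API described above:

from udapi.core.root import Root

root = Root()
saw = root.create_child(form='saw', upos='VERB', deprel='root')
dog = saw.create_child(form='dog', upos='NOUN', deprel='obj')
det = saw.create_child(form='a', upos='DET', deprel='det')
det.parent = dog                 # re-attach the determiner under "dog"
det.shift_before_node(dog)       # and fix the word order
print([n.form for n in root.descendants])   # ['saw', 'a', 'dog']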

address()[source]

Return full (document-wide) id of the node.

For non-root nodes, the general address format is: node.bundle.bundle_id + ‘/’ + node.root.zone + ‘#’ + node.ord, e.g. s123/en_udpipe#4. If zone is empty, the slash is excluded as well, e.g. s123#4.

children

Return a list of dependency children (direct dependants) nodes.

The returned nodes are sorted by their ord. Note that node.children is a property, not a method, so if you want all the children of a node (excluding the node itself), you should not use node.children(), but just

node.children
However, the returned result is a callable list, so you can use

nodes1 = node.children(add_self=True)
nodes2 = node.children(following_only=True)
nodes3 = node.children(preceding_only=True)
nodes4 = node.children(preceding_only=True, add_self=True)

as a shortcut for

nodes1 = sorted([node] + node.children, key=lambda n: n.ord)
nodes2 = [n for n in node.children if n.ord > node.ord]
nodes3 = [n for n in node.children if n.ord < node.ord]
nodes4 = [n for n in node.children if n.ord < node.ord] + [node]

See documentation of ListOfNodes for details.

compute_text(use_mwt=True)[source]

Return a string representing this subtree’s text (detokenized).

Compute the string by concatenating forms of nodes (words and multi-word tokens) and joining them with a single space, unless the node has SpaceAfter=No in its misc. If called on root this method returns a string suitable for storing in root.text (but it is not stored there automatically).

Technical details: If called on root, the root’s form (<ROOT>) is not included in the string. If called on non-root nodeA, nodeA’s form is included in the string, i.e. internally descendants(add_self=True) is used. Note that if the subtree is non-projective, the resulting string may be misleading.

Args: use_mwt: consider multi-word tokens? (default=True)

create_child(**kwargs)[source]

Create and return a new child of the current node.

create_empty_child(**kwargs)[source]

Create and return a new empty node child of the current node.

deprel
deps

Return enhanced dependencies as a Python list of dicts.

On the first access to the enhanced dependencies, the raw data are deserialized and the resulting list is cached.

descendants

Return a list of all descendants of the current node.

The returned nodes are sorted by their ord. Note that node.descendants is a property, not a method, so if you want all the descendants of a node (excluding the node itself), you should not use node.descendants(), but just

node.descendants
However, the returned result is a callable list, so you can use

nodes1 = node.descendants(add_self=True)
nodes2 = node.descendants(following_only=True)
nodes3 = node.descendants(preceding_only=True)
nodes4 = node.descendants(preceding_only=True, add_self=True)

as a shortcut for

nodes1 = sorted([node] + node.descendants, key=lambda n: n.ord)
nodes2 = [n for n in node.descendants if n.ord > node.ord]
nodes3 = [n for n in node.descendants if n.ord < node.ord]
nodes4 = [n for n in node.descendants if n.ord < node.ord] + [node]

See documentation of ListOfNodes for details.

feats

Property for morphological features stored as a Feats object.

Reading: You can access node.feats as a dict, e.g. if node.feats[‘Case’] == ‘Nom’. Features which are not set return an empty string (not None, not KeyError), so you can safely use e.g. if node.feats[‘MyExtra’].find(‘substring’) != -1. You can also obtain the string representation of the whole FEATS (suitable for CoNLL-U), e.g. if node.feats == ‘Case=Nom|Person=1’.

Writing: All the following assignment types are supported:

node.feats['Case'] = 'Nom'
node.feats = {'Case': 'Nom', 'Person': '1'}
node.feats = 'Case=Nom|Person=1'
node.feats = '_'

The last line has the same result as assigning None or an empty string to node.feats.

For details about the implementation and other methods (e.g. node.feats.is_plural()), see udapi.core.feats.Feats which is a subclass of DualDict.

form
get_attrs(attrs, undefs=None, stringify=True)[source]

Return multiple attributes or pseudo-attributes, possibly substituting empty ones.

Pseudo-attributes:

  • p_xy – the (pseudo) attribute xy of the parent node
  • c_xy – a list of the (pseudo) attributes xy of the children nodes
  • l_xy – the (pseudo) attribute xy of the previous (left in LTR languages) node
  • r_xy – the (pseudo) attribute xy of the following (right in LTR languages) node
  • dir – ‘left’ if the node is a left child of its parent, ‘right’ if it is a right child, ‘root’ if the node’s parent is the technical root
  • edge – length of the edge to the parent (node.ord - node.parent.ord), or 0 if the parent is the root
  • children – number of children nodes
  • siblings – number of sibling nodes
  • depth – depth in the dependency tree (the technical root has depth=0, the highest word has depth=1)
  • feats_split – list of name=value formatted strings of the FEATS

Args:
attrs: A list of attribute names, e.g. ['form', 'lemma', 'p_upos'].
undefs: A value to be used instead of None for empty (undefined) values.
stringify: Apply str() on each value (except for None).
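An illustrative call on a small hand-built tree (the output shown in the comment is what the pseudo-attributes above should yield; treat it as indicative):

from udapi.core.root import Root

root = Root()
verb = root.create_child(form='bark', upos='VERB', deprel='root')
noun = verb.create_child(form='Dogs', upos='NOUN', deprel='nsubj')
noun.shift_before_node(verb)
print(noun.get_attrs(['form', 'upos', 'p_lemma', 'dir', 'depth'], undefs='_'))
# e.g. ['Dogs', 'NOUN', '_', 'left', '2']   (no lemma was set, so undefs is used)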

is_descendant_of(node)[source]

Is the current node a descendant of the node given as argument?

is_leaf()[source]

Is this node a leaf, i.e. a node without any children?

is_nonprojective()[source]

Is the node attached to its parent non-projectively?

Is there at least one node between (word-order-wise) this node and its parent that is not dominated by the parent? For higher speed, the actual implementation does not find the node(s) which cause(s) the gap. It only checks the number of parent’s descendants in the span and the total number of nodes in the span.

is_nonprojective_gap()[source]

Is the node causing a non-projective gap within another node’s subtree?

Is there at least one node X such that

  • this node is not a descendant of X, but
  • this node is within the span of X, i.e. it is between (word-order-wise) X’s leftmost descendant (or X itself) and X’s rightmost descendant (or X itself)?
static is_root()[source]

Is the current node a (technical) root?

Returns False for all Node instances, irrespective of whether they have a parent or not. True is returned only by instances of udapi.core.root.Root.

lemma
misc

Property for MISC attributes stored as a DualDict object.

Reading: You can access node.misc as a dict, e.g. if node.misc[‘SpaceAfter’] == ‘No’. Features which are not set return an empty string (not None, not KeyError), so you can safely use e.g. if node.misc[‘MyExtra’].find(‘substring’) != -1. You can also obtain the string representation of the whole MISC (suitable for CoNLL-U), e.g. if node.misc == ‘SpaceAfter=No|X=Y’.

Writing: All the following assignment types are supported:

node.misc['SpaceAfter'] = 'No'
node.misc = {'SpaceAfter': 'No', 'X': 'Y'}
node.misc = 'SpaceAfter=No|X=Y'
node.misc = '_'

The last line has the same result as assigning None or an empty string to node.misc.

For details about the implementation, see udapi.core.dualdict.DualDict.

multiword_token

Return the multi-word token which includes this node, or None.

If this node represents a (syntactic) word which is part of a multi-word token, this method returns the corresponding instance of udapi.core.mwt.MWT. If this node is not part of any multi-word token, this method returns None.

next_node

Return the following node according to word order.

no_space_after

Boolean property as a shortcut for node.misc[“SpaceAfter”] == “No”.

ord
parent

Return dependency parent (head) node.

precedes(node)[source]

Does this node precede another node in word order (self.ord < node.ord)?

prev_node

Return the previous node according to word order.

print_subtree(**kwargs)[source]

Print ASCII visualization of the dependency structure of this subtree.

This method is useful for debugging. Internally, udapi.block.write.textmodetrees.TextModeTrees is used for the printing. All keyword arguments of this method are passed to its constructor, so you can use e.g.:

  • files: to redirect sys.stdout to a file
  • indent: to have wider trees
  • attributes: to override the default list ‘form,upos,deprel’

See TextModeTrees for details and other parameters.

raw_deps

String serialization of enhanced dependencies as stored in CoNLL-U files.

On access, the string is re-serialized from the deps list if the enhanced dependencies have been deserialized before.

remove(children=None)[source]

Delete this node and all its descendants.

Args:
children: a string specifying what to do if the node has any children.

The default (None) is to delete them (and all their descendants). ‘rehang’ means to re-attach those children to the parent of the removed node. ‘warn’ means to issue a warning if any children are present and then delete them. ‘rehang_warn’ means to rehang and warn :-).
root

Return the (technical) root node of the whole tree.

sdeprel

Return the language-specific part of dependency relation.

E.g. if deprel is acl:relcl then sdeprel is relcl. If deprel is acl then sdeprel is an empty string. If deprel is None then node.sdeprel returns None as well.

shift(reference_node, after=0, move_subtree=0, reference_subtree=0)[source]

Internal method for changing word order.

shift_after_node(reference_node)[source]

Shift this node after the reference_node.

shift_after_subtree(reference_node, without_children=0)[source]

Shift this node (and its subtree) after the subtree rooted by reference_node.

Args: without_children: shift just this node without its subtree?

shift_before_node(reference_node)[source]

Shift this node before the reference_node.

shift_before_subtree(reference_node, without_children=0)[source]

Shift this node (and its subtree) before the subtree rooted by reference_node.

Args: without_children: shift just this node without its subtree?

udeprel

Return the universal part of dependency relation, e.g. acl instead of acl:relcl.

So you can write node.udeprel instead of node.deprel.split(‘:’)[0].

unordered_descendants()[source]

Return a list of all descendants in any order.

upos
xpos
udapi.core.node.find_minimal_common_treelet(*args)[source]

Find the smallest tree subgraph containing all nodes provided in args.

>>> from udapi.core.node import find_minimal_common_treelet
>>> (nearest_common_ancestor, _) = find_minimal_common_treelet(nodeA, nodeB)
>>> nodes = [nodeA, nodeB, nodeC]
>>> (nca, added_nodes) = find_minimal_common_treelet(*nodes)

There always exists exactly one such tree subgraph (aka treelet). This function returns a tuple (root, added_nodes), where root is the root of the minimal treelet and added_nodes is an iterator over the nodes that had to be added to the input nodes to form the treelet. The input nodes should not contain the same node twice.

udapi.core.resource module

Utilities for downloading models and other resources.

udapi.core.resource.require_file(path)[source]

Return absolute path to the file and download it if missing.

udapi.core.root module

Root class represents the technical root node in each tree.

class udapi.core.root.Root(zone=None, comment='', text=None, newpar=None, newdoc=None)[source]

Bases: udapi.core.node.Node

Class for representing root nodes (technical roots) in UD trees.

add_comment(string)[source]

Add a given string to root.comment separated by a newline and space.

address()[source]

Full (document-wide) id of the root.

The general format of root nodes is: root.bundle.bundle_id + ‘/’ + root.zone, e.g. s123/en_udpipe. If zone is empty, the slash is excluded as well, e.g. s123. If bundle is missing (could occur during loading), ‘?’ is used instead. Root’s address is stored in CoNLL-U files as sent_id (in a special comment).

bundle

Return the bundle which this tree belongs to.

comment
create_multiword_token(words=None, form=None, misc=None)[source]

Create and return a new multi-word token (MWT) in this tree.

The new MWT can be optionally initialized using the following args.

Args:
words: a list of nodes which are part of the new MWT
form: a string representing the surface form of the new MWT
misc: the misc attribute of the new MWT
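A sketch of creating a multi-word token (the Spanish contraction “al” = “a” + “el”; the attachment is simplified for brevity):

from udapi.core.root import Root

root = Root()
a = root.create_child(form='a', upos='ADP', deprel='case')
el = root.create_child(form='el', upos='DET', deprel='det')
mar = root.create_child(form='mar', upos='NOUN', deprel='root')
mwt = root.create_multiword_token(words=[a, el], form='al')
print(mwt.ord_range())         # 1-2
print(root.compute_text())     # al mar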

descendants

Return a list of all descendants of the current node.

The nodes are sorted by their ord. This root-specific implementation returns all the nodes in the tree except the root itself.

empty_nodes
get_sentence(if_missing='detokenize')[source]

Return either the stored root.text or (if None) root.compute_text().

Args: if_missing: What to do if root.text is None? (default=detokenize)

  • detokenize: use root.compute_text() to compute the sentence.
  • empty: return an empty string
  • warn_detokenize, warn_empty: in addition emit a warning via logging.warning()
  • fatal: raise an exception
is_descendant_of(node)[source]

Is the current node a descendant of the node given as argument?

This root-specific implementation always returns False.

is_root()[source]

Return True for all Root instances.

json
multiword_tokens

Return a list of all multi-word tokens in this tree.

newdoc
newpar
parent

Return dependency parent (head) node.

This root-specific implementation always returns None.

remove(children=None)[source]

Remove the whole tree from its bundle.

Args:
children: a string specifying what to do if the root has any children.

The default (None) is to delete them (and all their descendants). ‘warn’ means to issue a warning.
sent_id

ID of this tree, stored in the sent_id comment in CoNLL-U.

shift(reference_node, after=0, move_subtree=0, reference_subtree=0)[source]

Attempts at changing the word order of the root raise an exception.

steal_nodes(nodes)[source]

Move nodes from another tree to this tree (append).

text
token_descendants

Return all tokens (one-word or multi-word) in the tree.

I.e. return a list of core.Node and core.MWT instances whose forms make up the raw sentence. Nodes which are part of multi-word tokens are skipped.

For example with:

1-2  vámonos  _
1    vamos    ir
2    nos      nosotros
3-4  al       _
3    a        a
4    el       el
5    mar      mar

[n.form for n in root.token_descendants] will return ['vámonos', 'al', 'mar'].

zone

Return zone (string label) of this tree.

udapi.core.run module

Class Run parses a scenario and executes it.

class udapi.core.run.Run(args)[source]

Bases: object

Processing unit that processes UD data; typically a sequence of blocks.

execute()[source]

Parse given scenario and execute it.

scenario_string()[source]

Return the scenario string.

Module contents
udapi.tool package
Submodules
udapi.tool.morphodita module
udapi.tool.udpipe module
Module contents
Module contents

Indices and tables