udapi.core package

Submodules

udapi.core.basereader module

BaseReader is the base class for all reader blocks.

class udapi.core.basereader.BaseReader(files='-', filehandle=None, zone='keep', bundles_per_doc=0, encoding='utf-8', sent_id_filter=None, split_docs=False, ignore_sent_id=False, **kwargs)[source]

Bases: udapi.core.block.Block

Base class for all reader blocks.

file_number

Property with the current file number (1-based).

filehandle

Property with the current file handle.

filename

Property with the current filename.

filtered_read_tree()[source]

Load and return one more tree matching the sent_id_filter.

This method uses read_tree() internally. This is the method called by process_document.

static is_multizone_reader()[source]

Can this reader read bundles which contain more than one zone?

This implementation always returns True. If a subclass supports just one zone per file (e.g. read.Sentences), this method should be overridden to return False, so that process_document can take advantage of this knowledge and optimize the reading (no buffer is needed even if bundles_per_doc is specified).

next_filehandle()[source]

Go to the next file and return its filehandle.

process_document(document)[source]

Process a UD document

read_tree()[source]

Load one (more) tree from self.files and return its root.

This method must be overridden in all readers. Usually it is the only method that needs to be implemented. The implementation in this base class raises NotImplementedError.
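The relationship between read_tree() and filtered_read_tree() described above can be illustrated with a small stand-alone sketch. It does not import udapi: SketchReader and ListReader are hypothetical stand-ins, trees are plain dicts, and treating sent_id_filter as a regular expression matched against the sentence ID is an assumption made for this example.

```python
import re


class SketchReader:
    """Stand-in for a reader base class (hypothetical, not udapi's code)."""

    def __init__(self, sent_id_filter=None):
        # Assumption: the filter is a regular expression matched against sent_id.
        self.sent_id_filter = re.compile(sent_id_filter) if sent_id_filter else None

    def read_tree(self):
        # Subclasses must override this method.
        raise NotImplementedError("read_tree must be overridden")

    def filtered_read_tree(self):
        # Keep reading until a tree matches the filter or the input is exhausted.
        tree = self.read_tree()
        while (tree is not None and self.sent_id_filter is not None
               and not self.sent_id_filter.match(tree["sent_id"])):
            tree = self.read_tree()
        return tree


class ListReader(SketchReader):
    """Toy subclass that 'reads' trees (plain dicts here) from a list."""

    def __init__(self, trees, **kwargs):
        super().__init__(**kwargs)
        self._trees = iter(trees)

    def read_tree(self):
        # Return the next tree, or None when there is nothing left to read.
        return next(self._trees, None)


reader = ListReader([{"sent_id": "a1"}, {"sent_id": "b1"}, {"sent_id": "a2"}],
                    sent_id_filter="a")
matched = []
tree = reader.filtered_read_tree()
while tree is not None:
    matched.append(tree["sent_id"])
    tree = reader.filtered_read_tree()
print(matched)
```

Only read_tree() is implemented in the subclass; the filtering loop lives in the base class, which mirrors the division of labour the documentation describes.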

udapi.core.basewriter module

BaseWriter is the base class for all writer blocks.

class udapi.core.basewriter.BaseWriter(files='-', filehandle=None, docname_as_file=False, encoding='utf-8', newline='\n', **kwargs)[source]

Bases: udapi.core.block.Block

Base class for all writer blocks.

after_process_document(document)[source]

This method is called after each process_document.

before_process_document(document)[source]

This method is called before each process_document.

file_number

Property with the current file number (1-based).

filename

Property with the current filename.

next_filename()[source]

Go to the next file and return its filename.

udapi.core.block module

Block class represents the basic Udapi processing unit.

class udapi.core.block.Block(zones='all')[source]

Bases: object

The smallest processing unit for processing Universal Dependencies data.

after_process_document(document)[source]

This method is called after each process_document.

apply_on_document(document)[source]
before_process_document(document)[source]

This method is called before each process_document.

process_bundle(bundle)[source]

Process a UD bundle

process_document(document)[source]

Process a UD document

process_end()[source]

A hook method that is executed after processing all UD data

process_node(_)[source]

Process a UD node

process_start()[source]

A hook method that is executed before processing UD data

process_tree(tree)[source]

Process a UD tree

udapi.core.bundle module

Bundle class represents one sentence.

class udapi.core.bundle.Bundle(bundle_id=None, document=None)[source]

Bases: object

Bundle represents one sentence in a UD document.

A bundle contains one or more trees. More trees are needed e.g. in the case of parallel treebanks, where each tree represents a translation of the sentence in a different language. Trees in one bundle are distinguished by a zone label.

add_tree(root)[source]

Add an existing tree to the bundle.

address()[source]

Return bundle_id or ‘?’ if missing.

bundle_id

ID of this bundle.

check_zone(new_zone)[source]

Raise an exception if the zone is invalid or already exists.

create_tree(zone=None)[source]

Return the root of a newly added tree with a given zone.

document()[source]

Returns the document in which the bundle is contained.

get_tree(zone='')[source]

Returns the tree root whose zone is equal to zone.

has_tree(zone='')[source]

Does this bundle contain a tree with a given zone?

number
remove()[source]

Remove a bundle from the document.

trees

udapi.core.document module

Document class is a container for UD trees.

class udapi.core.document.Document[source]

Bases: object

Document is a container for Universal Dependency trees.

create_bundle()[source]

Create a new bundle and add it at the end of the document.

from_conllu_string(string)[source]

Load a document from a conllu-formatted string.

load_conllu(filename=None)[source]

Load a document from a conllu-formatted file.

store_conllu(filename)[source]

Store a document into a conllu-formatted file.

to_conllu_string()[source]

Return the document as a conllu-formatted string.

udapi.core.dualdict module

DualDict is a dict with lazily synchronized string representation.

class udapi.core.dualdict.DualDict(value=None, **kwargs)[source]

Bases: collections.abc.MutableMapping

DualDict class serves as dict with lazily synchronized string representation.

>>> ddict = DualDict('Number=Sing|Person=1')
>>> ddict['Case'] = 'Nom'
>>> str(ddict)
'Case=Nom|Number=Sing|Person=1'
>>> ddict['NonExistent']
''

This class provides access to both
  • a structured (dict-based, deserialized) representation, e.g. {'Number': 'Sing', 'Person': '1'}, and
  • a string (serialized) representation of the mapping, e.g. Number=Sing|Person=1.

There is a clever mechanism that makes sure that users can read and write both of the representations which are always kept synchronized. Moreover, the synchronization is lazy, so the serialization and deserialization is done only when needed. This speeds up scenarios where access to dict is not needed.

A value can be deleted in any of the following three ways:

>>> del ddict['Case']
>>> ddict['Case'] = None
>>> ddict['Case'] = ''

and it works even if the value was already missing.
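The lazy synchronization described above can be sketched without udapi as follows. LazyDualDict is a hypothetical class written for this illustration, not the library's implementation; in particular, serializing the keys in sorted order is an assumption chosen to reproduce the doctest output shown earlier.

```python
from collections.abc import MutableMapping


class LazyDualDict(MutableMapping):
    """Illustrative sketch only; not udapi's DualDict implementation."""

    def __init__(self, value=None):
        self._string = None   # serialized form, or None when stale
        self._mapping = None  # dict form, or None when not yet parsed
        if isinstance(value, dict):
            self._mapping = dict(value)
        else:
            self._string = value if value else "_"

    def _dict(self):
        # Deserialize only on the first dict-style access.
        if self._mapping is None:
            self._mapping = {}
            if self._string != "_":
                for pair in self._string.split("|"):
                    name, _, val = pair.partition("=")
                    self._mapping[name] = val
        return self._mapping

    def __str__(self):
        # Serialize only when the cached string is stale.
        if self._string is None:
            items = sorted(self._mapping.items())  # sorted order is an assumption
            self._string = "|".join(f"{k}={v}" for k, v in items) or "_"
        return self._string

    def __getitem__(self, key):
        return self._dict().get(key, "")  # missing keys give ''

    def __setitem__(self, key, value):
        if value is None or value == "":
            self._dict().pop(key, None)   # assigning None or '' deletes
        else:
            self._dict()[key] = value
        self._string = None               # mark the string as stale

    def __delitem__(self, key):
        self._dict().pop(key, None)       # no error if already missing
        self._string = None

    def __iter__(self):
        return iter(self._dict())

    def __len__(self):
        return len(self._dict())


ddict = LazyDualDict("Number=Sing|Person=1")
ddict["Case"] = "Nom"
serialized = str(ddict)
del ddict["Case"]
ddict["Person"] = None
after_delete = str(ddict)
print(serialized, after_delete)
```

The string is parsed only when a key is first read or written, and it is rebuilt only when str() is called after a change, so code that merely copies the serialized value never pays for parsing.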

clear()[source]

Remove all items from the mapping.

copy()[source]

Return a deep copy of this instance.

set_mapping(value)[source]

Set the mapping from a dict or string.

If the value is None or an empty string, it is stored as the string _ (which is the CoNLL-U way of representing an empty value). If the value is a string, it is stored as is. If the value is a dict (or any instance of collections.abc.Mapping), its copy is stored. Other types of value raise a ValueError exception.

udapi.core.feats module

Feats class for storing morphological features of nodes in UD trees.

class udapi.core.feats.Feats(value=None, **kwargs)[source]

Bases: udapi.core.dualdict.DualDict

Feats class for storing morphological features of nodes in UD trees.

See http://universaldependencies.org/u/feat/index.html for the specification of possible feature names and values.

is_plural()[source]

Is the grammatical number plural (feats[‘Number’] contains ‘Plur’)?

is_singular()[source]

Is the grammatical number singular (feats[‘Number’] contains ‘Sing’)?

udapi.core.files module

Files is a helper class for iterating over filenames.

class udapi.core.files.Files(filenames=None, filehandle=None, encoding='utf-8')[source]

Bases: object

Helper class for iterating over filenames.

It is used e.g. in udapi.core.basereader (as self.files = Files(filenames=pattern)). Constructor takes various arguments:

>>> files = Files(['file1.txt', 'file2.txt']) # list of filenames or
>>> files = Files('file1.txt,file2.txt') # comma- or space-separated filenames in string
>>> files = Files('file1.txt,file2.txt.gz') # supports automatic decompression of gz, xz, bz2
>>> files = Files('@my.filelist !dir??/file*.txt') # @ marks filelist, ! marks wildcard pattern

The @filelist and !wildcard conventions are used in several other tools, e.g. 7z or javac.

Usage:

>>> while True:
...     filename = files.next_filename()
...     if filename is None:
...         break

or

>>> filehandle = files.next_filehandle()

filename

Property with the current file name.

has_next_file()[source]

Is there any other file in the queue after the current one?

next_filehandle()[source]

Go to the next file and return its filehandle or None (meaning no more files).

next_filename()[source]

Go to the next file and return its filename or None (meaning no more files).

number_of_files

Property with the total number of files.

string_to_filenames(string)[source]

Parse a pattern string (e.g. ‘!dir??/file*.txt’) and return a list of matching filenames.

If the string starts with ! it is interpreted as shell wildcard pattern. If it starts with @ it is interpreted as a filelist with one file per line. The string can contain more filenames (or ‘!’ and ‘@’ patterns) separated by spaces or commas. For specifying files with spaces or commas in filenames, you need to use wildcard patterns or ‘@’ filelist. (But preferably don’t use such filenames.)
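The parsing rules above can be sketched with the standard library alone. The function sketch_string_to_filenames is a hypothetical illustration, not udapi's code; sorting the wildcard matches is an assumption made here to get a deterministic order, and the demo assumes the temporary directory path contains no spaces or commas.

```python
import glob
import os
import re
import tempfile


def sketch_string_to_filenames(string):
    """Illustrative parser for '!wildcard' and '@filelist' items (not udapi's code)."""
    filenames = []
    # Items are separated by commas and/or whitespace.
    for item in re.split(r"[,\s]+", string.strip()):
        if not item:
            continue
        if item.startswith("!"):
            # Shell wildcard pattern; sorted for a deterministic order (an assumption).
            filenames.extend(sorted(glob.glob(item[1:])))
        elif item.startswith("@"):
            # File list with one filename per line.
            with open(item[1:], encoding="utf-8") as filelist:
                filenames.extend(line.strip() for line in filelist if line.strip())
        else:
            filenames.append(item)
    return filenames


plain = sketch_string_to_filenames("file1.txt,file2.txt file3.txt")

# Demo with real files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("file1.txt", "file2.txt"):
        open(os.path.join(tmp, name), "w").close()
    listing = os.path.join(tmp, "my.filelist")
    with open(listing, "w", encoding="utf-8") as out:
        out.write("extra.txt\n")
    pattern = "!" + os.path.join(tmp, "file*.txt") + " @" + listing
    found = [os.path.basename(f) for f in sketch_string_to_filenames(pattern)]
print(plain, found)
```

Because the separator is any run of commas or whitespace, a filename that itself contains a space or comma cannot be written directly, which is why the text above recommends a wildcard pattern or a file list for such names.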

udapi.core.mwt module

MWT class represents a multi-word token.

class udapi.core.mwt.MWT(words=None, form=None, misc=None, root=None)[source]

Bases: object

Class for representing multi-word tokens in UD trees.

address()[source]

Full (document-wide) id of the multi-word token.

form
misc

Property for MISC attributes stored as a DualDict object.

See udapi.core.node.Node for details.

ord_range()[source]

Return a string suitable for the first column of CoNLL-U.

remove()[source]

Delete this multi-word token (but keep its words).

root
words

udapi.core.node module

Node class and related classes and functions.

In addition to class Node, this module contains class ListOfNodes and function find_minimal_common_treelet.

class udapi.core.node.ListOfNodes(iterable, origin)[source]

Bases: list

Helper class for results of node.children and node.descendants.

Python distinguishes properties, e.g. node.form … no brackets, and methods, e.g. node.remove() … brackets necessary. It is useful (and expected by Udapi users) to use properties, so one can do e.g. node.form += "suffix". It is questionable whether node.parent, node.root, node.children etc. should be properties or methods. The problem with methods is that if users forget the brackets, the error may remain unnoticed because the result is interpreted as a method reference. The problem with properties is that they cannot have any parameters. However, we would like to allow e.g. node.children(add_self=True).

This class solves the problem: node.children and node.descendants are properties which return instances of this class ListOfNodes. This class implements the method __call__, so one can use e.g.

nodes = node.children
nodes = node.children()
nodes = node.children(add_self=True, following_only=True)
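A minimal sketch of the callable-list idea, independent of udapi, might look like this. CallableNodeList is a hypothetical name, nodes are represented by SimpleNamespace objects with only an ord attribute, and the keyword names simply follow the ones used in this documentation.

```python
from types import SimpleNamespace


class CallableNodeList(list):
    """Illustrative sketch of a callable list (not udapi's ListOfNodes)."""

    def __init__(self, iterable, origin):
        super().__init__(iterable)
        self.origin = origin  # the node whose children/descendants these are

    def __call__(self, add_self=False, following_only=False, preceding_only=False):
        # With no arguments, calling the list returns the list itself.
        if not (add_self or following_only or preceding_only):
            return self
        nodes = list(self)
        if add_self:
            nodes.append(self.origin)
        # "<=" and ">=" keep the origin node when add_self was requested.
        if preceding_only:
            nodes = [n for n in nodes if n.ord <= self.origin.ord]
        if following_only:
            nodes = [n for n in nodes if n.ord >= self.origin.ord]
        return sorted(nodes, key=lambda n: n.ord)


head = SimpleNamespace(ord=2)
kids = CallableNodeList([SimpleNamespace(ord=1), SimpleNamespace(ord=3)], origin=head)
print([n.ord for n in kids], [n.ord for n in kids(add_self=True)])
```

Since the object is a real list, forgetting the brackets still yields usable data, and adding the brackets with no arguments changes nothing; both spellings behave the same.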

class udapi.core.node.Node(form=None, lemma=None, upos=None, xpos=None, feats=None, deprel=None, misc=None)[source]

Bases: object

Class for representing nodes in Universal Dependency trees.

Attributes form, lemma, upos, xpos and deprel are public attributes of type str, so you can use e.g. node.lemma = node.form.

node.ord is a public attribute of type int for storing the node's word order index, but assigning to it should be done with care, so that the non-root nodes have ords 1, 2, 3… It is recommended to use one of the node.shift_* methods for reordering nodes.

For changing the dependency structure (topology) of the tree, there is the parent property, e.g. node.parent = node.parent.parent, and the node.create_child() method. Properties node.children and node.descendants return objects of type ListOfNodes, so it is possible to do e.g.

>>> all_children = node.children
>>> left_children = node.children(preceding_only=True)
>>> right_descendants = node.descendants(following_only=True, add_self=True)

Properties node.feats and node.misc return objects of type DualDict, so one can do e.g.:

>>> node = Node()
>>> str(node.feats)
'_'
>>> node.feats = {'Case': 'Nom', 'Person': '1'}
>>> node.feats = 'Case=Nom|Person=1' # equivalent to the above
>>> node.feats['Case']
'Nom'
>>> node.feats['NonExistent']
''
>>> node.feats['Case'] = 'Gen'
>>> str(node.feats)
'Case=Gen|Person=1'
>>> dict(node.feats)
{'Case': 'Gen', 'Person': '1'}

Handling of enhanced dependencies, multi-word tokens and other node’s methods are described below.

address()[source]

Return full (document-wide) id of the node.

For non-root nodes, the general address format is: node.bundle.bundle_id + ‘/’ + node.root.zone + ‘#’ + node.ord, e.g. s123/en_udpipe#4. If zone is empty, the slash is excluded as well, e.g. s123#4.
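The address format above can be written out as a small helper. The function sketch_address is hypothetical and takes the three components as plain values instead of reading them from a node.

```python
def sketch_address(bundle_id, zone, ord_):
    """Build 'bundle_id/zone#ord', omitting the slash when the zone is empty."""
    address = bundle_id
    if zone:
        address += "/" + zone
    return address + "#" + str(ord_)


print(sketch_address("s123", "en_udpipe", 4))
print(sketch_address("s123", "", 4))
```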

children

Return a list of dependency children (direct dependants) nodes.

The returned nodes are sorted by their ord. Note that node.children is a property, not a method, so if you want all the children of a node (excluding the node itself), you should not use node.children(), but just

node.children

However, the returned result is a callable list, so you can use

nodes1 = node.children(add_self=True)
nodes2 = node.children(following_only=True)
nodes3 = node.children(preceding_only=True)
nodes4 = node.children(preceding_only=True, add_self=True)

as a shortcut for

nodes1 = sorted([node] + node.children, key=lambda n: n.ord)
nodes2 = [n for n in node.children if n.ord > node.ord]
nodes3 = [n for n in node.children if n.ord < node.ord]
nodes4 = [n for n in node.children if n.ord < node.ord] + [node]

See documentation of ListOfNodes for details.

compute_text(use_mwt=True)[source]

Return a string representing this subtree’s text (detokenized).

Compute the string by concatenating forms of nodes (words and multi-word tokens) and joining them with a single space, unless the node has SpaceAfter=No in its misc. If called on root this method returns a string suitable for storing in root.text (but it is not stored there automatically).

Technical details: If called on root, the root’s form (<ROOT>) is not included in the string. If called on non-root nodeA, nodeA’s form is included in the string, i.e. internally descendants(add_self=True) is used. Note that if the subtree is non-projective, the resulting string may be misleading.

Args: use_mwt: consider multi-word tokens? (default=True)
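The detokenization rule described above (join forms with a single space unless SpaceAfter=No is set) can be sketched on plain data. The function sketch_compute_text is hypothetical; tokens are given as (form, misc) pairs in word order, and stripping the trailing space at the end is an assumption of this sketch.

```python
def sketch_compute_text(tokens):
    """tokens: list of (form, misc_dict) pairs in word order."""
    text = ""
    for form, misc in tokens:
        text += form
        # Add a space after the token unless its MISC says SpaceAfter=No.
        if misc.get("SpaceAfter") != "No":
            text += " "
    # Assumption: no trailing space is wanted at the end of the sentence.
    return text.rstrip()


tokens = [
    ("Hello", {"SpaceAfter": "No"}),
    (",", {}),
    ("world", {"SpaceAfter": "No"}),
    ("!", {}),
]
sentence = sketch_compute_text(tokens)
print(sentence)
```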

create_child(**kwargs)[source]

Create and return a new child of the current node.

create_empty_child(**kwargs)[source]

Create and return a new empty node child of the current node.

deprel
deps

Return enhanced dependencies as a Python list of dicts.

On the first access to the enhanced dependencies, the raw data is deserialized and the result is saved as a list.

descendants

Return a list of all descendants of the current node.

The returned nodes are sorted by their ord. Note that node.descendants is a property, not a method, so if you want all the descendants of a node (excluding the node itself), you should not use node.descendants(), but just

node.descendants

However, the returned result is a callable list, so you can use

nodes1 = node.descendants(add_self=True)
nodes2 = node.descendants(following_only=True)
nodes3 = node.descendants(preceding_only=True)
nodes4 = node.descendants(preceding_only=True, add_self=True)

as a shortcut for

nodes1 = sorted([node] + node.descendants, key=lambda n: n.ord)
nodes2 = [n for n in node.descendants if n.ord > node.ord]
nodes3 = [n for n in node.descendants if n.ord < node.ord]
nodes4 = [n for n in node.descendants if n.ord < node.ord] + [node]

See documentation of ListOfNodes for details.

feats

Property for morphological features stored as a Feats object.

Reading: You can access node.feats as a dict, e.g. if node.feats['Case'] == 'Nom'. Features which are not set return an empty string (not None, not KeyError), so you can safely use e.g. if node.feats['MyExtra'].find('substring') != -1. You can also obtain the string representation of the whole FEATS (suitable for CoNLL-U), e.g. if node.feats == 'Case=Nom|Person=1'.

Writing: All the following assignment types are supported:

node.feats['Case'] = 'Nom'
node.feats = {'Case': 'Nom', 'Person': '1'}
node.feats = 'Case=Nom|Person=1'
node.feats = '_'

The last line has the same result as assigning None or empty string to node.feats.

For details about the implementation and other methods (e.g. node.feats.is_plural()), see udapi.core.feats.Feats which is a subclass of DualDict.

form
get_attrs(attrs, undefs=None, stringify=True)[source]

Return multiple attributes or pseudo-attributes, possibly substituting empty ones.

Pseudo-attributes:
  • p_xy is the (pseudo) attribute xy of the parent node.
  • c_xy is a list of the (pseudo) attributes xy of the children nodes.
  • l_xy is the (pseudo) attribute xy of the previous (left in LTR langs) node.
  • r_xy is the (pseudo) attribute xy of the following (right in LTR langs) node.
  • dir: 'left' = the node is a left child of its parent, 'right' = the node is a right child of its parent, 'root' = the node's parent is the technical root.
  • edge: length of the edge to parent (node.ord - node.parent.ord) or 0 if parent is root.
  • children: number of children nodes.
  • siblings: number of siblings nodes.
  • depth: depth in the dependency tree (technical root has depth=0, highest word has depth=1).
  • feats_split: list of name=value formatted strings of the FEATS.

Args:
  • attrs: A list of attribute names, e.g. ['form', 'lemma', 'p_upos'].
  • undefs: A value to be used instead of None for empty (undefined) values.
  • stringify: Apply str() on each value (except for None).

is_descendant_of(node)[source]

Is the current node a descendant of the node given as argument?

is_leaf()[source]

Is this node a leaf, i.e. a node without any children?

is_nonprojective()[source]

Is the node attached to its parent non-projectively?

Is there at least one node between (word-order-wise) this node and its parent that is not dominated by the parent? For higher speed, the actual implementation does not find the node(s) which cause(s) the gap. It only checks the number of parent’s descendants in the span and the total number of nodes in the span.
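The counting shortcut described above can be sketched on plain integers. The function sketch_is_nonprojective is a hypothetical illustration, not udapi's code, and it assumes that ord values are consecutive integers, so the number of words strictly between two positions can be computed by subtraction.

```python
def sketch_is_nonprojective(node_ord, parent_ord, parent_descendant_ords):
    """Compare the span size with the number of parent's descendants inside it."""
    low, high = sorted((node_ord, parent_ord))
    # Number of words strictly between the node and its parent
    # (assumes consecutive integer ords).
    span_size = high - low - 1
    # How many of those words are dominated by the parent.
    dominated = sum(1 for o in parent_descendant_ords if low < o < high)
    # A mismatch means some word in the span is not dominated by the parent.
    return dominated != span_size


# Parent at position 4, node at position 1; word 3 hangs elsewhere in the tree.
gap_case = sketch_is_nonprojective(1, 4, {1, 2})
# Same positions, but every word in between is a descendant of the parent.
no_gap_case = sketch_is_nonprojective(1, 4, {1, 2, 3})
print(gap_case, no_gap_case)
```

Comparing two counts avoids locating the offending word, which is the speed-up the description refers to.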

is_nonprojective_gap()[source]

Is the node causing a non-projective gap within another node’s subtree?

Is there at least one node X such that
  • this node is not a descendant of X, but
  • this node is within the span of X, i.e. it is between (word-order-wise) X's leftmost descendant (or X itself) and X's rightmost descendant (or X itself)?

static is_root()[source]

Is the current node a (technical) root?

Returns False for all Node instances, irrespective of whether the node has a parent or not. True is returned only by instances of udapi.core.root.Root.

lemma
misc

Property for MISC attributes stored as a DualDict object.

Reading: You can access node.misc as a dict, e.g. if node.misc['SpaceAfter'] == 'No'. Features which are not set return an empty string (not None, not KeyError), so you can safely use e.g. if node.misc['MyExtra'].find('substring') != -1. You can also obtain the string representation of the whole MISC (suitable for CoNLL-U), e.g. if node.misc == 'SpaceAfter=No|X=Y'.

Writing: All the following assignment types are supported:

node.misc['SpaceAfter'] = 'No'
node.misc = {'SpaceAfter': 'No', 'X': 'Y'}
node.misc = 'SpaceAfter=No|X=Y'
node.misc = '_'

The last line has the same result as assigning None or empty string to node.misc.

For details about the implementation, see udapi.core.dualdict.DualDict.

multiword_token

Return the multi-word token which includes this node, or None.

If this node represents a (syntactic) word which is part of a multi-word token, this property returns the instance of udapi.core.mwt.MWT. If this node is not part of any multi-word token, this property returns None.

next_node

Return the following node according to word order.

no_space_after

Boolean property as a shortcut for node.misc[“SpaceAfter”] == “No”.

ord
parent

Return dependency parent (head) node.

precedes(node)[source]

Does this node precede another node in word order (self.ord < node.ord)?

prev_node

Return the previous node according to word order.

print_subtree(**kwargs)[source]

Print ASCII visualization of the dependency structure of this subtree.

This method is useful for debugging. Internally udapi.block.write.textmodetrees.TextModeTrees is used for the printing. All keyword arguments of this method are passed to its constructor, so you can use e.g.:
  • files: to redirect sys.stdout to a file
  • indent: to have wider trees
  • attributes: to override the default list 'form,upos,deprel'

See TextModeTrees for details and other parameters.

raw_deps

String serialization of enhanced dependencies as stored in CoNLL-U files.

When the raw enhanced dependencies are accessed, they are serialized first if they have already been deserialized.

remove(children=None)[source]

Delete this node and all its descendants.

Args: children: a string specifying what to do if the node has any children. The default (None) is to delete them (and all their descendants).

  • rehang means to re-attach those children to the parent of the removed node.
  • warn means to issue a warning if any children are present and delete them.
  • rehang_warn means to rehang and warn :-).

root

Return the (technical) root node of the whole tree.

sdeprel

Return the language-specific part of dependency relation.

E.g. if deprel = acl:relcl then sdeprel = relcl. If deprel = acl then sdeprel is an empty string. If deprel is None then node.sdeprel will return None as well.

shift(reference_node, after=0, move_subtree=0, reference_subtree=0)[source]

Internal method for changing word order.

shift_after_node(reference_node)[source]

Shift this node after the reference_node.

shift_after_subtree(reference_node, without_children=0)[source]

Shift this node (and its subtree) after the subtree rooted by reference_node.

Args: without_children: shift just this node without its subtree?

shift_before_node(reference_node)[source]

Shift this node before the reference_node.

shift_before_subtree(reference_node, without_children=0)[source]

Shift this node (and its subtree) before the subtree rooted by reference_node.

Args: without_children: shift just this node without its subtree?

udeprel

Return the universal part of dependency relation, e.g. acl instead of acl:relcl.

So you can write node.udeprel instead of node.deprel.split(‘:’)[0].
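The split into the universal part (udeprel) and the language-specific part (sdeprel, described earlier in this class) can be sketched with str.partition. The helper split_deprel is hypothetical; returning None for the universal part when deprel is None is an assumption, since the documentation states the None behaviour only for sdeprel.

```python
def split_deprel(deprel):
    """Return (universal_part, language_specific_part) of a dependency relation."""
    if deprel is None:
        # Documented for sdeprel; assumed here for the universal part as well.
        return None, None
    universal, _, specific = deprel.partition(":")
    return universal, specific


print(split_deprel("acl:relcl"))
print(split_deprel("acl"))
```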

unordered_descendants()[source]

Return a list of all descendants in any order.

upos
xpos

udapi.core.node.find_minimal_common_treelet(*args)[source]

Find the smallest tree subgraph containing all nodes provided in args.

>>> from udapi.core.node import find_minimal_common_treelet
>>> (nearest_common_ancestor, _) = find_minimal_common_treelet(nodeA, nodeB)
>>> nodes = [nodeA, nodeB, nodeC]
>>> (nca, added_nodes) = find_minimal_common_treelet(*nodes)

There always exists exactly one such tree subgraph (aka treelet). This function returns a tuple (root, added_nodes), where root is the root of the minimal treelet and added_nodes is an iterator of nodes that had to be added to nodes to form the treelet. The nodes should not contain the same node twice.
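One way to compute such a treelet is to walk from every given node up to the nearest ancestor shared by all of them. The sketch below is a hypothetical illustration and not udapi's algorithm; SketchNode has only a name and a parent pointer, and the added nodes are returned as a list instead of an iterator.

```python
class SketchNode:
    """Minimal stand-in for a tree node with a parent pointer."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def sketch_minimal_common_treelet(*nodes):
    def path_to_root(node):
        # The node itself first, then its ancestors up to the top of the tree.
        path = []
        while node is not None:
            path.append(node)
            node = node.parent
        return path

    paths = [path_to_root(n) for n in nodes]
    # Nodes lying on every upward path are the common ancestors.
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)
    # The first common ancestor met on the way up is the treelet root.
    root = next(n for n in paths[0] if n in common)
    given, added = set(nodes), []
    for path in paths:
        for n in path:
            if n not in given and n not in added:
                added.append(n)
            if n is root:
                break  # nothing above the treelet root is needed
    return root, added


# Tree: r -> a -> b and a -> c -> d
r = SketchNode("r")
a = SketchNode("a", r)
b = SketchNode("b", a)
c = SketchNode("c", a)
d = SketchNode("d", c)
treelet_root, added_nodes = sketch_minimal_common_treelet(b, d)
print(treelet_root.name, [n.name for n in added_nodes])
```

In this example b and d are connected through a and c, so both had to be added, while r lies above the treelet root and is left out.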

udapi.core.resource module

Utilities for downloading models and other resources.

udapi.core.resource.require_file(path)[source]

Return absolute path to the file and download it if missing.

udapi.core.root module

Root class represents the technical root node in each tree.

class udapi.core.root.Root(zone=None, comment='', text=None, newpar=None, newdoc=None)[source]

Bases: udapi.core.node.Node

Class for representing root nodes (technical roots) in UD trees.

add_comment(string)[source]

Add a given string to root.comment separated by a newline and space.

address()[source]

Full (document-wide) id of the root.

The general format of root nodes is: root.bundle.bundle_id + ‘/’ + root.zone, e.g. s123/en_udpipe. If zone is empty, the slash is excluded as well, e.g. s123. If bundle is missing (could occur during loading), ‘?’ is used instead. Root’s address is stored in CoNLL-U files as sent_id (in a special comment).

bundle

Return the bundle which this tree belongs to.

comment
create_multiword_token(words=None, form=None, misc=None)[source]

Create and return a new multi-word token (MWT) in this tree.

The new MWT can be optionally initialized using the following args.

Args:
  • words: a list of nodes which are part of the new MWT
  • form: string representing the surface form of the new MWT
  • misc: misc attribute of the new MWT

descendants

Return a list of all descendants of the current node.

The nodes are sorted by their ord. This root-specific implementation returns all the nodes in the tree except the root itself.

empty_nodes
get_sentence(if_missing='detokenize')[source]

Return either the stored root.text or (if None) root.compute_text().

Args: if_missing: What to do if root.text is None? (default=detokenize)

  • detokenize: use root.compute_text() to compute the sentence.
  • empty: return an empty string
  • warn_detokenize, warn_empty: in addition emit a warning via logging.warning()
  • fatal: raise an exception

is_descendant_of(node)[source]

Is the current node a descendant of the node given as argument?

This root-specific implementation always returns False.

is_root()[source]

Return True for all Root instances.

json
multiword_tokens

Return a list of all multi-word tokens in this tree.

newdoc
newpar
parent

Return dependency parent (head) node.

This root-specific implementation always returns None.

remove(children=None)[source]

Remove the whole tree from its bundle.

Args: children: a string specifying what to do if the root has any children. The default (None) is to delete them (and all their descendants). warn means to issue a warning.

sent_id

ID of this tree, stored in the sent_id comment in CoNLL-U.

shift(reference_node, after=0, move_subtree=0, reference_subtree=0)[source]

Attempts at changing the word order of root result in Exception.

steal_nodes(nodes)[source]

Move nodes from another tree to this tree (append).

text
token_descendants

Return all tokens (one-word or multi-word) in the tree.

I.e. return a list of core.Node and core.MWT instances whose forms create the raw sentence. Nodes which are part of multi-word tokens are skipped.

For example with:

1-2 vámonos _
1 vamos ir
2 nos nosotros
3-4 al _
3 a a
4 el el
5 mar mar

[n.form for n in root.token_descendants] will return [‘vámonos’, ‘al’, ‘mar’].
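The selection of surface tokens in this example can be reproduced on plain tuples. The function sketch_token_forms is a hypothetical illustration that works on (ord, form) pairs and (first_ord, last_ord, form) ranges instead of Node and MWT objects.

```python
def sketch_token_forms(words, mwts):
    """words: list of (ord, form); mwts: list of (first_ord, last_ord, form)."""
    starts = {first: (last, form) for first, last, form in mwts}
    forms, skip_until = [], 0
    for ord_, form in words:
        if ord_ <= skip_until:
            continue  # this word is inside an already emitted multi-word token
        if ord_ in starts:
            # Emit the multi-word token once and skip the words it covers.
            skip_until, mwt_form = starts[ord_]
            forms.append(mwt_form)
        else:
            forms.append(form)
    return forms


words = [(1, "vamos"), (2, "nos"), (3, "a"), (4, "el"), (5, "mar")]
mwts = [(1, 2, "vámonos"), (3, 4, "al")]
surface = sketch_token_forms(words, mwts)
print(surface)
```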

zone

Return zone (string label) of this tree.

udapi.core.run module

Class Run parses a scenario and executes it.

class udapi.core.run.Run(args)[source]

Bases: object

Processing unit that processes UD data; typically a sequence of blocks.

execute()[source]

Parse given scenario and execute it.

scenario_string()[source]

Return the scenario string.

Module contents