udapi.core package
Submodules
udapi.core.basereader module
BaseReader is the base class for all reader blocks.
-
class
udapi.core.basereader.
BaseReader
(files='-', filehandle=None, zone='keep', bundles_per_doc=0, encoding='utf-8', sent_id_filter=None, split_docs=False, ignore_sent_id=False, **kwargs)[source]¶ Bases:
udapi.core.block.Block
Base class for all reader blocks.
-
file_number
¶ Property with the current file number (1-based).
-
filehandle
¶ Property with the current file handle.
-
filename
¶ Property with the current filename.
-
filtered_read_tree
()[source]¶ Load and return one more tree matching the sent_id_filter.
This method uses read_tree() internally. This is the method called by process_document.
-
static
is_multizone_reader
()[source]¶ Can this reader read bundles which contain more than one zone?
This implementation always returns True. If a subclass supports just one zone per file (e.g. read.Sentences), this method should be overridden to return False, so that process_document can take advantage of this knowledge and optimize the reading (no buffer is needed even if bundles_per_doc is specified).
-
udapi.core.basewriter module
BaseWriter is the base class for all writer blocks.
-
class
udapi.core.basewriter.
BaseWriter
(files='-', filehandle=None, docname_as_file=False, encoding='utf-8', newline='\n', **kwargs)[source]¶ Bases:
udapi.core.block.Block
Base class for all writer blocks.
-
file_number
¶ Property with the current file number (1-based).
-
filename
¶ Property with the current filename.
-
udapi.core.block module
Block class represents the basic Udapi processing unit.
udapi.core.bundle module
Bundle class represents one sentence.
-
class
udapi.core.bundle.
Bundle
(bundle_id=None, document=None)[source]¶ Bases:
object
Bundle represents one sentence in a UD document.
A bundle contains one or more trees. Multiple trees are needed e.g. in parallel treebanks, where each tree represents a translation of the sentence into a different language. Trees in one bundle are distinguished by a zone label.
-
bundle_id
¶ ID of this bundle.
-
number
¶
-
trees
¶
-
udapi.core.document module
Document class is a container for UD trees.
udapi.core.dualdict module
DualDict is a dict with lazily synchronized string representation.
-
class
udapi.core.dualdict.
DualDict
(value=None, **kwargs)[source]¶ Bases:
collections.abc.MutableMapping
DualDict class serves as dict with lazily synchronized string representation.
>>> ddict = DualDict('Number=Sing|Person=1')
>>> ddict['Case'] = 'Nom'
>>> str(ddict)
'Case=Nom|Number=Sing|Person=1'
>>> ddict['NonExistent']
''
This class provides access to both
- a structured (dict-based, deserialized) representation, e.g. {'Number': 'Sing', 'Person': '1'}, and
- a string (serialized) representation of the mapping, e.g. Number=Sing|Person=1.
Users can read and write both representations, which are always kept synchronized. Moreover, the synchronization is lazy, so serialization and deserialization are done only when needed. This speeds up scenarios where access to the dict is not needed.
A value can be deleted in any of the following three ways:
>>> del ddict['Case']
>>> ddict['Case'] = None
>>> ddict['Case'] = ''
and it works even if the value was already missing.
-
set_mapping
(value)[source]¶ Set the mapping from a dict or string.
If the value is None or an empty string, it is stored as the string _ (which is the CoNLL-U way of representing an empty value). If the value is a string, it is stored as is. If the value is a dict (or any instance of collections.abc.Mapping), its copy is stored. Other types of value raise a ValueError exception.
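The lazy synchronization described above can be illustrated with a minimal sketch. The class name LazyDualDict and its private attributes are hypothetical, and the plain alphabetical ordering of keys in the serialized string is an assumption; this is not the actual DualDict implementation.

```python
from collections.abc import Mapping, MutableMapping


class LazyDualDict(MutableMapping):
    """Hypothetical stand-in illustrating lazy dict/string synchronization."""

    def __init__(self, value=None):
        self.set_mapping(value)

    def set_mapping(self, value):
        if value is None or value == '':
            self._string, self._dict, self._dict_stale = '_', {}, False
        elif isinstance(value, str):
            # Store the string as is; parse it only when dict access is needed.
            self._string, self._dict, self._dict_stale = value, {}, True
        elif isinstance(value, Mapping):
            # Store a copy; serialize it only when the string is requested.
            self._string, self._dict, self._dict_stale = None, dict(value), False
        else:
            raise ValueError('Unsupported value type: ' + str(type(value)))

    def _sync_dict(self):
        """Deserialize the stored string if the dict is out of date."""
        if self._dict_stale:
            if self._string != '_':
                for pair in self._string.split('|'):
                    name, _, val = pair.partition('=')
                    self._dict[name] = val
            self._dict_stale = False

    def __str__(self):
        if self._string is None:  # the string is out of date
            pairs = sorted(self._dict.items())  # assumption: plain sort order
            self._string = '|'.join(k + '=' + v for k, v in pairs) or '_'
        return self._string

    def __getitem__(self, key):
        self._sync_dict()
        return self._dict.get(key, '')  # missing keys give '' (no KeyError)

    def __setitem__(self, key, value):
        self._sync_dict()
        if value is None or value == '':
            self._dict.pop(key, None)  # assigning None or '' deletes the key
        else:
            self._dict[key] = value
        self._string = None  # invalidate the cached string

    def __delitem__(self, key):
        self._sync_dict()
        self._dict.pop(key, None)  # works even if the key is missing
        self._string = None

    def __iter__(self):
        self._sync_dict()
        return iter(self._dict)

    def __len__(self):
        self._sync_dict()
        return len(self._dict)
```

In this sketch a value set from a string is never parsed unless a key is read or written, which is the point of the lazy design.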
udapi.core.feats module
Feats class for storing morphological features of nodes in UD trees.
-
class
udapi.core.feats.
Feats
(value=None, **kwargs)[source]¶ Bases:
udapi.core.dualdict.DualDict
Feats class for storing morphological features of nodes in UD trees.
See http://universaldependencies.org/u/feat/index.html for the specification of possible feature names and values.
udapi.core.files module
Files is a helper class for iterating over filenames.
-
class
udapi.core.files.
Files
(filenames=None, filehandle=None, encoding='utf-8')[source]¶ Bases:
object
Helper class for iterating over filenames.
It is used e.g. in udapi.core.basereader (as self.files = Files(filenames=pattern)). The constructor takes various arguments:
>>> files = Files(['file1.txt', 'file2.txt']) # list of filenames or
>>> files = Files('file1.txt,file2.txt') # comma- or space-separated filenames in string
>>> files = Files('file1.txt,file2.txt.gz') # supports automatic decompression of gz, xz, bz2
>>> files = Files('@my.filelist !dir??/file*.txt') # @ marks filelist, ! marks wildcard pattern
The @filelist and !wildcard conventions are used in several other tools, e.g. 7z or javac.
Usage:
>>> while True:
...     filename = files.next_filename()
...     if filename is None:
...         break
...     # …
or
>>> filehandle = files.next_filehandle()
-
filename
¶ Property with the current file name.
-
next_filehandle
()[source]¶ Go to the next file and return its filehandle or None (meaning no more files).
-
next_filename
()[source]¶ Go to the next file and return its filename or None (meaning no more files).
-
number_of_files
¶ Property with the total number of files.
-
string_to_filenames
(string)[source]¶ Parse a pattern string (e.g. ‘!dir??/file*.txt’) and return a list of matching filenames.
If the string starts with ! it is interpreted as a shell wildcard pattern. If it starts with @ it is interpreted as a filelist with one file per line. The string can contain multiple filenames (or '!' and '@' patterns) separated by spaces or commas. To specify files with spaces or commas in their names, you need to use wildcard patterns or an '@' filelist. (But preferably don't use such filenames.)
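The parsing rules above can be sketched as follows. The function name pattern_to_filenames is hypothetical, and sorting the wildcard matches is an assumption; this is only an illustration of the '!' and '@' conventions, not the code of string_to_filenames.

```python
import glob
import re


def pattern_to_filenames(string):
    """Hypothetical helper expanding '!wildcard' and '@filelist' tokens."""
    filenames = []
    # Tokens are separated by commas and/or whitespace.
    for token in re.split(r'[,\s]+', string.strip()):
        if not token:
            continue
        if token.startswith('!'):
            # Shell wildcard pattern; sorting the matches is an assumption.
            filenames.extend(sorted(glob.glob(token[1:])))
        elif token.startswith('@'):
            # Filelist with one filename per line; blank lines are skipped.
            with open(token[1:], encoding='utf-8') as filelist:
                filenames.extend(line.strip() for line in filelist if line.strip())
        else:
            filenames.append(token)
    return filenames
```

Because the separators are spaces and commas, a filename containing either cannot be passed directly, which is why the text recommends a wildcard or a filelist for such names.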
udapi.core.mwt module
MWT class represents a multi-word token.
udapi.core.node module
Node class and related classes and functions.
In addition to class Node, this module contains class ListOfNodes and function find_minimal_common_treelet.
-
class
udapi.core.node.
ListOfNodes
(iterable, origin)[source]¶ Bases:
list
Helper class for results of node.children and node.descendants.
Python distinguishes properties, e.g. node.form … no brackets, and methods, e.g. node.remove() … brackets necessary. It is useful (and expected by Udapi users) to use properties, so one can do e.g. node.form += “suffix”. It is questionable whether node.parent, node.root, node.children etc. should be properties or methods. The problem of methods is that if users forget the brackets, the error may remain unnoticed because the result is interpreted as a method reference. The problem of properties is that they cannot have any parameters. However, we would like to allow e.g. node.children(add_self=True).
This class solves the problem: node.children and node.descendants are properties which return instances of this class ListOfNodes. This class implements the method __call__, so one can use e.g.
nodes = node.children
nodes = node.children()
nodes = node.children(add_self=True, following_only=True)
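The idea of a callable list can be sketched in a few lines. The class name CallableList is hypothetical and plain integers stand in for nodes (their value plays the role of ord); the real ListOfNodes may differ in its details.

```python
class CallableList(list):
    """Hypothetical stand-in for ListOfNodes: a list that can also be called."""

    def __init__(self, iterable, origin):
        super().__init__(iterable)
        self.origin = origin  # the item this list was derived from

    def __call__(self, add_self=False, following_only=False, preceding_only=False):
        # Calling without arguments behaves like plain property access.
        if not (add_self or following_only or preceding_only):
            return self
        result = list(self)
        if add_self:
            result.append(self.origin)
        if preceding_only:
            result = [x for x in result if x <= self.origin]
        if following_only:
            result = [x for x in result if x >= self.origin]
        return sorted(result)


# Children with ords 1, 2, 5, 7 of a node whose own ord is 4.
kids = CallableList([1, 2, 5, 7], origin=4)
left_with_self = kids(preceding_only=True, add_self=True)
```

Since the object is a real list, forgetting the brackets still gives a usable result, which avoids the silent method-reference error described above.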
-
class
udapi.core.node.
Node
(form=None, lemma=None, upos=None, xpos=None, feats=None, deprel=None, misc=None)[source]¶ Bases:
object
Class for representing nodes in Universal Dependency trees.
Attributes form, lemma, upos, xpos and deprel are public attributes of type str, so you can use e.g. node.lemma = node.form.
node.ord is an int-type public attribute for storing the node's word-order index, but assigning to it should be done with care, so that the non-root nodes have ords 1, 2, 3… It is recommended to use one of the node.shift_* methods for reordering nodes.
For changing the dependency structure (topology) of the tree, there is the parent property, e.g. node.parent = node.parent.parent, and the node.create_child() method. Properties node.children and node.descendants return objects of type ListOfNodes, so it is possible to do e.g.
>>> all_children = node.children
>>> left_children = node.children(preceding_only=True)
>>> right_descendants = node.descendants(following_only=True, add_self=True)
Properties node.feats and node.misc return objects of type DualDict, so one can do e.g.:
>>> node = Node()
>>> str(node.feats)
'_'
>>> node.feats = {'Case': 'Nom', 'Person': '1'}
>>> node.feats = 'Case=Nom|Person=1' # equivalent to the above
>>> node.feats['Case']
'Nom'
>>> node.feats['NonExistent']
''
>>> node.feats['Case'] = 'Gen'
>>> str(node.feats)
'Case=Gen|Person=1'
>>> dict(node.feats)
{'Case': 'Gen', 'Person': '1'}
Handling of enhanced dependencies, multi-word tokens and other node’s methods are described below.
-
address
()[source]¶ Return full (document-wide) id of the node.
For non-root nodes, the general address format is: node.bundle.bundle_id + ‘/’ + node.root.zone + ‘#’ + node.ord, e.g. s123/en_udpipe#4. If zone is empty, the slash is excluded as well, e.g. s123#4.
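The address format can be written out as a small function. The name node_address and its three plain arguments are hypothetical; the sketch only restates the format given above.

```python
def node_address(bundle_id, zone, ord_):
    """Build a document-wide node id such as 's123/en_udpipe#4'."""
    # The slash is included only when the zone is non-empty.
    root_address = bundle_id + ('/' + zone if zone else '')
    return root_address + '#' + str(ord_)


with_zone = node_address('s123', 'en_udpipe', 4)
without_zone = node_address('s123', '', 4)
```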
-
children
¶ Return a list of dependency children (direct dependants) nodes.
The returned nodes are sorted by their ord. Note that node.children is a property, not a method, so if you want all the children of a node (excluding the node itself), you should not use node.children(), but just
node.children
However, the returned result is a callable list, so you can use
nodes1 = node.children(add_self=True)
nodes2 = node.children(following_only=True)
nodes3 = node.children(preceding_only=True)
nodes4 = node.children(preceding_only=True, add_self=True)
as a shortcut for
nodes1 = sorted([node] + node.children, key=lambda n: n.ord)
nodes2 = [n for n in node.children if n.ord > node.ord]
nodes3 = [n for n in node.children if n.ord < node.ord]
nodes4 = [n for n in node.children if n.ord < node.ord] + [node]
See documentation of ListOfNodes for details.
-
compute_text
(use_mwt=True)[source]¶ Return a string representing this subtree’s text (detokenized).
Compute the string by concatenating forms of nodes (words and multi-word tokens) and joining them with a single space, unless the node has SpaceAfter=No in its misc. If called on root this method returns a string suitable for storing in root.text (but it is not stored there automatically).
Technical details: If called on root, the root’s form (<ROOT>) is not included in the string. If called on non-root nodeA, nodeA’s form is included in the string, i.e. internally descendants(add_self=True) is used. Note that if the subtree is non-projective, the resulting string may be misleading.
Args: use_mwt: consider multi-word tokens? (default=True)
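The joining rule can be sketched as follows. The function name detokenize and the input format (a list of (form, misc dict) pairs in word order) are hypothetical; multi-word tokens and the root form are left out of this sketch.

```python
def detokenize(tokens):
    """Join forms with single spaces, except after tokens with SpaceAfter=No."""
    parts = []
    for form, misc in tokens:
        parts.append(form)
        if misc.get('SpaceAfter') != 'No':
            parts.append(' ')
    # Drop the space that may follow the last token.
    return ''.join(parts).rstrip()


sentence = detokenize([
    ('Hello', {'SpaceAfter': 'No'}),
    (',', {}),
    ('world', {'SpaceAfter': 'No'}),
    ('!', {}),
])
```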
-
deprel
¶
-
deps
¶ Return enhanced dependencies as a Python list of dicts.
After the first access to the enhanced dependencies, provide the deserialization of the raw data and save deps to the list.
-
descendants
¶ Return a list of all descendants of the current node.
The returned nodes are sorted by their ord. Note that node.descendants is a property, not a method, so if you want all the descendants of a node (excluding the node itself), you should not use node.descendants(), but just
node.descendants
However, the returned result is a callable list, so you can use
nodes1 = node.descendants(add_self=True)
nodes2 = node.descendants(following_only=True)
nodes3 = node.descendants(preceding_only=True)
nodes4 = node.descendants(preceding_only=True, add_self=True)
as a shortcut for
nodes1 = sorted([node] + node.descendants, key=lambda n: n.ord)
nodes2 = [n for n in node.descendants if n.ord > node.ord]
nodes3 = [n for n in node.descendants if n.ord < node.ord]
nodes4 = [n for n in node.descendants if n.ord < node.ord] + [node]
See documentation of ListOfNodes for details.
-
feats
¶ Property for morphological features stored as a Feats object.
Reading: You can access node.feats as a dict, e.g. if node.feats[‘Case’] == ‘Nom’. Features which are not set return an empty string (not None, not KeyError), so you can safely use e.g. if node.feats[‘MyExtra’].find(‘substring’) != -1. You can also obtain the string representation of the whole FEATS (suitable for CoNLL-U), e.g. if node.feats == ‘Case=Nom|Person=1’.
Writing: All the following assignment types are supported:
node.feats['Case'] = 'Nom'
node.feats = {'Case': 'Nom', 'Person': '1'}
node.feats = 'Case=Nom|Person=1'
node.feats = '_'
The last line has the same result as assigning None or an empty string to node.feats.
For details about the implementation and other methods (e.g. node.feats.is_plural()), see
udapi.core.feats.Feats
which is a subclass of DualDict.
-
form
¶
-
get_attrs
(attrs, undefs=None, stringify=True)[source]¶ Return multiple attributes or pseudo-attributes, possibly substituting empty ones.
Pseudo-attributes:
- p_xy is the (pseudo) attribute xy of the parent node.
- c_xy is a list of the (pseudo) attributes xy of the children nodes.
- l_xy is the (pseudo) attribute xy of the previous (left in LTR langs) node.
- r_xy is the (pseudo) attribute xy of the following (right in LTR langs) node.
- dir: 'left' = the node is a left child of its parent, 'right' = the node is a right child of its parent, 'root' = the node's parent is the technical root.
- edge: length of the edge to parent (node.ord - node.parent.ord) or 0 if parent is root.
- children: number of children nodes.
- siblings: number of sibling nodes.
- depth: depth in the dependency tree (technical root has depth=0, highest word has depth=1).
- feats_split: list of name=value formatted strings of the FEATS.
Args:
- attrs: A list of attribute names, e.g. ['form', 'lemma', 'p_upos'].
- undefs: A value to be used instead of None for empty (undefined) values.
- stringify: Apply str() on each value (except for None).
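The dir and edge pseudo-attributes can be illustrated with a short sketch. The function name dir_and_edge is hypothetical, and representing the technical root by ord 0 is an assumption of this example.

```python
def dir_and_edge(node_ord, parent_ord):
    """Return the (dir, edge) pair for a node, given word-order indices."""
    if parent_ord == 0:  # assumption: the technical root has ord 0
        return 'root', 0
    edge = node_ord - parent_ord
    # A negative edge means the node precedes its parent (left child).
    return ('left' if edge < 0 else 'right'), edge


left_child = dir_and_edge(2, 5)
right_child = dir_and_edge(7, 5)
root_child = dir_and_edge(3, 0)
```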
-
is_nonprojective
()[source]¶ Is the node attached to its parent non-projectively?
Is there at least one node between (word-order-wise) this node and its parent that is not dominated by the parent? For higher speed, the actual implementation does not find the node(s) which cause(s) the gap. It only checks the number of parent’s descendants in the span and the total number of nodes in the span.
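The counting check can be sketched on a tree given as a dict mapping each node's ord to its parent's ord. The function name and this input format are hypothetical, as is the use of 0 for the technical root; treating children of the technical root as projective is also an assumption.

```python
def is_nonprojective(node, parent_of):
    """Check projectivity of the edge between `node` and its parent."""
    parent = parent_of[node]
    if parent == 0:
        return False  # assumption: edges from the technical root are projective
    lo, hi = sorted((node, parent))

    def dominated_by_parent(n):
        # Walk up from n until the parent or the technical root is reached.
        while n != 0:
            n = parent_of[n]
            if n == parent:
                return True
        return False

    # Count the parent's descendants strictly inside the span of the edge.
    in_span = [n for n in range(lo + 1, hi) if dominated_by_parent(n)]
    # Projective edges have every node in the span dominated by the parent.
    return len(in_span) != hi - lo - 1


# Node 2 hangs on the technical root, 3 and 4 on node 2, and 1 on node 3.
tree = {1: 3, 2: 0, 3: 2, 4: 2}
```

Only the two counts are compared, so the nodes that cause the gap are never identified, as noted above.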
-
is_nonprojective_gap
()[source]¶ Is the node causing a non-projective gap within another node’s subtree?
Is there at least one node X such that
- this node is not a descendant of X, but
- this node is within the span of X, i.e. it is between (word-order-wise) X's leftmost descendant (or X itself) and X's rightmost descendant (or X itself)?
-
static
is_root
()[source]¶ Is the current node a (technical) root?
Returns False for all Node instances, irrespective of whether the node has a parent or not. True is returned only by instances of udapi.core.root.Root.
-
lemma
¶
-
misc
¶ Property for MISC attributes stored as a DualDict object.
Reading: You can access node.misc as a dict, e.g. if node.misc[‘SpaceAfter’] == ‘No’. Features which are not set return an empty string (not None, not KeyError), so you can safely use e.g. if node.misc[‘MyExtra’].find(‘substring’) != -1. You can also obtain the string representation of the whole MISC (suitable for CoNLL-U), e.g. if node.misc == ‘SpaceAfter=No|X=Y’.
Writing: All the following assignment types are supported:
node.misc['SpaceAfter'] = 'No'
node.misc = {'SpaceAfter': 'No', 'X': 'Y'}
node.misc = 'SpaceAfter=No|X=Y'
node.misc = '_'
The last line has the same result as assigning None or an empty string to node.misc.
For details about the implementation, see
udapi.core.dualdict.DualDict
.
-
multiword_token
¶ Return the multi-word token which includes this node, or None.
If this node represents a (syntactic) word which is part of a multi-word token, this property returns the instance of udapi.core.mwt.MWT. If this node is not part of any multi-word token, it returns None.
-
next_node
¶ Return the following node according to word order.
-
no_space_after
¶ Boolean property as a shortcut for node.misc[“SpaceAfter”] == “No”.
-
ord
¶
-
parent
¶ Return dependency parent (head) node.
-
prev_node
¶ Return the previous node according to word order.
-
print_subtree
(**kwargs)[source]¶ Print ASCII visualization of the dependency structure of this subtree.
This method is useful for debugging. Internally, udapi.block.write.textmodetrees.TextModeTrees is used for the printing. All keyword arguments of this method are passed to its constructor, so you can use e.g.:
- files: to redirect sys.stdout to a file
- indent: to have wider trees
- attributes: to override the default list 'form,upos,deprel'
See TextModeTrees for details and other parameters.
-
raw_deps
¶ String serialization of enhanced dependencies as stored in CoNLL-U files.
After the access to the raw enhanced dependencies, provide the serialization if they were deserialized already.
-
remove
(children=None)[source]¶ Delete this node and all its descendants.
Args: children: a string specifying what to do if the node has any children.
- The default (None) is to delete them (and all their descendants).
- rehang means to re-attach those children to the parent of the removed node.
- warn means to issue a warning if any children are present and delete them.
- rehang_warn means to rehang and warn :-).
-
root
¶ Return the (technical) root node of the whole tree.
-
sdeprel
¶ Return the language-specific part of dependency relation.
E.g. if deprel = acl:relcl then sdeprel = relcl. If deprel = acl then sdeprel is an empty string. If deprel is None then node.sdeprel will return None as well.
-
shift
(reference_node, after=0, move_subtree=0, reference_subtree=0)[source]¶ Internal method for changing word order.
-
shift_after_subtree
(reference_node, without_children=0)[source]¶ Shift this node (and its subtree) after the subtree rooted by reference_node.
Args: without_children: shift just this node without its subtree?
-
shift_before_subtree
(reference_node, without_children=0)[source]¶ Shift this node (and its subtree) before the subtree rooted by reference_node.
Args: without_children: shift just this node without its subtree?
-
udeprel
¶ Return the universal part of dependency relation, e.g. acl instead of acl:relcl.
So you can write node.udeprel instead of node.deprel.split(‘:’)[0].
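The split into a universal and a language-specific part can be sketched as below. The function name split_deprel is hypothetical, and returning None for the universal part when deprel is None is an assumption (the text above states this only for sdeprel).

```python
def split_deprel(deprel):
    """Return (udeprel, sdeprel) for a dependency relation string."""
    if deprel is None:
        return None, None  # assumption for the universal part
    # Split at the first colon only; a missing subtype gives ''.
    udeprel, _, sdeprel = deprel.partition(':')
    return udeprel, sdeprel


with_subtype = split_deprel('acl:relcl')
without_subtype = split_deprel('acl')
```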
-
upos
¶
-
xpos
¶
-
-
udapi.core.node.
find_minimal_common_treelet
(*args)[source]¶ Find the smallest tree subgraph containing all nodes provided in args.
>>> from udapi.core.node import find_minimal_common_treelet
>>> (nearest_common_ancestor, _) = find_minimal_common_treelet(nodeA, nodeB)
>>> nodes = [nodeA, nodeB, nodeC]
>>> (nca, added_nodes) = find_minimal_common_treelet(*nodes)
There always exists exactly one such tree subgraph (aka treelet). This function returns a tuple (root, added_nodes), where root is the root of the minimal treelet and added_nodes is an iterator of nodes that had to be added to nodes to form the treelet. The nodes argument should not contain the same node twice.
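One way to compute such a treelet is to intersect the paths from each node to the top of the tree. The sketch below is not the udapi implementation: the function name, the child-to-parent dict used as input, and the sorted list returned in place of an iterator are all assumptions made for illustration.

```python
def minimal_common_treelet(nodes, parent_of):
    """Return (root, added_nodes) for distinct `nodes` in a child->parent dict."""

    def path_to_top(n):
        # The node itself followed by its ancestors, nearest first.
        path = [n]
        while parent_of.get(n) is not None:
            n = parent_of[n]
            path.append(n)
        return path

    paths = [path_to_top(n) for n in nodes]
    # Nodes lying on every path are the common ancestors (or the nodes themselves).
    common = set(paths[0]).intersection(*paths[1:])
    # The first common node on a bottom-up path is the nearest common ancestor.
    root = next(n for n in paths[0] if n in common)
    # The treelet consists of each path from a node up to and including the root.
    treelet = set()
    for path in paths:
        treelet.update(path[:path.index(root) + 1])
    return root, sorted(treelet - set(nodes))


# a is the top node; b and c hang on a, d on b, e on d, f on c.
parents = {'a': None, 'b': 'a', 'c': 'a', 'd': 'b', 'e': 'd', 'f': 'c'}
root, added = minimal_common_treelet(['e', 'f'], parents)
```

In this sketch the treelet root itself is reported among the added nodes whenever it was not part of the input.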
udapi.core.resource module
Utilities for downloading models and other resources.
udapi.core.root module
Root class represents the technical root node in each tree.
-
class
udapi.core.root.
Root
(zone=None, comment='', text=None, newpar=None, newdoc=None)[source]¶ Bases:
udapi.core.node.Node
Class for representing root nodes (technical roots) in UD trees.
-
address
()[source]¶ Full (document-wide) id of the root.
The general format of root nodes is: root.bundle.bundle_id + ‘/’ + root.zone, e.g. s123/en_udpipe. If zone is empty, the slash is excluded as well, e.g. s123. If bundle is missing (could occur during loading), ‘?’ is used instead. Root’s address is stored in CoNLL-U files as sent_id (in a special comment).
-
bundle
¶ Return the bundle which this tree belongs to.
-
comment
¶
-
create_multiword_token
(words=None, form=None, misc=None)[source]¶ Create and return a new multi-word token (MWT) in this tree.
The new MWT can be optionally initialized using the following args.
Args:
- words: a list of nodes which are part of the new MWT
- form: string representing the surface form of the new MWT
- misc: misc attribute of the new MWT
-
descendants
¶ Return a list of all descendants of the current node.
The nodes are sorted by their ord. This root-specific implementation returns all the nodes in the tree except the root itself.
-
empty_nodes
¶
-
get_sentence
(if_missing='detokenize')[source]¶ Return either the stored root.text or (if None) root.compute_text().
Args: if_missing: What to do if root.text is None? (default=detokenize)
- detokenize: use root.compute_text() to compute the sentence.
- empty: return an empty string
- warn_detokenize, warn_empty: in addition emit a warning via logging.warning()
- fatal: raise an exception
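The if_missing options can be sketched as a standalone function. The signature (a stored text plus a compute_text callable), the warning message, and the choice of RuntimeError are hypothetical, since the text above only says that an exception is raised.

```python
import logging


def get_sentence(text, compute_text, if_missing='detokenize'):
    """Return the stored text, or handle a missing one as requested."""
    if text is not None:
        return text
    if if_missing.startswith('warn'):
        logging.warning('Sentence text is missing')
    if if_missing.endswith('detokenize'):
        return compute_text()
    if if_missing.endswith('empty'):
        return ''
    # Covers 'fatal'; the exception type is an assumption of this sketch.
    raise RuntimeError('Sentence text is missing')


stored = get_sentence('Hi.', lambda: 'unused')
computed = get_sentence(None, lambda: 'a b')
```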
-
is_descendant_of
(node)[source]¶ Is the current node a descendant of the node given as argument?
This root-specific implementation always returns False.
-
json
¶
-
multiword_tokens
¶ Return a list of all multi-word tokens in this tree.
-
newdoc
¶
-
newpar
¶
-
parent
¶ Return dependency parent (head) node.
This root-specific implementation always returns None.
-
remove
(children=None)[source]¶ Remove the whole tree from its bundle.
Args: children: a string specifying what to do if the root has any children.
The default (None) is to delete them (and all their descendants). warn means to issue a warning.
-
sent_id
¶ ID of this tree, stored in the sent_id comment in CoNLL-U.
-
shift
(reference_node, after=0, move_subtree=0, reference_subtree=0)[source]¶ Attempts at changing the word order of root result in Exception.
-
text
¶
-
token_descendants
¶ Return all tokens (one-word or multi-word) in the tree.
I.e. return a list of core.Node and core.MWT instances whose forms create the raw sentence. Nodes that are part of multi-word tokens are skipped.
For example, with:
1-2 vámonos _
1 vamos ir
2 nos nosotros
3-4 al _
3 a a
4 el el
5 mar mar
[n.form for n in root.token_descendants] will return ['vámonos', 'al', 'mar'].
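The skipping logic can be sketched on plain tuples. The function name token_forms and the input format (words as (ord, form) pairs, multi-word tokens as (first_ord, last_ord, form) triples) are hypothetical and stand in for Node and MWT instances.

```python
def token_forms(words, mwts):
    """Return surface forms, replacing the words covered by an MWT with its form."""
    starts = {first: (last, form) for first, last, form in mwts}
    forms, skip_until = [], 0
    for ord_, form in words:
        if ord_ in starts:
            # The first word of an MWT: emit the MWT form instead of the word.
            skip_until, mwt_form = starts[ord_]
            forms.append(mwt_form)
        elif ord_ > skip_until:
            # A word outside any MWT.
            forms.append(form)
    return forms


words = [(1, 'vamos'), (2, 'nos'), (3, 'a'), (4, 'el'), (5, 'mar')]
mwts = [(1, 2, 'vámonos'), (3, 4, 'al')]
forms = token_forms(words, mwts)
```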
-
zone
¶ Return zone (string label) of this tree.
-
udapi.core.run module
Class Run parses a scenario and executes it.