udapi.block.ud.jointoken module
Block ud.JoinToken will join a given token with the preceding one.
- class udapi.block.ud.jointoken.JoinToken(misc_name='JoinToken', misc_value=None, **kwargs)
Bases: Block
Merge two tokens into one. A MISC attribute is used to mark the tokens that should join the preceding token. (The attribute may have been set by an annotator or by a previous block that tests the specific conditions under which joining is desired.)
Joining cannot be done across sentence boundaries; if necessary, apply util.JoinSentence first.
Multiword tokens are currently not supported: none of the nodes to be merged can belong to a MWT. (The block ud.JoinAsMwt may be of some help, but it works differently.)
Merging is simple if there is no space between the tokens (see SpaceAfter=No at the first token). If there is a space, there are three options in theory:
  - Keep the tokens as two nodes but apply the UD goeswith relation (see https://universaldependencies.org/u/overview/typos.html) and the related annotation rules.
  - Join them into one token that contains a space. Such “words with spaces” are allowed in UD only exceptionally, if they are registered for the given language.
  - Remove the space without any trace. Not recommended in UD unless the underlying text was created directly for UD and can thus be considered part of the annotation.
At present, this block does not support merging with spaces at all, but in the future one or more of the options may be added.
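The conditions above can be summarized in a small standalone sketch. It does not use udapi; the `Token` class, its `in_mwt` flag and the `join_obstacle` function are hypothetical names invented for this illustration, and the real block may check things in a different order or word its messages differently.

```python
from dataclasses import dataclass, field
from typing import Optional


# Illustrative stand-in for a token; this is NOT udapi's Node class.
@dataclass
class Token:
    form: str
    misc: dict = field(default_factory=dict)
    in_mwt: bool = False  # True if the token belongs to a multiword token


def join_obstacle(prev: Optional[Token], cur: Token) -> Optional[str]:
    """Return the reason why `cur` cannot join `prev`, or None if it can."""
    if prev is None:
        # First token of the sentence: joining across sentences is not done.
        return "no preceding token in this sentence"
    if prev.in_mwt or cur.in_mwt:
        return "multiword tokens are not supported"
    if prev.misc.get("SpaceAfter") != "No":
        # Merging across a space is not supported at present.
        return "tokens are separated by a space"
    return None


ok = join_obstacle(Token("can", {"SpaceAfter": "No"}), Token("not"))
spaced = join_obstacle(Token("can"), Token("not"))
first = join_obstacle(None, Token("Hello"))
```

Here `ok` is `None` (the merge is possible), while `spaced` and `first` hold a short explanation of why the merge would be refused.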
- process_node(node)
The JoinToken (or equivalent) attribute in MISC will trigger action. Either the current node will be merged with the previous node and the attribute will be removed from MISC, or a warning will be issued that the merging cannot be done and the attribute will stay in MISC. Note that multiword token lines and empty nodes are not even scanned for the attribute, so if it is there, it will stay there but no warning will be printed.
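The behavior described above can be mimicked on plain dictionaries, as in the following minimal sketch. It is not the block's actual implementation. Several points are assumptions made for the illustration: tokens are dicts with `id`, `form` and `misc` keys; when `misc_value` is `None`, any non-empty value of the attribute triggers the action; the merged token inherits `SpaceAfter` from the second token; token ids are not renumbered; and the check for membership in a multiword token is omitted for brevity.

```python
import logging

logger = logging.getLogger("jointoken_sketch")


def is_regular(tok):
    """MWT range lines have ids like '1-2', empty nodes like '3.1'."""
    return "-" not in tok["id"] and "." not in tok["id"]


def join_marked_tokens(tokens, misc_name="JoinToken", misc_value=None):
    """Return a new token list in which marked tokens join their predecessor."""
    result = []
    for tok in tokens:
        tok = {**tok, "misc": dict(tok["misc"])}  # work on a copy
        value = tok["misc"].get(misc_name)
        # Assumption: with misc_value=None any non-empty value triggers joining.
        triggered = bool(value) if misc_value is None else value == misc_value
        # MWT lines and empty nodes are never scanned for the attribute.
        if not is_regular(tok) or not triggered:
            result.append(tok)
            continue
        prev = result[-1] if result else None
        if (prev is None or not is_regular(prev)
                or prev["misc"].get("SpaceAfter") != "No"):
            # Merging is impossible: warn and leave the attribute in MISC.
            logger.warning("Cannot join token %s with the preceding one", tok["id"])
            result.append(tok)
            continue
        # Merge: concatenate the forms and drop the current token together
        # with its trigger attribute.
        prev["form"] += tok["form"]
        prev["misc"].pop("SpaceAfter", None)
        if tok["misc"].get("SpaceAfter") == "No":
            prev["misc"]["SpaceAfter"] = "No"
    return result


tokens = [
    {"id": "1", "form": "can", "misc": {"SpaceAfter": "No"}},
    {"id": "2", "form": "not", "misc": {"JoinToken": "Yes"}},
    {"id": "3", "form": "go", "misc": {"JoinToken": "Yes"}},
]
joined = join_marked_tokens(tokens)
```

In this example the second token is merged into the first, giving the form `cannot`. The third token is also marked, but the merged token is followed by a space, so a warning is logged and the third token keeps its `JoinToken` attribute.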