udapi.block.ud.jointoken module

Block ud.JoinToken will join a given token with the preceding one.

class udapi.block.ud.jointoken.JoinToken(misc_name='JoinToken', misc_value=None, **kwargs)

Bases: Block

Merge two tokens into one. A MISC attribute marks each token that should be joined with the preceding token. (The attribute may have been set by an annotator or by a previous block that tests the specific conditions under which joining is desired.) Joining cannot be done across sentence boundaries; if necessary, apply util.JoinSentence first.

Multiword tokens (MWTs) are currently supported only partially: if the MWT consists of the current and the previous node only, the two nodes are replaced by a single node representing the surface token. Any other situation involving an MWT is rejected. (The block ud.JoinAsMwt may be of some help, but it works differently.)

Merging is simple if there is no space between the tokens (see SpaceAfter=No at the first token). If there is a space, there are three options in theory:

  1. Keep the tokens as two nodes but apply the UD goeswith relation (see https://universaldependencies.org/u/overview/typos.html) and the related annotation rules.

  2. Join them into one token that contains a space. Such “words with spaces” are allowed in UD only exceptionally, if they are registered for the given language.

  3. Remove the space without any trace. This is not recommended in UD unless the underlying text was created directly for UD and can thus be considered part of the annotation.

At present, this block does not support merging with spaces except for long numbers, for which it creates words with spaces (option 2).
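The rules above can be illustrated with a minimal stand-alone sketch. It is not the udapi implementation: the function join_forms is hypothetical, it works on plain strings rather than udapi nodes, and it assumes that a "long number" means two digit-only tokens separated by a space.

```python
def join_forms(first_form, second_form, space_after_no):
    """Return the joined surface form, or None if the join is not supported.

    Hypothetical helper for illustration only; it is not part of udapi.
    """
    if space_after_no:
        # No space between the tokens: plain concatenation is enough.
        return first_form + second_form
    if first_form.isdigit() and second_form.isdigit():
        # Assumption: a "long number" is two digit-only tokens.
        # Option 2: the space is kept inside the resulting word.
        return first_form + " " + second_form
    # Any other join across a space is not supported.
    return None


print(join_forms("can", "not", True))    # cannot
print(join_forms("1", "000", False))     # 1 000
print(join_forms("in", "side", False))   # None
```

Returning None for the unsupported case mirrors the documented behavior only loosely; the real block issues a warning and leaves the tokens unchanged.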

process_node(node)

The JoinToken (or equivalent) attribute in MISC triggers the action. Either the current node is merged with the previous node and the attribute is removed from MISC, or a warning is issued that the merging cannot be done and the attribute stays in MISC. Note that multiword token lines and empty nodes are not even scanned for the attribute, so if it is there, it will stay there and no warning will be printed.
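The merge-or-warn behavior described here can be sketched without udapi. In the following simplified model each token is a dict with a form and a MISC dict; the function join_marked_tokens and the dict layout are hypothetical, and the sketch covers only the case without a space (SpaceAfter=No on the preceding token). Lemmas, tags, dependencies, multiword tokens and empty nodes are left out.

```python
import logging


def join_marked_tokens(tokens, misc_name="JoinToken"):
    """Join every marked token with its predecessor where that is possible.

    Hypothetical helper: each token is a dict with a 'form' string and a
    'misc' dict, standing in for a udapi node.
    """
    result = []
    for token in tokens:
        if misc_name not in token["misc"]:
            # Unmarked token: copy it over unchanged.
            result.append(token)
            continue
        previous = result[-1] if result else None
        if previous is None or previous["misc"].get("SpaceAfter") != "No":
            # Merging is not possible: warn and keep the attribute in MISC.
            logging.warning("Cannot join '%s' with the preceding token", token["form"])
            result.append(token)
            continue
        # Merge: concatenate the forms and drop the trigger attribute.
        # The SpaceAfter value of the second token describes the merged token.
        misc = {k: v for k, v in previous["misc"].items() if k != "SpaceAfter"}
        misc.update({k: v for k, v in token["misc"].items() if k != misc_name})
        result[-1] = {"form": previous["form"] + token["form"], "misc": misc}
    return result


sentence = [
    {"form": "can", "misc": {"SpaceAfter": "No"}},
    {"form": "not", "misc": {"JoinToken": "Yes"}},
    {"form": "go", "misc": {}},
    {"form": "home", "misc": {"JoinToken": "Yes"}},
]
joined = join_marked_tokens(sentence)
print([t["form"] for t in joined])  # ['cannot', 'go', 'home']
```

In this example "not" is merged into "cannot" and loses its JoinToken attribute, while "home" follows a space, so it triggers a warning and keeps the attribute.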