udapi.block.tokenize.onwhitespace module

Block tokenize.OnWhitespace

class udapi.block.tokenize.onwhitespace.OnWhitespace(keep_spaces=False, **kwargs)

Bases: Block

Base tokenizer: splits on whitespace and fills SpaceAfter=No.

Use the parameter keep_spaces=True to preserve all whitespace in the sentence in the UDPipe way, i.e. using the SpacesAfter and SpacesBefore features in the MISC field. This is backward compatible with the CoNLL-U v2 SpaceAfter=No feature: no following whitespace is marked by SpaceAfter=No, and a single following space results in no whitespace-related markup.

If the text is loaded using read.Sentences and all whitespace needs to be preserved (in order to be able to reconstruct the original document), the read.Sentences block must be called with `rstrip=''`, `rstrip=\n` or `rstrip=\r\n` to prevent stripping the trailing whitespace, e.g.:

    $> echo -e "Hello \t world " | udapy read.Sentences $'rstrip=\r\n' tokenize.OnWhitespace keep_spaces=1 write.Conllu

    # sent_id = 1
    # text = Hello      world
    1    Hello    _    _    _    _    0    _    _    SpacesAfter=\s\t\s
    2    world    _    _    _    _    0    _    _    _

Note that the attribute SpaceAfter=No is missing for the token world, since it is followed by a single space.
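The markup rule described above can be sketched in plain Python. This is a minimal illustration, not the udapi implementation: the helper name misc_for_tokens is hypothetical, it handles only whitespace after a token (SpacesBefore is left out), and it assumes whitespace is escaped with the same mapping as the escape_whitespace_table attribute.

```python
import re

# Same mapping as the documented escape_whitespace_table attribute.
ESCAPES = {9: '\\t', 10: '\\n', 13: '\\r', 32: '\\s'}


def misc_for_tokens(text):
    """Return (form, misc) pairs for the whitespace-separated tokens of text.

    Hypothetical helper for illustration only; it is not part of udapi.
    """
    pairs = []
    # Each match is a run of non-whitespace plus the whitespace that follows it.
    for match in re.finditer(r'(\S+)(\s*)', text):
        form, spaces = match.group(1), match.group(2)
        if spaces == '':
            misc = 'SpaceAfter=No'   # nothing follows the token
        elif spaces == ' ':
            misc = '_'               # a single space needs no markup
        else:
            misc = 'SpacesAfter=' + spaces.translate(ESCAPES)
        pairs.append((form, misc))
    return pairs


for form, misc in misc_for_tokens("Hello \t world "):
    print(form, misc)
# Hello SpacesAfter=\s\t\s
# world _
```

The three branches correspond to the three cases in the description: no following whitespace, exactly one space, and anything else.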

keep_spaces : bool

Preserve whitespace by filling the MISC attributes SpacesAfter and SpacesBefore (default: False).

escape_whitespace_table = {9: '\\t', 10: '\\n', 13: '\\r', 32: '\\s'}
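
The keys of this table are Unicode code points (tab, line feed, carriage return, space), which is the form that Python's str.translate accepts. The sketch below shows how such a table could escape a whitespace run and how the escaped value could be turned back into the original characters. The function names escape_spaces and unescape_spaces are hypothetical and are not claimed to exist in udapi.

```python
import re

# Copy of the documented table: code point -> escaped form.
ESCAPE_TABLE = {9: '\\t', 10: '\\n', 13: '\\r', 32: '\\s'}

# Inverse mapping, keyed by the letter that follows the backslash.
UNESCAPE = {'t': '\t', 'n': '\n', 'r': '\r', 's': ' '}


def escape_spaces(spaces):
    """Escape a whitespace run, e.g. for a SpacesAfter value (illustrative)."""
    return spaces.translate(ESCAPE_TABLE)


def unescape_spaces(value):
    """Recover the original whitespace from an escaped value (illustrative)."""
    return re.sub(r'\\([tnrs])', lambda m: UNESCAPE[m.group(1)], value)


escaped = escape_spaces(' \t ')
print(escaped)  # \s\t\s
assert unescape_spaces(escaped) == ' \t '
```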

process_tree(root)

Process a UD tree

static tokenize_sentence(string)

A method to be overridden in subclasses.
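
The override pattern can be sketched without importing udapi. Both classes below are hypothetical stand-ins: the base class assumes that the default behaviour is a plain split on whitespace and that the method returns a list of token strings, neither of which is stated on this page. In real code the subclass would derive from OnWhitespace instead.

```python
import re


class WhitespaceTokenizerSketch:
    """Stand-in for the documented class (hypothetical, not udapi code)."""

    @staticmethod
    def tokenize_sentence(string):
        # Assumed default behaviour: split on runs of whitespace.
        return string.split()

    def tokens(self, sentence):
        # Callers go through the hook, so a subclass only replaces the hook.
        return self.tokenize_sentence(sentence)


class PunctuationAwareSketch(WhitespaceTokenizerSketch):
    """Subclass that overrides the static tokenization hook."""

    @staticmethod
    def tokenize_sentence(string):
        # Words stay whole; every other non-space character is its own token.
        return re.findall(r'\w+|[^\w\s]', string)


print(WhitespaceTokenizerSketch().tokens("Hello, world!"))  # ['Hello,', 'world!']
print(PunctuationAwareSketch().tokens("Hello, world!"))     # ['Hello', ',', 'world', '!']
```

Keeping the tokenization rule in one static method means a subclass can change how a sentence is split while reusing the rest of the block unchanged.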