udapi.block.tokenize.onwhitespace module
Block tokenize.OnWhitespace
- class udapi.block.tokenize.onwhitespace.OnWhitespace(keep_spaces=False, **kwargs)
Bases: Block
Base tokenizer: splits on whitespace and fills SpaceAfter=No.
Use the parameter keep_spaces=True to preserve all whitespace in the sentence in the UDPipe way, i.e. using the SpacesAfter and SpacesBefore features in the MISC field. This is backward compatible with the CoNLL-U v2 SpaceAfter=No feature: a token with no following whitespace is marked with SpaceAfter=No, and a single following space produces no whitespace-related markup.
If the text is loaded with read.Sentences and all whitespace must be preserved (so that the original document can be reconstructed), the read.Sentences block must be called with rstrip='', rstrip=\n or rstrip=\r\n to prevent stripping the trailing whitespace, e.g.:

    $> echo -e "Hello \t world " | udapy read.Sentences $'rstrip=\r\n' tokenize.OnWhitespace keep_spaces=1 write.Conllu
    # sent_id = 1
    # text = Hello world
    1   Hello   _   _   _   _   0   _   _   SpacesAfter=\s\t\s
    2   world   _   _   _   _   0   _   _   _

Note that the attribute SpaceAfter=No is missing for the token world, since it is followed by a single space.
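The markup rule described above can be illustrated with a small standalone sketch. It does not use udapi and is not the library's implementation; the function name sketch_tokenize and the constant ESCAPES are hypothetical, and the escape mapping is simply copied from the escape_whitespace_table attribute listed on this page.

```python
import re

# Assumption: a local copy of the escape table listed on this page.
ESCAPES = {9: '\\t', 10: '\\n', 13: '\\r', 32: '\\s'}


def sketch_tokenize(text, keep_spaces=False):
    """Return (form, misc) pairs; an illustrative sketch, not udapi's code."""
    tokens = []
    # Whitespace before the first token (empty string if there is none).
    leading = re.match(r'\s*', text).group()
    for i, match in enumerate(re.finditer(r'(\S+)(\s*)', text)):
        form, spaces = match.group(1), match.group(2)
        misc = []
        if keep_spaces and i == 0 and leading:
            misc.append('SpacesBefore=' + leading.translate(ESCAPES))
        if spaces == '':
            # No whitespace follows the token.
            misc.append('SpaceAfter=No')
        elif keep_spaces and spaces != ' ':
            # Anything other than a single space is recorded verbatim (escaped).
            misc.append('SpacesAfter=' + spaces.translate(ESCAPES))
        # A single following space yields no whitespace-related markup.
        tokens.append((form, '|'.join(misc) or '_'))
    return tokens


for form, misc in sketch_tokenize("Hello \t world ", keep_spaces=True):
    print(form, misc, sep='\t')
```

With keep_spaces left at False, the sketch only distinguishes "no whitespace follows" (SpaceAfter=No) from "some whitespace follows" (no markup), which mirrors the default behaviour described above.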
- keep_spaces (bool)
preserve whitespace by filling the MISC attributes SpacesAfter and SpacesBefore (default: False)
- escape_whitespace_table = {9: '\\t', 10: '\\n', 13: '\\r', 32: '\\s'}
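The keys of this table are Unicode code points (tab, line feed, carriage return, space), which is the shape that Python's str.translate accepts. The following is a minimal sketch of escaping a whitespace run and reversing it again; the helper names escape_spaces and unescape_spaces and the inverse table are hypothetical and not part of udapi.

```python
import re

# Copy of the table as listed above.
escape_whitespace_table = {9: '\\t', 10: '\\n', 13: '\\r', 32: '\\s'}

# Hypothetical inverse mapping: two-character escape -> original character.
unescape_table = {esc: chr(code) for code, esc in escape_whitespace_table.items()}


def escape_spaces(spaces):
    """Replace tab, LF, CR and space with their two-character escapes."""
    return spaces.translate(escape_whitespace_table)


def unescape_spaces(escaped):
    """Turn the escapes back into the whitespace characters they stand for."""
    return re.sub(r'\\[tnrs]', lambda m: unescape_table[m.group()], escaped)


original = " \t "
escaped = escape_spaces(original)
print(escaped)
print(unescape_spaces(escaped) == original)
```

Such a round trip is what makes it possible to rebuild the original whitespace from the SpacesAfter and SpacesBefore values.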