"""
The tok-tok tokenizer is a simple, general tokenizer, where the input has one
sentence per line; thus only final period is tokenized.

Tok-tok has been tested on, and gives reasonably good results for English,
Persian, Russian, Czech, French, German, Vietnamese, Tajik, and a few others.
The input should be in UTF-8 encoding.

Reference:
Jon Dehdari. 2014. A Neurophysiologically-Inspired Statistical Language
Model (Doctoral dissertation). Columbus, OH, USA: The Ohio State University.
"""

import re

from nltk.tokenize.api import TokenizerI


class ToktokTokenizer(TokenizerI):
    """
    This is a Python port of the tok-tok.pl from
    https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl

    >>> toktok = ToktokTokenizer()
    >>> text = u'Is 9.5 or 525,600 my favorite number?'
    >>> print(toktok.tokenize(text, return_str=True))
    Is 9.5 or 525,600 my favorite number ?
    >>> text = u'The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things'
    >>> print(toktok.tokenize(text, return_str=True))
    The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
    >>> text = u'¡This, is a sentence with weird» symbols… appearing everywhere¿'
    >>> expected = u'¡ This , is a sentence with weird » symbols … appearing everywhere ¿'
    >>> assert toktok.tokenize(text, return_str=True) == expected
    >>> toktok.tokenize(text) == [u'¡', u'This', u',', u'is', u'a', u'sentence', u'with', u'weird', u'»', u'symbols', u'…', u'appearing', u'everywhere', u'¿']
    True
    """

    # Replace non-breaking spaces with normal spaces.
    NON_BREAKING = re.compile("\u00A0"), " "

    # Pad some funky punctuation.
    FUNKY_PUNCT_1 = re.compile(r'([،;؛¿!"\])}»›”؟¡%٪°±©®।॥…])'), r" \1 "
    # Pad more funky punctuation.
    FUNKY_PUNCT_2 = re.compile(r"([({\[“‘„‚«‹「『])"), r" \1 "
    # Pad En dash and Em dash.
    EN_EM_DASHES = re.compile("([–—])"), r" \1 "

    # Replace problematic characters with numeric character references.
    AMPERCENT = re.compile("& "), "&amp; "
    TAB = re.compile("\t"), " &#9; "
    PIPE = re.compile(r"\|"), " &#124; "

    # Pad numbers with commas to keep them from further tokenization.
    COMMA_IN_NUM = re.compile(r"(?<!,)([,،])(?![,\d])"), r" \1 "

    # Just pad problematic (often neurotic) hyphens, single quotes, etc.
    PROB_SINGLE_QUOTES = re.compile(r"(['’`])"), r" \1 "
    # Group ` and ' quotes.
    STUPID_QUOTES_1 = re.compile(r" ` ` "), r" `` "
    STUPID_QUOTES_2 = re.compile(r" ' ' "), r" '' "

    # Don't tokenize a period unless it ends the line and isn't
    # preceded by another period, e.g. "something ..."
    FINAL_PERIOD_1 = re.compile(r"(?<!\.)\.$"), r" ."
    # Don't tokenize a period unless it ends the line, e.g. " ... stuff."
    FINAL_PERIOD_2 = re.compile(r"""(?<!\.)\.\s*(["'’»›”]) *$"""), r" . \1"

    # Treat continuous commas as a fake German/Czech/etc. opening quote: „
    MULTI_COMMAS = re.compile(r"(,{2,})"), r" \1 "
    # Treat continuous dashes as a fake en-dash, etc.
    MULTI_DASHES = re.compile(r"(-{2,})"), r" \1 "
    # Treat multiple periods as a single token (e.g. an ellipsis).
    MULTI_DOTS = re.compile(r"(\.{2,})"), r" \1 "

    # NOTE: In the original module these three strings enumerate the full
    # \p{Open_Punctuation}, \p{Close_Punctuation} and \p{Currency_Symbol}
    # character classes from Perl's perluniprops. The complete lists could
    # not be recovered from the compiled dump, so only representative
    # subsets are shown here.
    OPEN_PUNCT = "([{“‘„‚«‹「『"
    CLOSE_PUNCT = ")\\]}»›”’」』"
    CURRENCY_SYM = "$¢£¤¥"

    # Pad spaces around opening punctuation.
    OPEN_PUNCT_RE = re.compile("([{}])".format(OPEN_PUNCT)), r"\1 "
    # Pad spaces around closing punctuation.
    CLOSE_PUNCT_RE = re.compile("([{}])".format(CLOSE_PUNCT)), r"\1 "
    # Pad spaces around currency symbols.
    CURRENCY_SYM_RE = re.compile("([{}])".format(CURRENCY_SYM)), r"\1 "

    # Used for tokenizing URL-unfriendly characters: [:/?#]
    URL_FOE_1 = re.compile(r":(?!//)"), r" : "   # in Perl, s{:(?!//)}{ : }g;
    URL_FOE_2 = re.compile(r"\?(?!\S)"), r" ? "  # in Perl, s{\?(?!\S)}{ ? }g;
    # in Perl: m{://} or m{\S+\.\S+/\S+} or s{/}{ / }g;
    URL_FOE_3 = re.compile(r"(:\/\/)[\S+\.\S+\/\S+][\/]"), " / "
    URL_FOE_4 = re.compile(r" /"), r" / "        # s{ /}{ / }g;

    # Merge multiple spaces.
    ONE_SPACE = re.compile(r" {2,}"), " "

    TOKTOK_REGEXES = [
        NON_BREAKING,
        FUNKY_PUNCT_1,
        URL_FOE_1,
        URL_FOE_2,
        URL_FOE_3,
        URL_FOE_4,
        AMPERCENT,
        TAB,
        PIPE,
        OPEN_PUNCT_RE,
        CLOSE_PUNCT_RE,
        MULTI_COMMAS,
        COMMA_IN_NUM,
        FINAL_PERIOD_2,
        PROB_SINGLE_QUOTES,
        STUPID_QUOTES_1,
        STUPID_QUOTES_2,
        CURRENCY_SYM_RE,
        EN_EM_DASHES,
        MULTI_DASHES,
        MULTI_DOTS,
        FINAL_PERIOD_1,
        ONE_SPACE,
    ]

    def tokenize(self, text, return_str=False):
        """Return a tokenized copy of *text*."""
        text = str(text)  # Ensure the input is a (unicode) string.
        # Apply each (regex, substitution) rule in order.
        for regexp, substitution in self.TOKTOK_REGEXES:
            text = regexp.sub(substitution, text)
        # Finally, strip heading and trailing spaces.
        text = text.strip()
        return text if return_str else text.split()
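The tokenizer above is just an ordered cascade of regex substitutions followed by a whitespace split. The sketch below demonstrates that technique standalone, without depending on the `nltk` package; `mini_tokenize` and `RULES` are hypothetical names for illustration, and only a small subset of the rules is reproduced.

```python
import re

# A minimal sketch of the substitution-cascade technique: each rule pads a
# token boundary with spaces, and the final split produces the tokens.
# Subset of rules, in application order (ONE_SPACE must run last).
RULES = [
    (re.compile(r'([،;؛¿!"\])}»›”؟¡%٪°±©®।॥…])'), r" \1 "),  # FUNKY_PUNCT_1
    (re.compile(r"\?(?!\S)"), r" ? "),                        # URL_FOE_2
    (re.compile(r"(?<!,)([,،])(?![,\d])"), r" \1 "),          # COMMA_IN_NUM
    (re.compile(r"(?<!\.)\.$"), r" ."),                       # FINAL_PERIOD_1
    (re.compile(r" {2,}"), " "),                              # ONE_SPACE
]


def mini_tokenize(text, return_str=False):
    """Apply each (regex, replacement) rule in order, then split on spaces."""
    for regexp, substitution in RULES:
        text = regexp.sub(substitution, text)
    text = text.strip()
    return text if return_str else text.split()
```

Note how the negative lookarounds keep `9.5` and `525,600` intact: the comma rule only fires when the comma is not followed by a digit, and the period rule only fires at end of line.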