r"""
Simple Tokenizers

These tokenizers divide strings into substrings using the string
``split()`` method.
When tokenizing using a particular delimiter string, use
the string ``split()`` method directly, as this is more efficient.

The simple tokenizers are *not* available as separate functions;
instead, you should just use the string ``split()`` method directly:

    >>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
    >>> s.split() # doctest: +NORMALIZE_WHITESPACE
    ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
    'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
    >>> s.split(' ') # doctest: +NORMALIZE_WHITESPACE
    ['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
    'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
    >>> s.split('\n') # doctest: +NORMALIZE_WHITESPACE
    ['Good muffins cost $3.88', 'in New York.  Please buy me',
    'two of them.', '', 'Thanks.']

The simple tokenizers are mainly useful because they follow the
standard ``TokenizerI`` interface, and so can be used with any code
that expects a tokenizer.  For example, these tokenizers can be used
to specify the tokenization conventions when building a `CorpusReader`.

"""

from nltk.tokenize.api import StringTokenizer, TokenizerI
from nltk.tokenize.util import regexp_span_tokenize, string_span_tokenize


class SpaceTokenizer(StringTokenizer):
    r"""Tokenize a string using the space character as a delimiter,
    which is the same as ``s.split(' ')``.

        >>> from nltk.tokenize import SpaceTokenizer
        >>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
        >>> SpaceTokenizer().tokenize(s) # doctest: +NORMALIZE_WHITESPACE
        ['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
        'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
    """

    _string = " "


class TabTokenizer(StringTokenizer):
    r"""Tokenize a string using the tab character as a delimiter,
    the same as ``s.split('\t')``.

        >>> from nltk.tokenize import TabTokenizer
        >>> TabTokenizer().tokenize('a\tb c\n\t d')
        ['a', 'b c\n', ' d']
    """

    _string = "\t"


class CharTokenizer(StringTokenizer):
    """Tokenize a string into individual characters.  If this functionality
    is ever required directly, use ``for char in string``.
    """

    def tokenize(self, s):
        return list(s)

    def span_tokenize(self, s):
        # Each character occupies the span (i, i + 1).
        yield from enumerate(range(1, len(s) + 1))


class LineTokenizer(TokenizerI):
    r"""Tokenize a string into its lines, optionally discarding blank lines.
    This is similar to ``s.split('\n')``.

        >>> from nltk.tokenize import LineTokenizer
        >>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
        >>> LineTokenizer(blanklines='keep').tokenize(s) # doctest: +NORMALIZE_WHITESPACE
        ['Good muffins cost $3.88', 'in New York.  Please buy me',
        'two of them.', '', 'Thanks.']
        >>> # same as [l for l in s.split('\n') if l.strip()]:
        >>> LineTokenizer(blanklines='discard').tokenize(s) # doctest: +NORMALIZE_WHITESPACE
        ['Good muffins cost $3.88', 'in New York.  Please buy me',
        'two of them.', 'Thanks.']

    :param blanklines: Indicates how blank lines should be handled.  Valid values are:

        - ``discard``: strip blank lines out of the token list before returning it.
          A line is considered blank if it contains only whitespace characters.
        - ``keep``: leave all blank lines in the token list.
        - ``discard-eof``: if the string ends with a newline, then do not generate
          a corresponding token ``''`` after that newline.
    """
    def __init__(self, blanklines="discard"):
        valid_blanklines = ("discard", "keep", "discard-eof")
        if blanklines not in valid_blanklines:
            raise ValueError(
                "Blank lines must be one of: %s" % " ".join(valid_blanklines)
            )

        self._blanklines = blanklines

    def tokenize(self, s):
        lines = s.splitlines()
        # If requested, strip off blank lines.
        if self._blanklines == "discard":
            lines = [l for l in lines if l.rstrip()]
        elif self._blanklines == "discard-eof":
            if lines and not lines[-1].strip():
                lines.pop()
        return lines

    # Note: the 'discard-eof' option is not handled here.
    def span_tokenize(self, s):
        if self._blanklines == "keep":
            yield from string_span_tokenize(s, r"\n")
        else:
            yield from regexp_span_tokenize(s, r"\n(\s+\n)*")


def line_tokenize(text, blanklines="discard"):
    """Tokenize *text* into lines using ``LineTokenizer(blanklines)``."""
    return LineTokenizer(blanklines).tokenize(text)