"""
Multi-Word Expression Tokenizer

A ``MWETokenizer`` takes a string which has already been divided into tokens and
retokenizes it, merging multi-word expressions into single tokens, using a lexicon
of MWEs:

    >>> from nltk.tokenize import MWETokenizer

    >>> tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')])
    >>> tokenizer.add_mwe(('in', 'spite', 'of'))

    >>> tokenizer.tokenize('Testing testing testing one two three'.split())
    ['Testing', 'testing', 'testing', 'one', 'two', 'three']

    >>> tokenizer.tokenize('This is a test in spite'.split())
    ['This', 'is', 'a', 'test', 'in', 'spite']

    >>> tokenizer.tokenize('In a little or a little bit or a lot in spite of'.split())
    ['In', 'a_little', 'or', 'a_little_bit', 'or', 'a_lot', 'in_spite_of']

"""

from nltk.tokenize.api import TokenizerI
from nltk.util import Trie


class MWETokenizer(TokenizerI):
    """A tokenizer that processes tokenized text and merges multi-word
    expressions into single tokens.
    """

    def __init__(self, mwes=None, separator="_"):
        """Initialize the multi-word tokenizer with a list of expressions and a
        separator.

        :type mwes: list(list(str))
        :param mwes: A sequence of multi-word expressions to be merged, where
            each MWE is a sequence of strings.
        :type separator: str
        :param separator: String that should be inserted between words in a
            multi-word expression token. (Default is '_')

        """
        if not mwes:
            mwes = []
        self._mwes = Trie(mwes)
        self._separator = separator

    def add_mwe(self, mwe):
        """Add a multi-word expression to the lexicon (stored as a word trie).

        We use ``util.Trie`` to represent the trie. Its form is a dict of dicts.
        The key True marks the end of a valid MWE.

        :param mwe: The multi-word expression we're adding into the word trie
        :type mwe: tuple(str) or list(str)

        :Example:

        >>> tokenizer = MWETokenizer()
        >>> tokenizer.add_mwe(('a', 'b'))
        >>> tokenizer.add_mwe(('a', 'b', 'c'))
        >>> tokenizer.add_mwe(('a', 'x'))
        >>> expected = {'a': {'x': {True: None}, 'b': {True: None, 'c': {True: None}}}}
        >>> tokenizer._mwes == expected
        True

        """
        self._mwes.insert(mwe)

    def tokenize(self, text):
        """Merge known multi-word expressions in an already-tokenized text.

        :param text: A list containing tokenized text
        :type text: list(str)
        :return: A list of the tokenized text with multi-words merged together
        :rtype: list(str)

        :Example:

        >>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
        >>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
        ['An', "hors+d'oeuvre", 'tonight,', 'sir?']

        """
        i = 0
        n = len(text)
        result = []

        while i < n:
            if text[i] in self._mwes:
                # Possible MWE match: walk the trie as far as the input allows,
                # remembering where the longest complete expression ended.
                j = i
                trie = self._mwes
                last_match = -1
                while j < n and text[j] in trie:
                    trie = trie[text[j]]
                    j += 1
                    if Trie.LEAF in trie:
                        # A complete MWE ends here; remember it in case a
                        # longer one fails to materialize.
                        last_match = j
                # Back off to the longest complete expression seen, if any.
                if last_match > -1:
                    j = last_match

                if Trie.LEAF in trie or last_match > -1:
                    # Success: join the matched span into a single token.
                    result.append(self._separator.join(text[i:j]))
                    i = j
                else:
                    # Only a partial match; keep the current token as-is.
                    result.append(text[i])
                    i += 1
            else:
                result.append(text[i])
                i += 1

        return result
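
# A minimal usage sketch (mirrors the doctests above, not part of the
# tokenizer API): when one MWE is a prefix of another, the longest complete
# match wins, and a partial match such as 'in spite' without 'of' is left
# untouched.
if __name__ == "__main__":
    tokenizer = MWETokenizer([("a", "little"), ("a", "little", "bit")])
    tokenizer.add_mwe(("in", "spite", "of"))
    print(tokenizer.tokenize("a little bit in spite of it all".split()))
    # Expected: ['a_little_bit', 'in_spite_of', 'it', 'all']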