from typing import Dict, Iterator, List, Optional, Tuple, Union

from .. import AddedToken, Tokenizer, decoders, pre_tokenizers, trainers
from ..models import BPE
from ..normalizers import BertNormalizer, Lowercase, Sequence, unicode_normalizer_from_str
from .base_tokenizer import BaseTokenizer


class CharBPETokenizer(BaseTokenizer):
    """Original BPE Tokenizer

    Represents the BPE algorithm, as introduced by Rico Sennrich
    (https://arxiv.org/abs/1508.07909)

    The default settings correspond to OpenAI GPT BPE tokenizers and differ from the original
    Sennrich subword-nmt implementation by the following options that you can deactivate:
        - adding a normalizer to clean up the text (deactivate with `bert_normalizer=False`) by:
            * removing any control characters and replacing all whitespaces by the classic one.
            * handling Chinese chars by putting spaces around them.
            * stripping all accents.
        - splitting on punctuation in addition to whitespaces (deactivate it with
          `split_on_whitespace_only=True`)
    """

    def __init__(
        self,
        vocab: Optional[Union[str, Dict[str, int]]] = None,
        merges: Optional[Union[str, Dict[Tuple[int, int], Tuple[int, int]]]] = None,
        unk_token: Union[str, AddedToken] = "<unk>",
        suffix: str = "</w>",
        dropout: Optional[float] = None,
        lowercase: bool = False,
        unicode_normalizer: Optional[str] = None,
        bert_normalizer: bool = True,
        split_on_whitespace_only: bool = False,
    ):
        if vocab is not None and merges is not None:
            tokenizer = Tokenizer(
                BPE(
                    vocab,
                    merges,
                    dropout=dropout,
                    unk_token=str(unk_token),
                    end_of_word_suffix=suffix,
                )
            )
        else:
            tokenizer = Tokenizer(BPE(unk_token=str(unk_token), dropout=dropout, end_of_word_suffix=suffix))

        if tokenizer.token_to_id(str(unk_token)) is not None:
            tokenizer.add_special_tokens([str(unk_token)])

        # Check for Unicode normalization first (before everything else)
        normalizers = []

        if unicode_normalizer:
            normalizers += [unicode_normalizer_from_str(unicode_normalizer)]

        if bert_normalizer:
            normalizers += [BertNormalizer(lowercase=False)]

        if lowercase:
            normalizers += [Lowercase()]

        # Create the normalizer structure
        if len(normalizers) > 0:
            if len(normalizers) > 1:
                tokenizer.normalizer = Sequence(normalizers)
            else:
                tokenizer.normalizer = normalizers[0]

        if split_on_whitespace_only:
            tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
        else:
            tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

        tokenizer.decoder = decoders.BPEDecoder(suffix=suffix)

        parameters = {
            "model": "BPE",
            "unk_token": unk_token,
            "suffix": suffix,
            "dropout": dropout,
            "lowercase": lowercase,
            "unicode_normalizer": unicode_normalizer,
            "bert_normalizer": bert_normalizer,
            "split_on_whitespace_only": split_on_whitespace_only,
        }

        super().__init__(tokenizer, parameters)

    @staticmethod
    def from_file(vocab_filename: str, merges_filename: str, **kwargs):
        vocab, merges = BPE.read_file(vocab_filename, merges_filename)
        return CharBPETokenizer(vocab, merges, **kwargs)

    def train(
        self,
        files: Union[str, List[str]],
        vocab_size: int = 30000,
        min_frequency: int = 2,
        special_tokens: List[Union[str, AddedToken]] = ["<unk>"],
        limit_alphabet: int = 1000,
        initial_alphabet: List[str] = [],
        suffix: Optional[str] = "</w>",
        show_progress: bool = True,
    ):
        """Train the model using the given files"""

        trainer = trainers.BpeTrainer(
            vocab_size=vocab_size,
            min_frequency=min_frequency,
            special_tokens=special_tokens,
            limit_alphabet=limit_alphabet,
            initial_alphabet=initial_alphabet,
            end_of_word_suffix=suffix,
            show_progress=show_progress,
        )
        if isinstance(files, str):
            files = [files]
        self._tokenizer.train(files, trainer=trainer)

    def train_from_iterator(
        self,
        iterator: Union[Iterator[str], Iterator[Iterator[str]]],
        vocab_size: int = 30000,
        min_frequency: int = 2,
        special_tokens: List[Union[str, AddedToken]] = ["<unk>"],
        limit_alphabet: int = 1000,
        initial_alphabet: List[str] = [],
        suffix: Optional[str] = "</w>",
        show_progress: bool = True,
        length: Optional[int] = None,
    ):
        """Train the model using the given iterator"""

        trainer = trainers.BpeTrainer(
            vocab_size=vocab_size,
            min_frequency=min_frequency,
            special_tokens=special_tokens,
            limit_alphabet=limit_alphabet,
            initial_alphabet=initial_alphabet,
            end_of_word_suffix=suffix,
            show_progress=show_progress,
        )
        self._tokenizer.train_from_iterator(
            iterator,
            trainer=trainer,
            length=length,
        )
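
# ---------------------------------------------------------------------------
# Usage sketch (illustrative addition, not part of the upstream module).
# It assumes a plain-text corpus file named "corpus.txt"; that filename, the
# vocab_size value, and the "." output directory are placeholders chosen for
# the example only.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    # Untrained tokenizer with the default GPT-style settings: BERT-style
    # normalization enabled, splitting on whitespace and punctuation.
    tokenizer = CharBPETokenizer()

    # Learn a BPE vocabulary; special_tokens should include the unk_token so
    # it receives an id in the trained vocabulary.
    tokenizer.train(["corpus.txt"], vocab_size=5000, special_tokens=["<unk>"])

    # Tokens carry the "</w>" end-of-word suffix, which the BPEDecoder strips
    # again when decoding.
    encoding = tokenizer.encode("Hello, world!")
    print(encoding.tokens)
    print(tokenizer.decode(encoding.ids))

    # Persist the trained model, then rebuild an equivalent tokenizer from
    # the written vocab.json / merges.txt pair.
    tokenizer.save_model(".")
    reloaded = CharBPETokenizer.from_file("./vocab.json", "./merges.txt")
    print(reloaded.encode("Hello, world!").tokens)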