L iddlmZmZmZmZmZmZddlmZm Z m Z m Z m Z m Z ddlmZddlmZmZmZddlmZGddeZy ) )DictIteratorListOptionalTupleUnion) AddedToken Tokenizerdecoderspre_tokenizers processorstrainers)BPE) LowercaseSequenceunicode_normalizer_from_str) BaseTokenizercleZdZdZ ddeeeeeeffdeeee e eeffde de dee deed eed eed e ffd Z ed edefdZdddgfdeee efdedede de eeeff dZdddgdfdeeeeeefdedede de eeefdeef dZxZS)ByteLevelBPETokenizerzjByteLevelBPETokenizer Represents a Byte-level BPE as introduced by OpenAI with their GPT-2 model Nvocabmergesadd_prefix_space lowercasedropoutunicode_normalizercontinuing_subword_prefixend_of_word_suffix trim_offsetsc |$|"tt||||xsd|xsd} ntt} g} |r| t|gz } |r| tgz } t | dkDr)t | dkDrt | | _n | d| _tj|| _ tj| _ tj| | _ d||||||| d} t |=| | y) N)rrrrr)r)r ByteLevelBPE)modelrrrrrrr)r rrrlenr normalizerr ByteLevel pre_tokenizerr decoderr post_processorsuper__init__)selfrrrrrrrrr tokenizer normalizers parameters __class__s o/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/tokenizers/implementations/byte_level_bpe.pyr+zByteLevelBPETokenizer.__init__s  !3!#.G.M2'9'?R I"#%(I  78JKL LK  IK= (K { a ;!#'/ '< $'21~ $"0":":L\"] $..0 #-#7#7\#R  $ 0""4)B"4(   J/vocab_filenamemerges_filenamec Ntj||\}}t||fi|S)N)r read_filer)r3r4kwargsrrs r1 from_filezByteLevelBPETokenizer.from_fileJs( noF v$UF=f==r2i0uTfiles vocab_size min_frequency show_progressspecial_tokensctj||||tjj }t |t r|g}|jj||y)z%Train the model using the given filesr;r<r=r>initial_alphabet)trainerN) r BpeTrainerr r&alphabet isinstancestr _tokenizertrain)r,r:r;r<r=r>rBs r1rHzByteLevelBPETokenizer.trainOs\%%!'')+55>>@   eS !GE eW5r2iteratorlengthctj||||tjj }|j j |||y)z(Train the model using the given iteratorr@)rBrJN)rrCr r&rDrGtrain_from_iterator)r,rIr;r<r=r>rJrBs r1rLz)ByteLevelBPETokenizer.train_from_iteratordsT%%!'')+55>>@   ++  , r2) NNFFNNNNF)__name__ __module__ __qualname____doc__rrrFrintrrboolfloatr+ staticmethodr8r rHrrL __classcell__)r0s@r1rr s7;>B!&#',037,0"80c4S>12380sDsCx$99:;80 80  80 % 80%SM80$,C=80%SM8080t>#>>> "79 6S$s)^$66 6  6 U3 ?34 60 "79 $  x '>>?      U3 ?34    r2rN)typingrrrrrr tokenizersr r r r r rtokenizers.modelsrtokenizers.normalizersrrrbase_tokenizerrrr2r1r\s+??\\!SS)p Mp r2