from typing import Dict, Iterator, List, Optional, Union

from tokenizers import AddedToken, Tokenizer, decoders, trainers
from tokenizers.models import WordPiece
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer
from tokenizers.processors import BertProcessing

from .base_tokenizer import BaseTokenizer


class BertWordPieceTokenizer(BaseTokenizer):
    """Bert WordPiece Tokenizer"""

    def __init__(
        self,
        vocab: Optional[Union[str, Dict[str, int]]] = None,
        unk_token: Union[str, AddedToken] = "[UNK]",
        sep_token: Union[str, AddedToken] = "[SEP]",
        cls_token: Union[str, AddedToken] = "[CLS]",
        pad_token: Union[str, AddedToken] = "[PAD]",
        mask_token: Union[str, AddedToken] = "[MASK]",
        clean_text: bool = True,
        handle_chinese_chars: bool = True,
        strip_accents: Optional[bool] = None,
        lowercase: bool = True,
        wordpieces_prefix: str = "##",
    ):
        if vocab is not None:
            tokenizer = Tokenizer(WordPiece(vocab, unk_token=str(unk_token)))
        else:
            tokenizer = Tokenizer(WordPiece(unk_token=str(unk_token)))

        # Let the tokenizer know about special tokens if they are part of the vocab
        if tokenizer.token_to_id(str(unk_token)) is not None:
            tokenizer.add_special_tokens([str(unk_token)])
        if tokenizer.token_to_id(str(sep_token)) is not None:
            tokenizer.add_special_tokens([str(sep_token)])
        if tokenizer.token_to_id(str(cls_token)) is not None:
            tokenizer.add_special_tokens([str(cls_token)])
        if tokenizer.token_to_id(str(pad_token)) is not None:
            tokenizer.add_special_tokens([str(pad_token)])
        if tokenizer.token_to_id(str(mask_token)) is not None:
            tokenizer.add_special_tokens([str(mask_token)])

        tokenizer.normalizer = BertNormalizer(
            clean_text=clean_text,
            handle_chinese_chars=handle_chinese_chars,
            strip_accents=strip_accents,
            lowercase=lowercase,
        )
        tokenizer.pre_tokenizer = BertPreTokenizer()

        if vocab is not None:
            sep_token_id = tokenizer.token_to_id(str(sep_token))
            if sep_token_id is None:
                raise TypeError("sep_token not found in the vocabulary")
            cls_token_id = tokenizer.token_to_id(str(cls_token))
            if cls_token_id is None:
                raise TypeError("cls_token not found in the vocabulary")

            tokenizer.post_processor = BertProcessing((str(sep_token), sep_token_id), (str(cls_token), cls_token_id))
        tokenizer.decoder = decoders.WordPiece(prefix=wordpieces_prefix)

        parameters = {
            "model": "BertWordPiece",
            "unk_token": unk_token,
            "sep_token": sep_token,
            "cls_token": cls_token,
            "pad_token": pad_token,
            "mask_token": mask_token,
            "clean_text": clean_text,
            "handle_chinese_chars": handle_chinese_chars,
            "strip_accents": strip_accents,
            "lowercase": lowercase,
            "wordpieces_prefix": wordpieces_prefix,
        }

        super().__init__(tokenizer, parameters)

    @staticmethod
    def from_file(vocab: str, **kwargs):
        vocab = WordPiece.read_file(vocab)
        return BertWordPieceTokenizer(vocab, **kwargs)

    def train(
        self,
        files: Union[str, List[str]],
        vocab_size: int = 30000,
        min_frequency: int = 2,
        limit_alphabet: int = 1000,
        initial_alphabet: List[str] = [],
        special_tokens: List[Union[str, AddedToken]] = [
            "[PAD]",
            "[UNK]",
            "[CLS]",
            "[SEP]",
            "[MASK]",
        ],
        show_progress: bool = True,
        wordpieces_prefix: str = "##",
    ):
        """Train the model using the given files"""

        trainer = trainers.WordPieceTrainer(
            vocab_size=vocab_size,
            min_frequency=min_frequency,
            limit_alphabet=limit_alphabet,
            initial_alphabet=initial_alphabet,
            special_tokens=special_tokens,
            show_progress=show_progress,
            continuing_subword_prefix=wordpieces_prefix,
        )
        if isinstance(files, str):
            files = [files]
        self._tokenizer.train(files, trainer=trainer)

    def train_from_iterator(
        self,
        iterator: Union[Iterator[str], Iterator[Iterator[str]]],
        vocab_size: int = 30000,
        min_frequency: int = 2,
        limit_alphabet: int = 1000,
        initial_alphabet: List[str] = [],
        special_tokens: List[Union[str, AddedToken]] = [
            "[PAD]",
            "[UNK]",
            "[CLS]",
            "[SEP]",
            "[MASK]",
        ],
        show_progress: bool = True,
        wordpieces_prefix: str = "##",
        length: Optional[int] = None,
    ):
        """Train the model using the given iterator"""

        trainer = trainers.WordPieceTrainer(
            vocab_size=vocab_size,
            min_frequency=min_frequency,
            limit_alphabet=limit_alphabet,
            initial_alphabet=initial_alphabet,
            special_tokens=special_tokens,
            show_progress=show_progress,
            continuing_subword_prefix=wordpieces_prefix,
        )
        self._tokenizer.train_from_iterator(
            iterator,
            trainer=trainer,
            length=length,
        )