# coding=utf-8
# Tokenization classes for fast tokenizers (backed by HuggingFace's `tokenizers` library).
# For the slow (pure Python) tokenizers see `tokenization_utils.py`.
import copy
import json
import os
from collections import defaultdict
from collections.abc import Iterable
from typing import Any, Optional, Union

import tokenizers.pre_tokenizers as pre_tokenizers_fast
from tokenizers import Encoding as EncodingFast
from tokenizers import Tokenizer as TokenizerFast
from tokenizers.decoders import Decoder as DecoderFast
from tokenizers.trainers import BpeTrainer, UnigramTrainer, WordLevelTrainer, WordPieceTrainer

from .integrations.ggml import convert_gguf_tokenizer
from .modeling_gguf_pytorch_utils import load_gguf_checkpoint
from .tokenization_utils import PreTrainedTokenizer
from .tokenization_utils_base import (
    INIT_TOKENIZER_DOCSTRING,
    AddedToken,
    BatchEncoding,
    PreTokenizedInput,
    PreTokenizedInputPair,
    PreTrainedTokenizerBase,
    SpecialTokensMixin,
    TextInput,
    TextInputPair,
    TruncationStrategy,
)
from .utils import PaddingStrategy, add_end_docstrings, logging


logger = logging.get_logger(__name__)

# Fast tokenizers (provided by HuggingFace's `tokenizers` library) can be saved in a single file.
TOKENIZER_FILE = "tokenizer.json"
SPECIAL_TOKENS_MAP_FILE = "special_tokens_map.json"
TOKENIZER_CONFIG_FILE = "tokenizer_config.json"
TIKTOKEN_VOCAB_FILE = "tokenizer.model"

# Slow tokenizers have an additional added-tokens file.
ADDED_TOKENS_FILE = "added_tokens.json"

INIT_TOKENIZER_DOCSTRING += """
        tokenizer_object ([`tokenizers.Tokenizer`]):
            A [`tokenizers.Tokenizer`] object from 🤗 tokenizers to instantiate from.
        tokenizer_file ([`str`]):
            A path to a local JSON file representing a previously serialized [`tokenizers.Tokenizer`] object from
            🤗 tokenizers.
"""

MODEL_TO_TRAINER_MAPPING = {
    "BPE": BpeTrainer,
    "Unigram": UnigramTrainer,
    "WordLevel": WordLevelTrainer,
    "WordPiece": WordPieceTrainer,
}

VOCAB_FILES_NAMES = {"tokenizer_file": TOKENIZER_FILE, "vocab_file": TIKTOKEN_VOCAB_FILE}


@add_end_docstrings(INIT_TOKENIZER_DOCSTRING)
class PreTrainedTokenizerFast(PreTrainedTokenizerBase):
    """
    Base class for all fast tokenizers (wrapping HuggingFace tokenizers library).

    Inherits from [`~tokenization_utils_base.PreTrainedTokenizerBase`].

    Handles all the shared methods for tokenization and special tokens, as well as methods for
    downloading/caching/loading pretrained tokenizers, as well as adding tokens to the vocabulary.

    This class also contains the added tokens in a unified way on top of all tokenizers so we don't have to handle
    the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE,
    sentencepiece...).
    """

    vocab_files_names = VOCAB_FILES_NAMES
    slow_tokenizer_class: Optional[type[PreTrainedTokenizer]] = None

    def __init__(self, *args, **kwargs):
        tokenizer_object = kwargs.pop("tokenizer_object", None)
        slow_tokenizer = kwargs.pop("__slow_tokenizer", None)
        gguf_file = kwargs.pop("gguf_file", None)
        fast_tokenizer_file = kwargs.pop("tokenizer_file", None)
        from_slow = kwargs.pop("from_slow", False)
        added_tokens_decoder = kwargs.pop("added_tokens_decoder", {})
        self.add_prefix_space = kwargs.get("add_prefix_space", False)

        if from_slow and slow_tokenizer is None and self.slow_tokenizer_class is None:
            raise ValueError(
                "Cannot instantiate this tokenizer from a slow version. If it's based on sentencepiece, make sure "
                "you have sentencepiece installed."
            )

        if tokenizer_object is not None:
            fast_tokenizer = copy.deepcopy(tokenizer_object)
        elif fast_tokenizer_file is not None and not from_slow:
            # We have a serialization from the `tokenizers` library, load it directly.
            fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
        elif slow_tokenizer:
            # We need to convert a slow tokenizer to build the backend (deferred import).
            from .convert_slow_tokenizer import convert_slow_tokenizer

            fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
        elif gguf_file is not None:
            # Extract the vocabulary and tokenizer configuration stored in a GGUF checkpoint.
            gguf_param = load_gguf_checkpoint(kwargs.get("vocab_file"))
            architecture = gguf_param["config"]["model_type"]
            tokenizer_dict = gguf_param["tokenizer"]
            tokenizer_config = gguf_param["tokenizer_config"]
            fast_tokenizer, additional_kwargs = convert_gguf_tokenizer(architecture, tokenizer_dict)
            kwargs.update(tokenizer_config)
            if len(additional_kwargs) > 0:
                kwargs.update(additional_kwargs)
        elif self.slow_tokenizer_class is not None and slow_tokenizer is not False:
            # We need to create and convert a slow tokenizer to build the backend.
            from .convert_slow_tokenizer import convert_slow_tokenizer

            slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
            fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
        elif not slow_tokenizer:
            # Loading with sentencepiece failed: try converting from a tiktoken vocabulary.
            from .convert_slow_tokenizer import convert_slow_tokenizer

            self.vocab_file = kwargs.get("vocab_file", None)
            self.additional_special_tokens = kwargs.get("additional_special_tokens", [])
            fast_tokenizer = convert_slow_tokenizer(self, from_tiktoken=True)
            slow_tokenizer = None
        else:
            raise ValueError(
                "Couldn't instantiate the backend tokenizer from one of: "
                "(1) a `tokenizers` library serialization file, "
                "(2) a slow tokenizer instance to convert or "
                "(3) an equivalent slow tokenizer class to instantiate and convert. "
                "You need to have sentencepiece or tiktoken installed to convert a slow tokenizer to a fast one."
            )

        self._tokenizer = fast_tokenizer

        if slow_tokenizer is not None:
            kwargs.update(slow_tokenizer.init_kwargs)

        self._decode_use_source_tokenizer = False

        # Mirror the backend's truncation / padding state into the `transformers`-level kwargs.
        _truncation = self._tokenizer.truncation
        if _truncation is not None:
            self._tokenizer.enable_truncation(**_truncation)
            kwargs.setdefault("model_max_length", _truncation["max_length"])
            kwargs.setdefault("truncation_side", _truncation["direction"])
            kwargs.setdefault("stride", _truncation["stride"])
            kwargs.setdefault("truncation_strategy", _truncation["strategy"])
        else:
            self._tokenizer.no_truncation()

        _padding = self._tokenizer.padding
        if _padding is not None:
            self._tokenizer.enable_padding(**_padding)
            kwargs.setdefault("pad_token", _padding["pad_token"])
            kwargs.setdefault("pad_token_type_id", _padding["pad_type_id"])
            kwargs.setdefault("padding_side", _padding["direction"])
            kwargs.setdefault("max_length", _padding["length"])
            kwargs.setdefault("pad_to_multiple_of", _padding["pad_to_multiple_of"])

        # We call this after having initialized the backend tokenizer because we update it afterwards.
        super().__init__(**kwargs)
        self._tokenizer.encode_special_tokens = self.split_special_tokens

        # Add any token from `added_tokens_decoder` (and the extended special tokens) that the backend
        # does not know about yet, preserving the `special` flag.
        added_tokens_decoder_hash = {hash(repr(token)) for token in self.added_tokens_decoder.values()}
        tokens_to_add = [
            token
            for index, token in sorted(added_tokens_decoder.items(), key=lambda x: x[0])
            if hash(repr(token)) not in added_tokens_decoder_hash
        ]
        encoder = list(self.added_tokens_encoder.keys()) + [str(token) for token in tokens_to_add]
        tokens_to_add += [
            token for token in self.all_special_tokens_extended if token not in encoder and token not in tokens_to_add
        ]
        if len(tokens_to_add) > 0:
            tokens = []
            special_tokens = self.all_special_tokens
            for token in tokens_to_add:
                is_special = (
                    (token.special or str(token) in special_tokens)
                    if isinstance(token, AddedToken)
                    else str(token) in special_tokens
                )
                if isinstance(token, str):
                    token = AddedToken(token, special=is_special)
                else:
                    token.special = is_special
                tokens.append(token)
            if tokens:
                self.add_tokens(tokens)

        # Keep the pre-tokenizer's `add_prefix_space` in sync with the tokenizer attribute, when possible.
        try:
            pre_tok_state = json.loads(self.backend_tokenizer.pre_tokenizer.__getstate__())
            if pre_tok_state.get("add_prefix_space", self.add_prefix_space) != self.add_prefix_space:
                pre_tok_class = getattr(pre_tokenizers_fast, pre_tok_state.pop("type"))
                pre_tok_state["add_prefix_space"] = self.add_prefix_space
                self.backend_tokenizer.pre_tokenizer = pre_tok_class(**pre_tok_state)
        except Exception:
            # There may be no pre-tokenizer, or a custom one that cannot be serialized; in that case there is
            # nothing to update.
            pass

    @property
    def is_fast(self) -> bool:
        return True

    @property
    def can_save_slow_tokenizer(self) -> bool:
        return True

    @property
    def vocab_size(self) -> int:
        """
        `int`: Size of the base vocabulary (without the added tokens).
        """
        return self._tokenizer.get_vocab_size(with_added_tokens=False)

    def get_vocab(self) -> dict[str, int]:
        return self._tokenizer.get_vocab(with_added_tokens=True)

    @property
    def vocab(self) -> dict[str, int]:
        return self.get_vocab()

    @property
    def added_tokens_encoder(self) -> dict[str, int]:
        """
        Returns the sorted mapping from string to index. The added tokens encoder is cached for performance
        optimisation in `self._added_tokens_encoder` for the slow tokenizers.
        """
        return {k.content: v for v, k in sorted(self.added_tokens_decoder.items(), key=lambda item: item[0])}
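    # Usage sketch for the constructor above: the backend can come either from an in-memory
    # `tokenizers.Tokenizer` (`tokenizer_object=...`) or from a serialized `tokenizer.json`
    # (`tokenizer_file=...`). The file paths below are placeholders, not files shipped with the library.
    #
    #   >>> from tokenizers import Tokenizer
    #   >>> from transformers import PreTrainedTokenizerFast
    #   >>> backend = Tokenizer.from_file("path/to/tokenizer.json")
    #   >>> fast = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]")
    #   >>> fast_from_file = PreTrainedTokenizerFast(tokenizer_file="path/to/tokenizer.json")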
    @property
    def added_tokens_decoder(self) -> dict[int, AddedToken]:
        """
        Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.

        Returns:
            `dict[str, int]`: The added tokens.
        """
        return self._tokenizer.get_added_tokens_decoder()

    def get_added_vocab(self) -> dict[str, int]:
        """
        Returns the added tokens in the vocabulary as a dictionary of token to index.

        Returns:
            `dict[str, int]`: The added tokens.
        """
        return {k.content: v for v, k in sorted(self.added_tokens_decoder.items(), key=lambda item: item[0])}

    def __bool__(self) -> bool:
        """
        Returns True, to avoid expensive `assert tokenizer` gotchas.
        """
        return True

    def __len__(self) -> int:
        """
        Size of the full vocabulary with the added tokens.
        """
        return self._tokenizer.get_vocab_size(with_added_tokens=True)

    @property
    def backend_tokenizer(self) -> TokenizerFast:
        """
        `tokenizers.implementations.BaseTokenizer`: The Rust tokenizer used as a backend.
        """
        return self._tokenizer

    @property
    def decoder(self) -> DecoderFast:
        """
        `tokenizers.decoders.Decoder`: The Rust decoder for this tokenizer.
        """
        return self._tokenizer.decoder

    def _convert_encoding(
        self,
        encoding: EncodingFast,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
    ) -> tuple[dict[str, Any], list[EncodingFast]]:
        """
        Convert the encoding representation (from low-level HuggingFace tokenizer output) to a python Dict and a list
        of encodings, take care of building a batch from overflowing tokens.

        Overflowing tokens are converted to additional examples (like batches) so the output values of the dict are
        lists (overflows) of lists (tokens).

        Output shape: (overflows, sequence length)
        """
        if return_token_type_ids is None:
            return_token_type_ids = "token_type_ids" in self.model_input_names
        if return_attention_mask is None:
            return_attention_mask = "attention_mask" in self.model_input_names

        if return_overflowing_tokens and encoding.overflowing is not None:
            encodings = [encoding] + encoding.overflowing
        else:
            encodings = [encoding]

        encoding_dict = defaultdict(list)
        for e in encodings:
            encoding_dict["input_ids"].append(e.ids)

            if return_token_type_ids:
                encoding_dict["token_type_ids"].append(e.type_ids)
            if return_attention_mask:
                encoding_dict["attention_mask"].append(e.attention_mask)
            if return_special_tokens_mask:
                encoding_dict["special_tokens_mask"].append(e.special_tokens_mask)
            if return_offsets_mapping:
                encoding_dict["offset_mapping"].append(e.offsets)
            if return_length:
                encoding_dict["length"].append(len(e.ids))

        return encoding_dict, encodings

    def _convert_token_to_id_with_added_voc(self, token: str) -> int:
        index = self._tokenizer.token_to_id(token)
        if index is None:
            return self.unk_token_id
        return index

    def _convert_id_to_token(self, index: int) -> Optional[str]:
        return self._tokenizer.id_to_token(int(index))

    def _add_tokens(self, new_tokens: list[Union[str, AddedToken]], special_tokens: bool = False) -> int:
        if special_tokens:
            return self._tokenizer.add_special_tokens(new_tokens)
        return self._tokenizer.add_tokens(new_tokens)

    def num_special_tokens_to_add(self, pair: bool = False) -> int:
        """
        Returns the number of added tokens when encoding a sequence with special tokens.

        This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put
        this inside your training loop.

        Args:
            pair (`bool`, *optional*, defaults to `False`):
                Whether the number of added tokens should be computed in the case of a sequence pair or a single
                sequence.

        Returns:
            `int`: Number of special tokens added to sequences.
        """
        return self._tokenizer.num_special_tokens_to_add(pair)

    def convert_ids_to_tokens(
        self, ids: Union[int, list[int]], skip_special_tokens: bool = False
    ) -> Union[str, list[str]]:
        """
        Converts a single index or a sequence of indices in a token or a sequence of tokens, using the vocabulary and
        added tokens.

        Args:
            ids (`int` or `list[int]`):
                The token id (or token ids) to convert to tokens.
            skip_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not to remove special tokens in the decoding.

        Returns:
            `str` or `list[str]`: The decoded token(s).
        """
        if isinstance(ids, int):
            return self._tokenizer.id_to_token(ids)
        tokens = []
        ids_to_skip = set(self.all_special_ids) if skip_special_tokens else set()
        for index in ids:
            index = int(index)
            if index in ids_to_skip:
                continue
            tokens.append(self._tokenizer.id_to_token(index))
        return tokens
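    # Usage sketch for the conversion helpers above. The checkpoint name is only an example and the
    # shown outputs are illustrative; they depend on the tokenizer actually loaded.
    #
    #   >>> tok = PreTrainedTokenizerFast.from_pretrained("bert-base-uncased")
    #   >>> ids = tok.convert_tokens_to_ids(["hello", "world"])
    #   >>> tok.convert_ids_to_tokens(ids)
    #   ['hello', 'world']
    #   >>> tok.add_tokens(["<new_token>"])  # routed through `_add_tokens` on the Rust backend
    #   1
    #   >>> "<new_token>" in tok.get_added_vocab()
    #   True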
    def tokenize(self, text: str, pair: Optional[str] = None, add_special_tokens: bool = False, **kwargs) -> list[str]:
        return self.encode_plus(text=text, text_pair=pair, add_special_tokens=add_special_tokens, **kwargs).tokens()

    def set_truncation_and_padding(
        self,
        padding_strategy: PaddingStrategy,
        truncation_strategy: TruncationStrategy,
        max_length: int,
        stride: int,
        pad_to_multiple_of: Optional[int],
        padding_side: Optional[str],
    ):
        """
        Define the truncation and the padding strategies for fast tokenizers (provided by HuggingFace tokenizers
        library) and restore the tokenizer settings afterwards.

        The provided tokenizer has no padding / truncation strategy before the managed section. If your tokenizer set
        a padding / truncation strategy before, then it will be reset to no padding / truncation when exiting the
        managed section.

        Args:
            padding_strategy ([`~utils.PaddingStrategy`]):
                The kind of padding that will be applied to the input
            truncation_strategy ([`~tokenization_utils_base.TruncationStrategy`]):
                The kind of truncation that will be applied to the input
            max_length (`int`):
                The maximum size of a sequence.
            stride (`int`):
                The stride to use when handling overflow.
            pad_to_multiple_of (`int`, *optional*):
                If set will pad the sequence to a multiple of the provided value. This is especially useful to enable
                the use of Tensor Cores on NVIDIA hardware with compute capability `>= 7.5` (Volta).
            padding_side (`str`, *optional*):
                The side on which the model should have padding applied. Should be selected between ['right', 'left'].
                Default value is picked from the class attribute of the same name.
        """
        _truncation = self._tokenizer.truncation
        _padding = self._tokenizer.padding

        # Set truncation and padding on the backend tokenizer
        if truncation_strategy == TruncationStrategy.DO_NOT_TRUNCATE:
            if _truncation is not None:
                self._tokenizer.no_truncation()
        else:
            target = {
                "max_length": max_length,
                "stride": stride,
                "strategy": truncation_strategy.value,
                "direction": self.truncation_side,
            }
            # The backend state may contain more keys than the target; only compare the keys we set,
            # and only call `enable_truncation` when something actually changed.
            if _truncation is None:
                current = None
            else:
                current = {k: _truncation.get(k, None) for k in target}
            if current != target:
                self._tokenizer.enable_truncation(**target)

        if padding_strategy == PaddingStrategy.DO_NOT_PAD:
            if _padding is not None:
                self._tokenizer.no_padding()
        else:
            length = max_length if padding_strategy == PaddingStrategy.MAX_LENGTH else None
            target = {
                "length": length,
                "direction": padding_side if padding_side is not None else self.padding_side,
                "pad_id": self.pad_token_id,
                "pad_token": self.pad_token,
                "pad_type_id": self.pad_token_type_id,
                "pad_to_multiple_of": pad_to_multiple_of,
            }
            if _padding != target:
                self._tokenizer.enable_padding(**target)

    def _batch_encode_plus(
        self,
        batch_text_or_text_pairs: Union[
            list[TextInput], list[TextInputPair], list[PreTokenizedInput], list[PreTokenizedInputPair]
        ],
        add_special_tokens: bool = True,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
        max_length: Optional[int] = None,
        stride: int = 0,
        is_split_into_words: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        padding_side: Optional[str] = None,
        return_tensors: Optional[str] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        split_special_tokens: bool = False,
    ) -> BatchEncoding:
        if not isinstance(batch_text_or_text_pairs, (tuple, list)):
            raise TypeError(
                f"batch_text_or_text_pairs has to be a list or a tuple (got {type(batch_text_or_text_pairs)})"
            )

        # Set the truncation and padding strategy on the backend tokenizer.
        self.set_truncation_and_padding(
            padding_strategy=padding_strategy,
            truncation_strategy=truncation_strategy,
            max_length=max_length,
            stride=stride,
            pad_to_multiple_of=pad_to_multiple_of,
            padding_side=padding_side,
        )

        if self._tokenizer.encode_special_tokens != split_special_tokens:
            self._tokenizer.encode_special_tokens = split_special_tokens

        encodings = self._tokenizer.encode_batch(
            batch_text_or_text_pairs,
            add_special_tokens=add_special_tokens,
            is_pretokenized=is_split_into_words,
        )

        # Convert each low-level encoding to a python dict; overflowing tokens become extra examples.
        tokens_and_encodings = [
            self._convert_encoding(
                encoding=encoding,
                return_token_type_ids=return_token_type_ids,
                return_attention_mask=return_attention_mask,
                return_overflowing_tokens=return_overflowing_tokens,
                return_special_tokens_mask=return_special_tokens_mask,
                return_offsets_mapping=return_offsets_mapping,
                return_length=return_length,
                verbose=verbose,
            )
            for encoding in encodings
        ]

        # Flatten from (batch, overflows, sequence length) to ~ (batch * overflows, sequence length).
        sanitized_tokens = {}
        for key in tokens_and_encodings[0][0]:
            stack = [e for item, _ in tokens_and_encodings for e in item[key]]
            sanitized_tokens[key] = stack
        sanitized_encodings = [e for _, item in tokens_and_encodings for e in item]

        # If returning overflowing tokens, also return a mapping from batch index to original sample.
        if return_overflowing_tokens:
            overflow_to_sample_mapping = []
            for i, (toks, _) in enumerate(tokens_and_encodings):
                overflow_to_sample_mapping += [i] * len(toks["input_ids"])
            sanitized_tokens["overflow_to_sample_mapping"] = overflow_to_sample_mapping

        for input_ids in sanitized_tokens["input_ids"]:
            self._eventual_warn_about_too_long_sequence(input_ids, max_length, verbose)
        return BatchEncoding(sanitized_tokens, sanitized_encodings, tensor_type=return_tensors)

    def _encode_plus(
        self,
        text: Union[TextInput, PreTokenizedInput],
        text_pair: Optional[Union[TextInput, PreTokenizedInput]] = None,
        add_special_tokens: bool = True,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
        max_length: Optional[int] = None,
        stride: int = 0,
        is_split_into_words: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        padding_side: Optional[str] = None,
        return_tensors: Optional[str] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        split_special_tokens: bool = False,
        **kwargs,
    ) -> BatchEncoding:
        batched_input = [(text, text_pair)] if text_pair else [text]
        batched_output = self._batch_encode_plus(
            batched_input,
            is_split_into_words=is_split_into_words,
            add_special_tokens=add_special_tokens,
            padding_strategy=padding_strategy,
            truncation_strategy=truncation_strategy,
            max_length=max_length,
            stride=stride,
            pad_to_multiple_of=pad_to_multiple_of,
            padding_side=padding_side,
            return_tensors=return_tensors,
            return_token_type_ids=return_token_type_ids,
            return_attention_mask=return_attention_mask,
            return_overflowing_tokens=return_overflowing_tokens,
            return_special_tokens_mask=return_special_tokens_mask,
            return_offsets_mapping=return_offsets_mapping,
            return_length=return_length,
            verbose=verbose,
            split_special_tokens=split_special_tokens,
            **kwargs,
        )

        # If no tensors are returned and there are no overflowing tokens, remove the leading batch axis
        # (overflowing tokens are kept as a batch of examples).
        if return_tensors is None and not return_overflowing_tokens:
            batched_output = BatchEncoding(
                {
                    key: (value[0] if len(value) > 0 and isinstance(value[0], list) else value)
                    for key, value in batched_output.items()
                },
                batched_output.encodings,
            )

        self._eventual_warn_about_too_long_sequence(batched_output["input_ids"], max_length, verbose)

        return batched_output

    def convert_tokens_to_string(self, tokens: list[str]) -> str:
        return (
            self.backend_tokenizer.decoder.decode(tokens)
            if self.backend_tokenizer.decoder is not None
            else " ".join(tokens)
        )
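    # Usage sketch for the batching path above: `__call__` (inherited from the base class) ends up in
    # `_batch_encode_plus`, which configures the backend via `set_truncation_and_padding` before calling
    # `encode_batch`. Checkpoint name and shapes are illustrative; `return_tensors="pt"` assumes PyTorch.
    #
    #   >>> tok = PreTrainedTokenizerFast.from_pretrained("bert-base-uncased")
    #   >>> batch = tok(["a short sentence", "a slightly longer sentence"],
    #   ...             padding=True, truncation=True, max_length=8, return_tensors="pt")
    #   >>> batch["input_ids"].shape  # (batch_size, padded_length)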
    def _decode(
        self,
        token_ids: Union[int, list[int]],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: Optional[bool] = None,
        **kwargs,
    ) -> str:
        self._decode_use_source_tokenizer = kwargs.pop("use_source_tokenizer", False)

        if isinstance(token_ids, int):
            token_ids = [token_ids]
        text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)

        clean_up_tokenization_spaces = (
            clean_up_tokenization_spaces
            if clean_up_tokenization_spaces is not None
            else self.clean_up_tokenization_spaces
        )
        if clean_up_tokenization_spaces:
            clean_text = self.clean_up_tokenization(text)
            return clean_text
        else:
            return text

    def _save_pretrained(
        self,
        save_directory: Union[str, os.PathLike],
        file_names: tuple[str],
        legacy_format: Optional[bool] = None,
        filename_prefix: Optional[str] = None,
    ) -> tuple[str]:
        """
        Save a tokenizer using the slow-tokenizer/legacy format: vocabulary + added tokens as well as in a unique JSON
        file containing {config + vocab + added-tokens}.
        """
        save_directory = str(save_directory)

        if self.slow_tokenizer_class is None and legacy_format is True:
            raise ValueError(
                "Your tokenizer does not have a legacy version defined and therefore cannot register this version. "
                "You might consider leaving the legacy_format at `None` or setting it to `False`."
            )

        save_slow = (
            (legacy_format is None or legacy_format is True)
            and self.slow_tokenizer_class is not None
            and self.can_save_slow_tokenizer
        )
        save_fast = legacy_format is None or legacy_format is False

        if save_slow:
            added_tokens_file = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") + ADDED_TOKENS_FILE
            )
            # Only tokens beyond the base vocabulary go into the added-tokens file.
            added_vocab = {tok: index for tok, index in self.added_tokens_encoder.items() if index >= self.vocab_size}
            if added_vocab:
                with open(added_tokens_file, "w", encoding="utf-8") as f:
                    out_str = json.dumps(added_vocab, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
                    f.write(out_str)

            vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
            file_names = file_names + vocab_files + (added_tokens_file,)

        if save_fast:
            tokenizer_file = os.path.join(
                save_directory, (filename_prefix + "-" if filename_prefix else "") + TOKENIZER_FILE
            )
            self.backend_tokenizer.save(tokenizer_file)
            file_names = file_names + (tokenizer_file,)

        return file_names
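    # Usage sketch for `_save_pretrained` above: users call `save_pretrained`, which forwards
    # `legacy_format` here. `legacy_format=False` keeps only the single `tokenizer.json` serialization,
    # while the default also writes the slow-tokenizer vocabulary files when a slow class is available.
    # The directory name is a placeholder.
    #
    #   >>> tok = PreTrainedTokenizerFast.from_pretrained("bert-base-uncased")
    #   >>> tok.save_pretrained("./my-tokenizer", legacy_format=False)  # tokenizer.json + config files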
unk_tokenrLidrcontinuing_subword_prefixend_of_word_suffixrt ByteLevelSequence pretokenizersc3,K|] }|ddk(yw)rMrINrE).0 pretokenizers rG zBPreTrainedTokenizerFast.train_new_from_iterator..Ps"$!(K7sinitial_alphabet)rr)rAtrainerrrzQAttempted to set a token in the post processor that does not exist in the mappingr)clssepr5T) single_wordlstriprstrip normalizedrLr+rE)#rqrrrWto_strrNrPrSfrom_strr1rorextendanyrwrIalphabetMODEL_TO_TRAINER_MAPPINGtrain_from_iteratorrOrrXrQrSPECIAL_TOKENS_ATTRIBUTESremoverv_special_tokens_maprnrTrUrVrWr5rVr)ry text_iteratorrrAnew_special_tokensspecial_tokens_mapr{tokenizer_jsonr@rArErDr3r added_tokenrLr  trainer_classrQtrained_tokenizer_jsonrKrrtoken_id special_tokenspecial_tokens_listspecial_token_fullr5s rGtrain_new_from_iteratorz/PreTrainedTokenizerFast.train_new_from_iteratorsDDOO$:$:$<=%)).9 '++,<= ' "6 *e 3/1N7 #G ,02N7 #H - G $V , 9g&x0<'0:*73G4D3Ew'0 G $V ,0J J/1N7 #G ,Mn]dNeflNmMno>>   *~g66w' 48JJ3EnU\F]^iFj3kN7 #K 0!**4::n+EF ' =K!ooi6Gd+Ag&v.);G!-+i2HL^2^);K 1)*:; vC+,<=cB8LF)5TZ![5"4"8"8"F![![FLN#34S9(C!'#,#8#8#?#+", s#ouCuejIDYDYZ_D`CuN#34S9%@ v"0 F  N2-m?( 2EtU#/ 'e 4 %1mGY6Y$6}$EM%)%=%=%A%A%%N"0*=$.%$6$B$B188188#5#@#@ $ %F5M%2F5M% 2(%)$B$B!  ) % , ,-? @ ( )A -2KF. /t~~CyCFCCq"\Dvs SS)NNFFFFT)F)NF)FN)NN)NNN)A__name__ __module__ __qualname____doc__VOCAB_FILES_NAMESrr)rrMr__annotations__rapropertyboolrrrrdictrkrrrirr/rrrrSrs DecoderFastr EncodingFastrrrhrrrrrrrrrrrrrrrrrrrrrrrr!rPathLiker>rm __classcell__)rs@rGr(r(Qs *@D(4(;#<=Dzx   GCGG A4S>A tCH~  nd38nnn:d3 ?&;::nc3hn$ FF = '''1504*/+0',#-(-( (~-( (~ -( $( -( %) -(!%-(-(-( tCH~tL11 2-(^UE#x}2D,EU%PSUYZ]U^P^J_U  7#7(3-76d5j+A&B6]`6 ?d?s?,GLd3i(?C sDI~ 8uSu uRVumqrumvuI9)I90I9 I9  I9 %SM I9smI9`$(,;,F,F2D2T2T$($),0&*(,0404*/+0',#%*+Y`"' OT-0$7H2I4PeKf f# Y` ! Y` * Y`0Y`SMY`Y`"Y`%SMY`smY`! Y` (~Y` (~Y` $(!Y`"%)#Y`$!%%Y`&'Y`()Y`*#+Y`, -Y`|DH#',;,F,F2D2T2T$($),0&*)-0404*/+0',#%*);I001;E)->">?@;! ; * ; 0 ;SM;;";%SM;sm;!; (~; (~;$(; %)!;"!%#;$%;&';(#);, -;z tCy S %*7; d3i("'/tn  8)-)- /c2;;.//#s(O/ ~ / "# / sCx /j rDrIr()=rqrQrqr collectionsrcollections.abcrtypingrrrtokenizers.pre_tokenizerspre_tokenizersrw tokenizersrrxr rStokenizers.decodersr rwtokenizers.trainersr r r rrintegrations.ggmlrmodeling_gguf_pytorch_utilsrtokenization_utilsrtokenization_utils_baserrrrrrrrrrutilsrrr get_loggerrnloggerr4SPECIAL_TOKENS_MAP_FILETOKENIZER_CONFIG_FILETIKTOKEN_VOCAB_FILEr/r]rrr(rErIrGrs  #$''7/16^^:5=3   @?   H %"3/'( !! (6EXY,-H D5H D.H DrI