import json
import os
from typing import Iterator, List, Optional, Tuple, Union

from tokenizers import AddedToken, Regex, Tokenizer, decoders, normalizers, pre_tokenizers, trainers
from tokenizers.models import Unigram

from .base_tokenizer import BaseTokenizer


class SentencePieceUnigramTokenizer(BaseTokenizer):
    """SentencePiece Unigram Tokenizer

    Represents the Unigram algorithm, with the pretokenization used by SentencePiece
    """

    def __init__(
        self,
        vocab: Optional[List[Tuple[str, float]]] = None,
        replacement: str = "▁",
        add_prefix_space: bool = True,
    ):
        if vocab is not None:
            tokenizer = Tokenizer(Unigram(vocab))
        else:
            tokenizer = Tokenizer(Unigram())

        tokenizer.normalizer = normalizers.Sequence(
            [normalizers.Nmt(), normalizers.NFKC(), normalizers.Replace(Regex(" {2,}"), " ")]
        )
        prepend_scheme = "always" if add_prefix_space else "never"
        tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme)
        tokenizer.decoder = decoders.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme)

        parameters = {
            "model": "SentencePieceUnigram",
            "replacement": replacement,
            "add_prefix_space": add_prefix_space,
        }

        super().__init__(tokenizer, parameters)

    def train(
        self,
        files: Union[str, List[str]],
        vocab_size: int = 8000,
        show_progress: bool = True,
        special_tokens: Optional[List[Union[str, AddedToken]]] = None,
        initial_alphabet: Optional[List[str]] = None,
        unk_token: Optional[str] = None,
    ):
        """
        Train the model using the given files

        Args:
            files (:obj:`List[str]`):
                A list of paths to the files that we should use for training
            vocab_size (:obj:`int`):
                The size of the final vocabulary, including all tokens and alphabet.
            show_progress (:obj:`bool`):
                Whether to show progress bars while training.
            special_tokens (:obj:`List[Union[str, AddedToken]]`, `optional`):
                A list of special tokens the model should know of.
            initial_alphabet (:obj:`List[str]`, `optional`):
                A list of characters to include in the initial alphabet, even
                if not seen in the training dataset.
                If the strings contain more than one character, only the first one
                is kept.
            unk_token (:obj:`str`, `optional`):
                The unknown token to be used by the model.
        """
        if special_tokens is None:
            special_tokens = []

        if initial_alphabet is None:
            initial_alphabet = []

        trainer = trainers.UnigramTrainer(
            vocab_size=vocab_size,
            special_tokens=special_tokens,
            show_progress=show_progress,
            initial_alphabet=initial_alphabet,
            unk_token=unk_token,
        )

        if isinstance(files, str):
            files = [files]
        self._tokenizer.train(files, trainer=trainer)

    def train_from_iterator(
        self,
        iterator: Union[Iterator[str], Iterator[Iterator[str]]],
        vocab_size: int = 8000,
        show_progress: bool = True,
        special_tokens: Optional[List[Union[str, AddedToken]]] = None,
        initial_alphabet: Optional[List[str]] = None,
        unk_token: Optional[str] = None,
        length: Optional[int] = None,
    ):
        """
        Train the model using the given iterator

        Args:
            iterator (:obj:`Union[Iterator[str], Iterator[Iterator[str]]]`):
                Any iterator over strings or list of strings
            vocab_size (:obj:`int`):
                The size of the final vocabulary, including all tokens and alphabet.
            show_progress (:obj:`bool`):
                Whether to show progress bars while training.
            special_tokens (:obj:`List[Union[str, AddedToken]]`, `optional`):
                A list of special tokens the model should know of.
            initial_alphabet (:obj:`List[str]`, `optional`):
                A list of characters to include in the initial alphabet, even
                if not seen in the training dataset.
                If the strings contain more than one character, only the first one
                is kept.
            unk_token (:obj:`str`, `optional`):
                The unknown token to be used by the model.
            length (:obj:`int`, `optional`):
                The total number of sequences in the iterator. This is used to
                provide meaningful progress tracking
        """
        if special_tokens is None:
            special_tokens = []

        if initial_alphabet is None:
            initial_alphabet = []

        trainer = trainers.UnigramTrainer(
            vocab_size=vocab_size,
            special_tokens=special_tokens,
            show_progress=show_progress,
            initial_alphabet=initial_alphabet,
            unk_token=unk_token,
        )

        self._tokenizer.train_from_iterator(
            iterator,
            trainer=trainer,
            length=length,
        )

    @staticmethod
    def from_spm(filename: str):
        try:
            import sys

            sys.path.append(".")

            import sentencepiece_model_pb2 as model
        except Exception:
            raise Exception(
                "You don't seem to have the required protobuf file, in order to use this function you need to run "
                "`pip install protobuf` and `wget https://raw.githubusercontent.com/google/sentencepiece/master/python/src/sentencepiece/sentencepiece_model_pb2.py` "
                "for us to be able to read the intrinsics of your spm_file. "
                "`pip install sentencepiece` is not required."
            )

        m = model.ModelProto()
        m.ParseFromString(open(filename, "rb").read())

        precompiled_charsmap = m.normalizer_spec.precompiled_charsmap
        vocab = [(piece.piece, piece.score) for piece in m.pieces]
        unk_id = m.trainer_spec.unk_id
        model_type = m.trainer_spec.model_type
        byte_fallback = m.trainer_spec.byte_fallback
        if model_type != 1:
            raise Exception(
                "You're trying to run a `Unigram` model but your file was trained with a different algorithm"
            )

        replacement = "▁"
        add_prefix_space = True

        tokenizer = Tokenizer(Unigram(vocab, unk_id, byte_fallback))

        if precompiled_charsmap:
            tokenizer.normalizer = normalizers.Sequence(
                [
                    normalizers.Precompiled(precompiled_charsmap),
                    normalizers.Replace(Regex(" {2,}"), " "),
                ]
            )
        else:
            tokenizer.normalizer = normalizers.Sequence([normalizers.Replace(Regex(" {2,}"), " ")])
        prepend_scheme = "always" if add_prefix_space else "never"
        tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme)
        tokenizer.decoder = decoders.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme)

        parameters = {
            "model": "SentencePieceUnigram",
        }

        obj = BaseTokenizer.__new__(SentencePieceUnigramTokenizer, tokenizer, parameters)
        BaseTokenizer.__init__(obj, tokenizer, parameters)
        return obj
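A minimal pure-Python sketch of the effect of the `Metaspace` pre-tokenization this tokenizer configures (the real implementation lives in the Rust core of `tokenizers`; the function name `metaspace_sketch` is made up for illustration): spaces are replaced by the `▁` (U+2581) replacement character, and with `prepend_scheme="always"` one replacement character is prepended to mark the start of the first word.

```python
# Illustrative sketch only — approximates the visible effect of
# pre_tokenizers.Metaspace(replacement="▁", prepend_scheme=...) on one text.
def metaspace_sketch(text: str, replacement: str = "\u2581", prepend_scheme: str = "always") -> str:
    # Every space becomes the replacement character ("▁" by default).
    out = text.replace(" ", replacement)
    # With prepend_scheme="always", a leading "▁" marks the first word boundary.
    if prepend_scheme == "always" and not out.startswith(replacement):
        out = replacement + out
    return out


print(metaspace_sketch("Hello world"))                          # ▁Hello▁world
print(metaspace_sketch("Hello world", prepend_scheme="never"))  # Hello▁world
```

This is why `add_prefix_space=True` maps to `prepend_scheme="always"` above: it makes a sentence-initial word tokenize the same way as the same word after a space.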