from typing import Dict, List, Optional, Tuple, Union

from tokenizers import AddedToken, EncodeInput, Encoding, InputSequence, Tokenizer
from tokenizers.decoders import Decoder
from tokenizers.models import Model
from tokenizers.normalizers import Normalizer
from tokenizers.pre_tokenizers import PreTokenizer
from tokenizers.processors import PostProcessor

Offsets = Tuple[int, int]


class BaseTokenizer:
    def __init__(self, tokenizer: Tokenizer, parameters=None):
        self._tokenizer = tokenizer
        self._parameters = parameters if parameters is not None else {}

    def __repr__(self):
        return "Tokenizer(vocabulary_size={}, {})".format(
            self._tokenizer.get_vocab_size(),
            ", ".join(k + "=" + str(v) for k, v in self._parameters.items()),
        )

    def num_special_tokens_to_add(self, is_pair: bool) -> int:
        """
        Return the number of special tokens that would be added for single/pair sentences.
        :param is_pair: Boolean indicating if the input would be a single sentence or a pair
        :return:
        """
        return self._tokenizer.num_special_tokens_to_add(is_pair)

    def get_vocab(self, with_added_tokens: bool = True) -> Dict[str, int]:
        """Returns the vocabulary

        Args:
            with_added_tokens: boolean:
                Whether to include the added tokens in the vocabulary

        Returns:
            The vocabulary
        """
        return self._tokenizer.get_vocab(with_added_tokens=with_added_tokens)

    def get_added_tokens_decoder(self) -> Dict[int, AddedToken]:
        """Returns the added reverse vocabulary

        Returns:
            The added vocabulary mapping ints to AddedTokens
        """
        return self._tokenizer.get_added_tokens_decoder()

    def get_vocab_size(self, with_added_tokens: bool = True) -> int:
        """Return the size of vocabulary, with or without added tokens.

        Args:
            with_added_tokens: (`optional`) bool:
                Whether to count in added special tokens or not

        Returns:
            Size of vocabulary
        """
        return self._tokenizer.get_vocab_size(with_added_tokens=with_added_tokens)

    def enable_padding(
        self,
        direction: Optional[str] = "right",
        pad_to_multiple_of: Optional[int] = None,
        pad_id: Optional[int] = 0,
        pad_type_id: Optional[int] = 0,
        pad_token: Optional[str] = "[PAD]",
        length: Optional[int] = None,
    ):
        """Change the padding strategy

        Args:
            direction: (`optional`) str:
                Can be one of: `right` or `left`

            pad_to_multiple_of: (`optional`) unsigned int:
                If specified, the padding length should always snap to the next multiple of
                the given value. For example if we were going to pad with a length of 250 but
                `pad_to_multiple_of=8` then we will pad to 256.

            pad_id: (`optional`) unsigned int:
                The index to be used when padding

            pad_type_id: (`optional`) unsigned int:
                The type index to be used when padding

            pad_token: (`optional`) str:
                The pad token to be used when padding

            length: (`optional`) unsigned int:
                If specified, the length at which to pad. If not specified
                we pad using the size of the longest sequence in a batch.
        """
        return self._tokenizer.enable_padding(
            direction=direction,
            pad_to_multiple_of=pad_to_multiple_of,
            pad_id=pad_id,
            pad_type_id=pad_type_id,
            pad_token=pad_token,
            length=length,
        )

    def no_padding(self):
        """Disable padding"""
        return self._tokenizer.no_padding()

    @property
    def padding(self) -> Optional[dict]:
        """Get the current padding parameters

        Returns:
            None if padding is disabled, a dict with the currently set parameters
            if the padding is enabled.
        """
        return self._tokenizer.padding

    def enable_truncation(self, max_length: int, stride: Optional[int] = 0, strategy: Optional[str] = "longest_first"):
        """Change the truncation options

        Args:
            max_length: unsigned int:
                The maximum length at which to truncate

            stride: (`optional`) unsigned int:
                The length of the previous first sequence to be included
                in the overflowing sequence

            strategy: (`optional`) str:
                Can be one of `longest_first`, `only_first` or `only_second`
        """
        return self._tokenizer.enable_truncation(max_length, stride=stride, strategy=strategy)

    def no_truncation(self):
        """Disable truncation"""
        return self._tokenizer.no_truncation()

    @property
    def truncation(self) -> Optional[dict]:
        """Get the current truncation parameters

        Returns:
            None if truncation is disabled, a dict with the current truncation parameters
            if truncation is enabled
        """
        return self._tokenizer.truncation

    def add_tokens(self, tokens: List[Union[str, AddedToken]]) -> int:
        """Add the given tokens to the vocabulary

        Args:
            tokens: List[Union[str, AddedToken]]:
                A list of tokens to add to the vocabulary. Each token can either be
                a string, or an instance of AddedToken

        Returns:
            The number of tokens that were added to the vocabulary
        """
        return self._tokenizer.add_tokens(tokens)

    def add_special_tokens(self, special_tokens: List[Union[str, AddedToken]]) -> int:
        """Add the given special tokens to the vocabulary, and treat them as special tokens.

        The special tokens will never be processed by the model, and will be
        removed while decoding.

        Args:
            special_tokens: List[Union[str, AddedToken]]:
                A list of special tokens to add to the vocabulary. Each token can either be
                a string, or an instance of AddedToken

        Returns:
            The number of tokens that were added to the vocabulary
        """
        return self._tokenizer.add_special_tokens(special_tokens)

    def normalize(self, sequence: str) -> str:
        """Normalize the given sequence

        Args:
            sequence: str:
                The sequence to normalize

        Returns:
            The normalized string
        """
        return self._tokenizer.normalize(sequence)

    def encode(
        self,
        sequence: InputSequence,
        pair: Optional[InputSequence] = None,
        is_pretokenized: bool = False,
        add_special_tokens: bool = True,
    ) -> Encoding:
        """Encode the given sequence and pair. This method can process raw text sequences
        as well as already pre-tokenized sequences.

        Args:
            sequence: InputSequence:
                The sequence we want to encode. This sequence can be either raw text or
                pre-tokenized, according to the `is_pretokenized` argument:

                - If `is_pretokenized=False`: `InputSequence` is expected to be `str`
                - If `is_pretokenized=True`: `InputSequence` is expected to be
                  `Union[List[str], Tuple[str]]`

            is_pretokenized: bool:
                Whether the input is already pre-tokenized.

            add_special_tokens: bool:
                Whether to add the special tokens while encoding.

        Returns:
            An Encoding
        """
        if sequence is None:
            raise ValueError("encode: `sequence` can't be `None`")

        return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)

    def encode_batch(
        self,
        inputs: List[EncodeInput],
        is_pretokenized: bool = False,
        add_special_tokens: bool = True,
    ) -> List[Encoding]:
        """Encode the given inputs. This method accepts both raw text sequences
        as well as already pre-tokenized sequences.

        Args:
            inputs: List[EncodeInput]:
                A list of single sequences or pair sequences to encode. Each `EncodeInput` is
                expected to be of the following form:
                `Union[InputSequence, Tuple[InputSequence, InputSequence]]`

                Each `InputSequence` can either be raw text or pre-tokenized,
                according to the `is_pretokenized` argument:

                - If `is_pretokenized=False`: `InputSequence` is expected to be `str`
                - If `is_pretokenized=True`: `InputSequence` is expected to be
                  `Union[List[str], Tuple[str]]`

            is_pretokenized: bool:
                Whether the input is already pre-tokenized.

            add_special_tokens: bool:
                Whether to add the special tokens while encoding.

        Returns:
            A list of Encoding
        """
        if inputs is None:
            raise ValueError("encode_batch: `inputs` can't be `None`")

        return self._tokenizer.encode_batch(inputs, is_pretokenized, add_special_tokens)

    async def async_encode_batch(
        self,
        inputs: List[EncodeInput],
        is_pretokenized: bool = False,
        add_special_tokens: bool = True,
    ) -> List[Encoding]:
        """Asynchronously encode a batch (tracks character offsets).

        Args:
            inputs: A list of single or pair sequences to encode.
            is_pretokenized: Whether inputs are already pre-tokenized.
            add_special_tokens: Whether to add special tokens.

        Returns:
            A list of Encoding.
        """
        if inputs is None:
            raise ValueError("async_encode_batch: `inputs` can't be `None`")

        return await self._tokenizer.async_encode_batch(inputs, is_pretokenized, add_special_tokens)

    async def async_encode_batch_fast(
        self,
        inputs: List[EncodeInput],
        is_pretokenized: bool = False,
        add_special_tokens: bool = True,
    ) -> List[Encoding]:
        """Asynchronously encode a batch (no character offsets, faster).

        Args:
            inputs: A list of single or pair sequences to encode.
            is_pretokenized: Whether inputs are already pre-tokenized.
            add_special_tokens: Whether to add special tokens.

        Returns:
            A list of Encoding.
        """
        if inputs is None:
            raise ValueError("async_encode_batch_fast: `inputs` can't be `None`")

        return await self._tokenizer.async_encode_batch_fast(inputs, is_pretokenized, add_special_tokens)
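

# Minimal usage sketch (not part of the library API above): it only exercises the
# wrapper methods defined in this module. The "tokenizer.json" path is a
# hypothetical pre-trained tokenizer file; substitute your own serialized tokenizer.
if __name__ == "__main__":
    # Wrap a serialized Tokenizer in the BaseTokenizer facade, configure
    # truncation and padding, then encode a single sentence and a small batch.
    tok = BaseTokenizer(Tokenizer.from_file("tokenizer.json"))
    tok.enable_truncation(max_length=128)
    tok.enable_padding(pad_id=0, pad_token="[PAD]")

    # A single raw-text sequence returns one Encoding with tokens and ids.
    single = tok.encode("Hello, world!")
    print(single.tokens, single.ids)

    # encode_batch accepts a mix of single sequences and (sequence, pair) tuples.
    batch = tok.encode_batch(["first sentence", ("question", "context")])
    print([len(e.ids) for e in batch])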