import types
import warnings
from typing import Any, Optional, Union, overload

import numpy as np

from ..models.bert.tokenization_bert import BasicTokenizer
from ..utils import ExplicitEnum, add_end_docstrings, is_tf_available, is_torch_available
from .base import ArgumentHandler, ChunkPipeline, Dataset, build_pipeline_init_args


if is_tf_available():
    import tensorflow as tf

    from ..models.auto.modeling_tf_auto import TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES

if is_torch_available():
    import torch

    from ..models.auto.modeling_auto import MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES


class TokenClassificationArgumentHandler(ArgumentHandler):
    """
    Handles arguments for token classification.
    """

    def __call__(self, inputs: Union[str, list[str]], **kwargs):
        is_split_into_words = kwargs.get("is_split_into_words", False)
        delimiter = kwargs.get("delimiter")

        if inputs is not None and isinstance(inputs, (list, tuple)) and len(inputs) > 0:
            inputs = list(inputs)
            batch_size = len(inputs)
        elif isinstance(inputs, str):
            inputs = [inputs]
            batch_size = 1
        elif Dataset is not None and isinstance(inputs, Dataset) or isinstance(inputs, types.GeneratorType):
            return inputs, None, is_split_into_words, delimiter
        else:
            raise ValueError("At least one input is required.")

        offset_mapping = kwargs.get("offset_mapping")
        if offset_mapping:
            if isinstance(offset_mapping, list) and isinstance(offset_mapping[0], tuple):
                offset_mapping = [offset_mapping]
            if len(offset_mapping) != batch_size:
                raise ValueError("offset_mapping should have the same batch size as the input")
        return inputs, offset_mapping, is_split_into_words, delimiter


class AggregationStrategy(ExplicitEnum):
    """All the valid aggregation strategies for TokenClassificationPipeline"""

    NONE = "none"
    SIMPLE = "simple"
    FIRST = "first"
    AVERAGE = "average"
    MAX = "max"
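# Illustrative sketch (comments only, not part of the pipeline API): how the strategies above
# behave when a single word is split into several sub-tokens that the model tags differently.
# Token names and scores are made up for the example.
#
#   tokens          : ["Micro", "##soft"]              (one word, two sub-tokens)
#   raw predictions : "B-ENTERPRISE" (0.6), "B-NAME" (0.9)
#
#   NONE    -> two raw entries, one per token, possibly with conflicting labels
#   SIMPLE  -> adjacent tokens are grouped per label, so the word may still end up split in two
#   FIRST   -> the whole word takes the label of its first token ("B-ENTERPRISE")
#   MAX     -> the whole word takes the label of its highest-scoring token ("B-NAME")
#   AVERAGE -> the token score vectors are averaged first, then the argmax label is applied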
- "max" : (works only on word based models) Will use the `SIMPLE` strategy except that words, cannot end up with different tags. Word entity will simply be the token with the maximum score.ceZdZdZdZdZdZdZdZe ffd Z d)de e de e d e e d e eeeefd e d e ed e efdZedededeeeeffdZedeededeeeeeffdZdeeeefdedeeeeefeeeeeffffd Zd*dZdZe j4dfdZdZ d+dedej<dej<d e eeeefdej<d e de ee ede eeeefdeefdZd eed e deefd!Z d"eed e defd#Z!d"eed e deefd$Z"d"eedefd%Z#d&edeeeffd'Z$d"eedeefd(Z%xZ&S),TokenClassificationPipelineuv Named Entity Recognition pipeline using any `ModelForTokenClassification`. See the [named entity recognition examples](../task_summary#named-entity-recognition) for more information. Example: ```python >>> from transformers import pipeline >>> token_classifier = pipeline(model="Jean-Baptiste/camembert-ner", aggregation_strategy="simple") >>> sentence = "Je m'appelle jean-baptiste et je vis à montréal" >>> tokens = token_classifier(sentence) >>> tokens [{'entity_group': 'PER', 'score': 0.9931, 'word': 'jean-baptiste', 'start': 12, 'end': 26}, {'entity_group': 'LOC', 'score': 0.998, 'word': 'montréal', 'start': 38, 'end': 47}] >>> token = tokens[0] >>> # Start and end provide an easy way to highlight words in the original text. >>> sentence[token["start"] : token["end"]] ' jean-baptiste' >>> # Some models use the same idea to do part of speech. >>> syntaxer = pipeline(model="vblagoje/bert-english-uncased-finetuned-pos", aggregation_strategy="simple") >>> syntaxer("My name is Sarah and I live in London") [{'entity_group': 'PRON', 'score': 0.999, 'word': 'my', 'start': 0, 'end': 2}, {'entity_group': 'NOUN', 'score': 0.997, 'word': 'name', 'start': 3, 'end': 7}, {'entity_group': 'AUX', 'score': 0.994, 'word': 'is', 'start': 8, 'end': 10}, {'entity_group': 'PROPN', 'score': 0.999, 'word': 'sarah', 'start': 11, 'end': 16}, {'entity_group': 'CCONJ', 'score': 0.999, 'word': 'and', 'start': 17, 'end': 20}, {'entity_group': 'PRON', 'score': 0.999, 'word': 'i', 'start': 21, 'end': 22}, {'entity_group': 'VERB', 'score': 0.998, 'word': 'live', 'start': 23, 'end': 27}, {'entity_group': 'ADP', 'score': 0.999, 'word': 'in', 'start': 28, 'end': 30}, {'entity_group': 'PROPN', 'score': 0.999, 'word': 'london', 'start': 31, 'end': 37}] ``` Learn more about the basics of using a pipeline in the [pipeline tutorial](../pipeline_tutorial) This token recognition pipeline can currently be loaded from [`pipeline`] using the following task identifier: `"ner"` (for predicting the classes of tokens in a sequence: person, organisation, location or miscellaneous). The models that this pipeline can use are models that have been fine-tuned on a token classification task. See the up-to-date list of available models on [huggingface.co/models](https://huggingface.co/models?filter=token-classification). sequencesFTc t|di||j|jdk(rtnt t d|_||_y)NtfF) do_lower_caser.) super__init__check_model_type frameworkrrr_basic_tokenizer _args_parser)r$ args_parserr% __class__s r'rCz$TokenClassificationPipeline.__init__sL "6" ~~% <= !/U C'r)Ngrouped_entitiesignore_subwordsaggregation_strategyrrstriderc "i} || d<|r |dn|| d<||| d<i} ||p|r|rtj}n%|r|stj}ntj}|t j d|d|t j d|d|~t |trt|j}|tjtjtjhvr!|jjs td|| d <||| d <|s||jjk\r td |tjk(rtd |d |jjr dd|d} | | d<n td| i| fS)Nr rrzl`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="z "` instead.zk`ignore_subwords` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="z{Slow tokenizers cannot handle subwords. 
    def _sanitize_parameters(
        self,
        ignore_labels=None,
        grouped_entities: Optional[bool] = None,
        ignore_subwords: Optional[bool] = None,
        aggregation_strategy: Optional[AggregationStrategy] = None,
        offset_mapping: Optional[list[tuple[int, int]]] = None,
        is_split_into_words: Optional[bool] = False,
        stride: Optional[int] = None,
        delimiter: Optional[str] = None,
    ):
        preprocess_params = {}
        preprocess_params["is_split_into_words"] = is_split_into_words
        preprocess_params["delimiter"] = " " if delimiter is None else delimiter
        if offset_mapping is not None:
            preprocess_params["offset_mapping"] = offset_mapping

        postprocess_params = {}
        if grouped_entities is not None or ignore_subwords is not None:
            if grouped_entities and ignore_subwords:
                aggregation_strategy = AggregationStrategy.FIRST
            elif grouped_entities and not ignore_subwords:
                aggregation_strategy = AggregationStrategy.SIMPLE
            else:
                aggregation_strategy = AggregationStrategy.NONE

            if grouped_entities is not None:
                warnings.warn(
                    "`grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to"
                    f' `aggregation_strategy="{aggregation_strategy}"` instead.'
                )
            if ignore_subwords is not None:
                warnings.warn(
                    "`ignore_subwords` is deprecated and will be removed in version v5.0.0, defaulted to"
                    f' `aggregation_strategy="{aggregation_strategy}"` instead.'
                )

        if aggregation_strategy is not None:
            if isinstance(aggregation_strategy, str):
                aggregation_strategy = AggregationStrategy[aggregation_strategy.upper()]
            if (
                aggregation_strategy
                in {AggregationStrategy.FIRST, AggregationStrategy.MAX, AggregationStrategy.AVERAGE}
                and not self.tokenizer.is_fast
            ):
                raise ValueError(
                    "Slow tokenizers cannot handle subwords. Please set the `aggregation_strategy` option"
                    ' to `"simple"` or use a fast tokenizer.'
                )
            postprocess_params["aggregation_strategy"] = aggregation_strategy
        if ignore_labels is not None:
            postprocess_params["ignore_labels"] = ignore_labels

        if stride is not None:
            if stride >= self.tokenizer.model_max_length:
                raise ValueError(
                    "`stride` must be less than `tokenizer.model_max_length` (or even lower if the tokenizer adds special tokens)"
                )
            if aggregation_strategy == AggregationStrategy.NONE:
                raise ValueError(
                    "`stride` was provided to process all the text but `aggregation_strategy="
                    f'"{aggregation_strategy}"`, please select another one instead.'
                )
            else:
                if self.tokenizer.is_fast:
                    tokenizer_params = {
                        "return_overflowing_tokens": True,
                        "padding": True,
                        "stride": stride,
                    }
                    preprocess_params["tokenizer_params"] = tokenizer_params
                else:
                    raise ValueError(
                        "`stride` was provided to process all the text but you're using a slow tokenizer."
                        " Please use a fast tokenizer."
                    )
        return preprocess_params, {}, postprocess_params

    @overload
    def __call__(self, inputs: str, **kwargs: Any) -> list[dict[str, Any]]: ...

    @overload
    def __call__(self, inputs: list[str], **kwargs: Any) -> list[list[dict[str, Any]]]: ...

    def __call__(
        self, inputs: Union[str, list[str]], **kwargs: Any
    ) -> Union[list[dict[str, Any]], list[list[dict[str, Any]]]]:
        """
        Classify each token of the text(s) given as inputs.

        Args:
            inputs (`str` or `List[str]`):
                One or several texts (or one list of texts) for token classification. Can be pre-tokenized when
                `is_split_into_words=True`.

        Return:
            A list or a list of list of `dict`: Each result comes as a list of dictionaries (one for each token in the
            corresponding input, or each entity if this pipeline was instantiated with an aggregation_strategy) with
            the following keys:

            - **word** (`str`) -- The token/word classified. This is obtained by decoding the selected tokens. If you
              want to have the exact string in the original sentence, use `start` and `end`.
            - **score** (`float`) -- The corresponding probability for `entity`.
            - **entity** (`str`) -- The entity predicted for that token/word (it is named *entity_group* when
              *aggregation_strategy* is not `"none"`).
            - **index** (`int`, only present when `aggregation_strategy="none"`) -- The index of the corresponding
              token in the sentence.
            - **start** (`int`, *optional*) -- The index of the start of the corresponding entity in the sentence. Only
              exists if the offsets are available within the tokenizer
            - **end** (`int`, *optional*) -- The index of the end of the corresponding entity in the sentence. Only
              exists if the offsets are available within the tokenizer
        """
        _inputs, offset_mapping, is_split_into_words, delimiter = self._args_parser(inputs, **kwargs)
        kwargs["is_split_into_words"] = is_split_into_words
        kwargs["delimiter"] = delimiter

        if is_split_into_words and all(isinstance(input, str) for input in inputs):
            # A single pre-tokenized sequence passed as a flat list of words: wrap it so it is
            # treated as one input rather than as a batch of single-word inputs.
            return super().__call__([inputs], **kwargs)

        if offset_mapping:
            kwargs["offset_mapping"] = offset_mapping

        return super().__call__(inputs, **kwargs)
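    # Usage sketch (comments only; the checkpoint name is illustrative, and both `stride` and
    # `is_split_into_words` assume a fast tokenizer):
    #
    #     from transformers import pipeline
    #
    #     ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="first", stride=128)
    #
    #     # Pre-tokenized input: offsets are reported relative to the words joined with `delimiter`.
    #     ner(["Paris", "is", "in", "France"], is_split_into_words=True)
    #
    #     # Long text: overlapping chunks of `model_max_length` tokens are classified separately,
    #     # shifted forward by `model_max_length - stride` tokens each step, and merged afterwards.
    #     ner(some_long_text)  # `some_long_text` is a placeholder variable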
    def preprocess(self, sentence, offset_mapping=None, **preprocess_params):
        tokenizer_params = preprocess_params.pop("tokenizer_params", {})
        truncation = True if self.tokenizer.model_max_length and self.tokenizer.model_max_length > 0 else False
        word_to_chars_map = None
        is_split_into_words = preprocess_params.pop("is_split_into_words", False)
        delimiter = preprocess_params.pop("delimiter", " ")

        if is_split_into_words:
            if not isinstance(sentence, list):
                raise ValueError("When `is_split_into_words=True`, `sentence` must be a list of tokens.")
            if not self.tokenizer.is_fast:
                raise ValueError("is_split_into_words=True is only supported with fast tokenizers.")
            words = sentence
            sentence = delimiter.join(words)
            # Map every word to its character span inside the joined sentence so that token
            # offsets (which are word-relative for pre-tokenized inputs) can be shifted back.
            word_to_chars_map = []
            delimiter_len = len(delimiter)
            char_offset = 0
            for word in words:
                word_to_chars_map.append((char_offset, char_offset + len(word)))
                char_offset += len(word) + delimiter_len
            text_to_tokenize = words
        else:
            if not isinstance(sentence, str):
                raise ValueError("When `is_split_into_words=False`, `sentence` must be an untokenized string.")
            text_to_tokenize = sentence

        inputs = self.tokenizer(
            text_to_tokenize,
            is_split_into_words=is_split_into_words,
            return_tensors=self.framework,
            truncation=truncation,
            return_special_tokens_mask=True,
            return_offsets_mapping=self.tokenizer.is_fast,
            **tokenizer_params,
        )
        inputs.pop("overflow_to_sample_mapping", None)
        num_chunks = len(inputs["input_ids"])

        for i in range(num_chunks):
            if self.framework == "tf":
                model_inputs = {k: tf.expand_dims(v[i], 0) for k, v in inputs.items()}
            else:
                model_inputs = {k: v[i].unsqueeze(0) for k, v in inputs.items()}
            if offset_mapping is not None:
                model_inputs["offset_mapping"] = offset_mapping
            model_inputs["sentence"] = sentence if i == 0 else None
            model_inputs["is_last"] = i == num_chunks - 1
            if is_split_into_words:
                model_inputs["word_ids"] = inputs.word_ids(i)
            model_inputs["word_to_chars_map"] = word_to_chars_map

            yield model_inputs

    def _forward(self, model_inputs):
        # Forward
        special_tokens_mask = model_inputs.pop("special_tokens_mask")
        offset_mapping = model_inputs.pop("offset_mapping", None)
        sentence = model_inputs.pop("sentence")
        is_last = model_inputs.pop("is_last")
        word_ids = model_inputs.pop("word_ids", None)
        word_to_chars_map = model_inputs.pop("word_to_chars_map", None)
        if self.framework == "tf":
            logits = self.model(**model_inputs)[0]
        else:
            output = self.model(**model_inputs)
            logits = output["logits"] if isinstance(output, dict) else output[0]

        return {
            "logits": logits,
            "special_tokens_mask": special_tokens_mask,
            "offset_mapping": offset_mapping,
            "sentence": sentence,
            "is_last": is_last,
            "word_ids": word_ids,
            "word_to_chars_map": word_to_chars_map,
            **model_inputs,
        }

    def postprocess(self, all_outputs, aggregation_strategy=AggregationStrategy.NONE, ignore_labels=None):
        if ignore_labels is None:
            ignore_labels = ["O"]
        all_entities = []
        word_to_chars_map = all_outputs[0].get("word_to_chars_map")
        for model_outputs in all_outputs:
            if self.framework == "pt" and model_outputs["logits"][0].dtype in (torch.bfloat16, torch.float16):
                logits = model_outputs["logits"][0].to(torch.float32).numpy()
            else:
                logits = model_outputs["logits"][0].numpy()
            sentence = all_outputs[0]["sentence"]
            input_ids = model_outputs["input_ids"][0]
            offset_mapping = (
                model_outputs["offset_mapping"][0] if model_outputs["offset_mapping"] is not None else None
            )
            special_tokens_mask = model_outputs["special_tokens_mask"][0].numpy()
            word_ids = model_outputs.get("word_ids")

            # Numerically stable softmax over the label dimension
            maxes = np.max(logits, axis=-1, keepdims=True)
            shifted_exp = np.exp(logits - maxes)
            scores = shifted_exp / shifted_exp.sum(axis=-1, keepdims=True)

            if self.framework == "tf":
                input_ids = input_ids.numpy()
                offset_mapping = offset_mapping.numpy() if offset_mapping is not None else None

            pre_entities = self.gather_pre_entities(
                sentence,
                input_ids,
                scores,
                offset_mapping,
                special_tokens_mask,
                aggregation_strategy,
                word_ids=word_ids,
                word_to_chars_map=word_to_chars_map,
            )
            grouped_entities = self.aggregate(pre_entities, aggregation_strategy)
            # Filter anything that is in self.ignore_labels
            entities = [
                entity
                for entity in grouped_entities
                if entity.get("entity", None) not in ignore_labels
                and entity.get("entity_group", None) not in ignore_labels
            ]
            all_entities.extend(entities)
        num_chunks = len(all_outputs)
        if num_chunks > 1:
            all_entities = self.aggregate_overlapping_entities(all_entities)
        return all_entities

    def aggregate_overlapping_entities(self, entities):
        if len(entities) == 0:
            return entities
        entities = sorted(entities, key=lambda x: x["start"])
        aggregated_entities = []
        previous_entity = entities[0]
        for entity in entities:
            if previous_entity["start"] <= entity["start"] < previous_entity["end"]:
                current_length = entity["end"] - entity["start"]
                previous_length = previous_entity["end"] - previous_entity["start"]
                if current_length > previous_length:
                    previous_entity = entity
                elif current_length == previous_length and entity["score"] > previous_entity["score"]:
                    previous_entity = entity
            else:
                aggregated_entities.append(previous_entity)
                previous_entity = entity
        aggregated_entities.append(previous_entity)
        return aggregated_entities
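    # Illustrative sketch of the overlap resolution above (dicts are shortened, values are made
    # up): when `stride` produces overlapping chunks, the same region of text can yield two
    # candidate entities; the longer span wins, and for equal lengths the higher score wins.
    #
    #   [{"word": "New",      "start": 10, "end": 13, "score": 0.80},
    #    {"word": "New York", "start": 10, "end": 18, "score": 0.95}]
    #   -> [{"word": "New York", "start": 10, "end": 18, "score": 0.95}]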
"),,.#Ig64>><>7NN--335PRVD "%Tc(m!;J,+11+33+//0 ! ]'"+Q!e3hyST}W`cdWd>e3eJy~&$..*E*EE#D!&J " &"( J    +s9 ,tr)rc|tjtjhvrlg}|D]d}|dj}|d|}|jj j |||d|d|d|dd}|j|fn|j||}|tjk(r|S|j|S)Nrrr~rr)rrrr~rr) r0r6r7argmaxrconfigid2labelrvaggregate_wordsgroup_entities)r$rrLrr entity_idxrrs r'rz%TokenClassificationPipeline.aggregates $7$<$<>Q>X>X#Y YH* ( '188: "8,Z8"jj//88D"'0&v.'0%e, ' (++L:NOH #6#;#; ;O""8,,r)rc(|jj|Dcgc]}|d c}}|tjk(rA|dd}|j }||}|j j j|}n|tjk(rLt|d}|d}|j }||}|j j j|}n|tjk(rvtj|Dcgc]}|d c}}tj|d} | j } |j j j| }| | }n td||||dd|d d d } | Scc}wcc}w) Nr~rrc(|djS)Nr)r5)rs r'rz.!s&:J:N:N:Pr)r)rzInvalid aggregation_strategyrrr)rrr~rr)rWconvert_tokens_to_stringr0r8rrrrr:r5r9rstacknanmeanr#) r$rrLrr~rrr max_entityaverage_scoresr new_entitys r'aggregate_wordz*TokenClassificationPipeline.aggregate_words|~~66U]7^6v7^_ #6#<#< <a[*F--/C3KEZZ&&//4F !%8%<%< <X+PQJ)F--/C3KEZZ&&//4F !%8%@%@ @XXhGFvh/GHFZZQ7N'..0JZZ&&// ;F":.E;< <a[)B<&  78_Hs F  Fc>|tjtjhvr tdg}d}|D]C}||g} |dr|j | |j |j |||g}E|!|j |j |||S)z Override tokens from a given word that disagree to force agreement on word boundaries. Example: micro|soft| com|pany| B-ENT I-NAME I-ENT I-ENT will be rewritten with first strategy as microsoft| company| B-ENT I-ENT z;NONE and SIMPLE strategies are invalid for word aggregationNr)r0r6r7r#rvr)r$rrL word_entities word_grouprs r'rz+TokenClassificationPipeline.aggregate_words7s  $ $  & &$  Z[ [   &F!$X  %!!&)$$T%8%8EY%Z[$X  &  !  !4!4ZAU!V Wr)c@|ddjddd}tj|Dcgc]}|d c}}|Dcgc]}|d }}tj||jj ||dd|dd d }|Scc}wcc}w) z Group together the adjacent tokens with the same entity predicted. Args: entities (`dict`): The entities predicted by the pipeline. rr-r rrr~rr)rrr~rr)splitrrmeanrWr)r$rrrtokensrs r'group_sub_entitiesz.TokenClassificationPipeline.group_sub_entitiesSs!X&,,S!4R88DVG_DE/78V&.88#WWV_NN;;FCa[)B<&  E8s B B entity_namec|jdr d}|dd}||fS|jdr d}|dd}||fSd}|}||fS)NzB-BrzI-I) startswith)r$rbitags r'get_tagz#TokenClassificationPipeline.get_taghsk  ! !$ 'Bab/C3w # #D )Bab/C 3wBC3wr)chg}g}|D]}|s|j||j|d\}}|j|dd\}}||k(r|dk7r|j|d|j|j||g}|r |j|j||S)z Find and group together the adjacent tokens with the same entity predicted. Args: entities (`dict`): The entities predicted by the pipeline. rrr)rvrr) r$r entity_groupsentity_group_disaggrrrlast_bilast_tags r'rz*TokenClassificationPipeline.group_entitiesvs   /F&#**62 ll6(#34GB $ -@-DX-N O GXh29#**62$$T%<%<=P%QR'-h#' /(   !8!89L!M Nr))NNNNNFNNr_)NN)'r*r+r,r-default_input_names_load_processor_load_image_processor_load_feature_extractor_load_tokenizerrrCrboolr0rrrr r\rrrr(rrrr6rrrndarrayrrrrrrr __classcell__)rIs@r'r=r=Bs F"H&O!#O#E#G (+/*.>B:>$) $#'N9#4.N9"$ N9 '':; N9 !eCHo!67 N9"N9 N9C=N9`OsOcOd4S>6JOO [tCy[C[Dd3PS8nAU%?? @%2N9v 4=Prs 11: TS^XFF:,40n!Du-uE!Dup* r)