L i!9ddlZddlZddlZddlmZddlmZmZmZm Z m Z m Z m Z ddl mZmZej j#eZej j'edZee5Zej/ZdddGddZe eZe e eZGdd e ZGd d ZGd d ZGddZ efde e!de!fdZ"y#1swYWxYw)N)Template)AnyCallableDictList NamedTupleOptionalTuple)Encoding Tokenizerzvisualizer-styles.cssc@eZdZUeed<eed<eed<dededefdZy) Annotationstartendlabelc.||_||_||_yN)rrr)selfrrrs a/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/tokenizers/tools/visualizer.py__init__zAnnotation.__init__s  N)__name__ __module__ __qualname__int__annotations__strrrrrrs+ J H JcCrrc.eZdZUeeed<eeed<y) CharStateKeytoken_ixanno_ixN)rrrr rrrrrr r ssm c]rr cPeZdZUeeed<dZedZedZ de fdZ y) CharStatechar_ixc.||_d|_g|_yr)r%r"tokens)rr%s rrzCharState.__init__'s &* !# rcTt|jdkDr|jdSdS)Nrlenr'rs rr!zCharState.token_ix-s%!$T[[!1A!5t{{1~?4?rc2t|jdkDS)zJ BPE tokenizers can output more than one token for a char r)r+s r is_multitokenzCharState.is_multitoken1s 4;;!##rreturncDt|j|jS)N)r!r")r r!r"r+s r partition_keyzCharState.partition_key8s]]LL  rN) rrrr rrrpropertyr!r.r r1rrrr$r$$sG c]$ @@$$  | rr$c eZdZy)AlignedN)rrrrrrr4r4?srr4c ReZdZdZej dej Z ddede de e e ge ffdZgdfd ed ede e d e efd Zed ed eeeffd Zedeed edefdZed eded ed efdZed ed ed efdZed eded ed eefdZy)EncodingVisualizera Build an EncodingVisualizer Args: tokenizer (:class:`~tokenizers.Tokenizer`): A tokenizer instance default_to_notebook (:obj:`bool`): Whether to render html output in a notebook by default annotation_converter (:obj:`Callable`, `optional`): An optional (lambda) function that takes an annotation in any format and returns an Annotation object z(.{1})?(unk|oov)(.{1})?)flagsN tokenizerdefault_to_notebookannotation_converterct|r ddlm}m}||_||_||_y#t$r t dwxYw)NrHTMLdisplayzWe couldn't import IPython utils for html display. Are you running in a notebook? You can also pass `default_to_notebook=False` to get back raw HTML )IPython.core.displayr=r> ImportError Exceptionr8r9annotation_coverter)rr8r9r:r=r>s rrzEncodingVisualizer.__init__VsN  >##6 #7    s"7text annotationsr/cH|j}||}|r ddlm}m}|j tt|j |}|jj|}tj|||}|r|y|S#t$r t dwxYw)a Build a visualization of the given text Args: text (:obj:`str`): The text to tokenize annotations (:obj:`List[Annotation]`, `optional`): An optional list of annotations of the text. The can either be an annotation class or anything else if you instantiated the visualizer with a converter function default_to_notebook (:obj:`bool`, `optional`, defaults to `False`): If True, will render the html in a notebook. Otherwise returns an html string. Returns: The HTML string if default_to_notebook is False, otherwise (default) returns None and renders the HTML in the notebook Nrr<zeWe couldn't import IPython utils for html display. Are you running in a notebook?) r9r?r=r>r@rArBlistmapr8encoder6_EncodingVisualizer__make_html) rrCrDr9final_default_to_notebookr=r>encodinghtmls r__call__zEncodingVisualizer.__call__ls2%)$<$<!  *(; % $ >  # # /s4#;#;[IJK>>((.!--dHkJ $ DJ K 6 s B B!ct|dk(riSttd|}t|}td|z }|dkrd}d}d}d}i}t |D]}d|d |d |d ||<||z }|S) a Generates a color palette for all the labels in a given set of annotations Args: annotations (:obj:`Annotation`): A list of annotations Returns: :obj:`dict`: A dictionary mapping labels to colors in HSL format rc|jSr)r)xs rz;EncodingVisualizer.calculate_label_colors..s 177r @ zhsl(,z%,z%))r*setrGrsorted) rDlabels num_labelsh_stepslhcolorsrs rcalculate_label_colorsz)EncodingVisualizer.calculate_label_colorss { q IS*K89[ S:%& B;F   F^ E"1#QqcA3b1F5M KA  rconsecutive_chars_listrKc|d}|j|j|j}d|dS|d}|j}|jdz}|||}g} i} |j| jd|jr| jd|jdzr| jd n| jd t j j|j|j?| jd |j|j| d <n| jd ddj| d} d} | jD]\} }| d| d|dz } d| d| d|dS)a Converts a list of "consecutive chars" into a single HTML element. Chars are consecutive if they fall under the same word, token and annotation. The CharState class is a named tuple with a "partition_key" method that makes it easy to compare if two chars are consecutive. Args: consecutive_chars_list (:obj:`List[CharState]`): A list of CharStates that have been grouped together text (:obj:`str`): The original text being processed encoding (:class:`~tokenizers.Encoding`): The encoding returned from the tokenizer Returns: :obj:`str`: The HTML span for a set of consecutive chars rz(r-tokenz multi-tokenz odd-tokenz even-tokenz special-tokenstokz non-tokenzclass=" "z data-z="z) r%r'r!appendr.r6unk_token_regexsearchjoinitems)rbrCrKfirststokenlastrr span_text css_classes data_itemscssdatakeyvals rconsecutive_chars_to_htmlz,EncodingVisualizer.consecutive_chars_to_htmls2'q) == __U^^4F>fXXN N%b) llQsO   >> %   w '""""=1~~! "";/""<0!11889XYe""?3%-__U^^%D 6"   { +#((;/04"((* +HC fSEC5* *D +uAdV2i[88rcBtj|||}|dg}|dj}g}tj|}|dj}|.||} | j} || } |j d| d| d|ddD]} | j}||k7rm|j tj |||| g}||j d|.||} | j} || } |j d| d| d|}| j|djk(r|j | |j tj |||| g}|j tj |||t|} | S)Nrz&r-)rCrKrk) r6%_EncodingVisualizer__make_char_statesr"rarrlr{r1HTMLBody)rCrKrD char_statescurrent_consecutive_chars prev_anno_ixspanslabel_colors_dict cur_anno_ixannorcolorcsress r __make_htmlzEncodingVisualizer.__make_htmls(;;D(KX %0^$4!"1~-- .EEkR!!n,,  "{+DJJE%e,E LLA%W\V]]_` aab/& 1B**Kl* &@@1!!)A.0D)+LL+*&{3D JJE-e4ELL#I%P^_d^eeg!hi&L!%>q%A%O%O%QQ)004 &@@1!!)A.0D)M& 1R   8 8)! 9  uo rcdgt|z}t|D]/\}}t|j|jD]}|||< 1|S)a Args: text (:obj:`str`): The raw text we want to align to annotations (:obj:`AnnotationList`): A (possibly empty) list of annotations Returns: A list of length len(text) whose entry at index i is None if there is no annotation on character i or k, the index of the annotation that covers index i where k is with respect to the list of annotations N)r* enumeraterangerr)rCrDannotation_mapr"ais r__make_anno_mapz"EncodingVisualizer.__make_anno_map<s\#d)+#K0 ,JGQ177AEE* ,$+q! , ,rctj||}tt|Dcgc] }t |}}t |j D]M\}}|j|}||\} } t| | D] } || j j|"Ot |D]\}} | ||_ |Scc}w)a For each character in the original text, we emit a tuple representing it's "state": * which token_ix it corresponds to * which word_ix it corresponds to * which annotation_ix it corresponds to Args: text (:obj:`str`): The raw text we want to align to annotations (:obj:`List[Annotation]`): A (possibly empty) list of annotations encoding: (:class:`~tokenizers.Encoding`): The encoding returned from the tokenizer Returns: :obj:`List[CharState]`: A list of CharStates, indicating for each char in the text what it's state is ) r6"_EncodingVisualizer__make_anno_maprr*r$rr'token_to_charsrlr") rCrKrDrr%rr!reoffsetsrrrr"s r__make_char_statesz%EncodingVisualizer.__make_char_statesQs.,;;D+NJOPSTXPYJZ'[w '(:'[ '[(9 ;OHe--h7G"$ suc*;AN))00:;  ; !*. 9 3 GW+2K ( 3(\sC)TN)rrr__doc__recompile IGNORECASErmr boolr rrrrrAnnotationListrM staticmethodrrarr$r r{rIPartialIntListrr}rrrr6r6Cs !bjj!>bmmTO %)FJ   " 'xz0A'BC  2').2 ++$+&d^ + # +ZNtCH~8A9 $YA9A9A9A9F?#???SV??Bc>("""~"Z^_hZi""rr6childrenr/c6dj|}d|d|dS)a[ Generates the full html with css from a list of html spans Args: children (:obj:`List[str]`): A list of strings, assumed to be html elements css_styles (:obj:`str`, `optional`): Optional alternative implementation of the css Returns: :obj:`str`: An HTML string with style markup rjz?
z4
)ro)r css_styles children_texts rr~r~ws9GGH%M  O  r)# itertoolsosrstringrtypingrrrrrr r tokenizersr r pathdirname__file__ro css_filenameopenfreadrwrrrrr r$r4r6rr~rrrrs III* ''//( #ww||G%<=  ,1 &&(Cj!hsm$:   6  qqh .1tCySW s ,CC