L i0tddlmZddlZddlmZddlmZddlmZddl mZ ddl m Z m Z mZmZmZddlmZdd lmZdd lmZmZmZmZmZdd Zdd Zedd ZeddZeeddZ d ddZ ddZ!ddZ"ddZ#d dZ$ed d! d"dZ%y)#) annotationsN)IncrementalDecoder)Counter) lru_cache) FREQUENCIESKO_NAMESLANGUAGE_SUPPORTED_COUNTTOO_SMALL_SEQUENCEZH_NAMES) is_suspiciously_successive_range)CoherenceMatches)is_accentuatedis_latinis_multi_byte_encodingis_unicode_range_secondary unicode_rangect|r tdtjd|j}|d}i}d}t ddD]W}|j t|g}|s!t|}|/t|dus=||vrd||<||xxd z cc<|d z }Yt|Dcgc]}|||z d k\r|c}Scc}w) zF Return associated unicode ranges in a single byte code page. z.Function not supported on multi-byte code pagez encodings.ignore)errorsr@Frg333333?) rOSError importlib import_modulerrangedecodebytesrrsorted) iana_namedecoderp seen_rangescharacter_countichunkcharacter_ranges [/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/charset_normalizer/cd.pyencoding_unicode_ranger)si(FGG%% 9+&>?RRG#84A"$KO 4  %XXeQCj) *7*>O&)/:eC"+534K0O,1,1$ % $/ ?+o=E    s5Ccg}tjD]-\}}|D]#}t||k(s|j|-/|S)z> Return inferred languages used with a unicode range. )ritemsrappend) primary_range languageslanguage characters characters r(unicode_range_languagesr2@s[I + 1 1 3*# IY'=8  *  cZt|}d}|D] }d|vs|}n|dgSt|S)z Single-byte encoding language association. Some code page are heavily linked to particular language(s). This function does the correspondence. NLatin Latin Based)r)r2)r unicode_rangesr-specified_ranges r(encoding_languagesr9OsM !7y AN $M) / )+M   "= 11r3c|jds'|jds|jds|dk(rdgS|jds|tvrdgS|jds|tvrd gSgS) z Multi-byte encoding language association. Some code page are heavily linked to particular language(s). This function does the correspondence. shift_ iso2022_jpeuc_jcp932JapanesegbChinese iso2022_krKorean) startswithr r )r s r(mb_encoding_languagesrEcs} X&    -    (  |D!Y(%:{L)Y(-Bz Ir3)maxsizecrd}d}t|D]$}|s t|rd}|st|dus#d}&||fS)zg Determine main aspects from a supported language if it contains accents and if is pure Latin. FT)rrr)r/target_have_accentstarget_pure_latinr1s r(get_target_featuresrJxsW !&" *& "~i'@"&  )!4!= %  &  1 11r3cg}td|D}tjD]h\}}t|\}}|r|dur|dur|r"t |}t |D cgc] } | |vs|  c} } | |z } | dk\sV|j || fjt |dd}|D cgc]} | d c} Scc} wcc} w)zE Return associated languages associated to given characters. c32K|]}t|ywN)r).0r1s r( z%alphabet_languages..sTInY7TsFg?c |dSNrxs r(z$alphabet_languages..s !r3Tkeyreverser)anyrr+rJlenr,r) r0ignore_non_latinr.source_have_accentsr/language_charactersrHrIr$ccharacter_match_countratiocompatible_languages r(alphabet_languagesrbs *,ITTT)4):):)<0%%1DX1N..  1U :  % ',? "#67%(+ ?1qJQ ?& -> C<   h. /%0(yndCI>G H':  " HH @ Is B6 )B6 ' B;c8|tvrt|dd}tt|}t|}tt|}|dkD}t |t d|D]'\}}||vr t|j |} ||z } t|| z} |durt| | z dkDrM|durt| | z |dz kr|dz }kt|d| } t|| d } |d|}||d }tt|t| z}tt|t| z}t| dk(r |dkr|dz }t| dk(r |dkr|dz }|t| z d k\s|t| z d k\s#|dz }*|t|z S) aN Determine if a ordered characters list (by occurrence from most appearance to rarest) match a particular language. The result is a ratio between 0. (absolutely no correspondence) and 1. (near perfect fit). Beware that is function is not strict on the match in order to ease the detection. (Meaning close match is 1.) z not availablerFTrNg?) r ValueErrorsetrZziprindexintabs)r/ordered_characterscharacter_approved_countFREQUENCIES_language_setordered_characters_count target_language_characters_countlarge_alphabetr1character_rankcharacter_rank_in_languageexpected_projection_ratiocharacter_rank_projectioncharacters_before_sourcecharacters_after_sourcecharacters_beforecharacters_afterbefore_match_countafter_match_counts r(characters_popularity_comparer}s;{"H:^455$%";x#89$'(:$;,/ H0E,F$;b@N%(E!%=>&8! > 4 4 *5h*?*E*Ei*P" ,/G G "*-^>W-W)X! e #-0JJKaO  d "-0JJK.23 % ) $ .9(.C (/  .9-B & '. (:!N'K&8&I"% ! "S)A%B B# "%  !C(?$@ @"  ' (A -2D2I $ ) $  & '1 ,1Ba1G $ ) $  %=!> ># E 3'>#??3F $ ) $ q8t $c*<&= ==r3c,i}|D]u}|jdurt|}|$d}|D]}t||dus|}n||}||vr|j||<[||xx|jz cc<wt |j S)a Given a decoded text sequence, return a list of str. Unicode range / alphabet separation. Ex. a text containing English/Latin with a bit a Hebrew will return two items in the resulting list; One containing the latin letters and the other hebrew. FN)isalpharr lowerlistvalues)decoded_sequencelayersr1r'layer_target_rangediscovered_ranges r(alpha_unicode_splitrs  F%8    % ' &3I&>  " )- &  01A?S&6"    %!0  V +)2):F% & !"ioo&77"588    r3c i}|D]-}|D]&}|\}}||vr|g||<||j|(/|Dcgc]+}|tt||t||z df-}}t |ddScc}w)z This function merge results previously given by the function coherence_ratio. The return type is the same as coherence_ratio. rec |dSrQrRrSs r(rUz(merge_coherence_ratios..<s qtr3TrV)r,roundsumrZr)resultsper_language_ratiosresult sub_resultr/r`merges r(merge_coherence_ratiosr#s 358  8J(OHe2216#H-  ) 0 0 7  88 ,    '12S9LX9V5WW   E  %^T :: s0A:ct|D]6}|\}}|jdd}|vrg|<|j|8tfdDr*g}D]!}|j|t |f#|S|S)u We shall NOT return "English—" in CoherenceMatches because it is an alternative of "English". This function only keeps the best match and remove the em-dash in it. u—c3@K|]}t|dkDyw)rN)rZ)rNe index_resultss r(rOz/filter_alt_coherence_matches..Os <3}Q 1 $ .qs"=A1"=sg?rrerfc |dSrQrRrSs r(rUz!coherence_ratio..s QqTr3rV) splitremoverr most_commonrr rbr}r,rrr)r threshold lg_inclusionrr[sufficient_match_countlg_inclusion_listlayersequence_frequenciesrr$r^rpopular_character_orderedr/r`s r(coherence_ratiorZsB(*G""#3?3K **3/QS))  /$%5618*668 ""="== 0 0 >I/Jda/J!/J) -? %'7.  H93Ey #&!+& NNHeE1o6 7%* 8 $W->4 '0Ks= C;)r strreturn list[str])r-rrr)r/rrztuple[bool, bool])F)r0rr[boolrr)r/rrmrrfloat)rrrr)rzlist[CoherenceMatches]rr)rrrr)g?N)rrrrrz str | Nonerr)& __future__rrcodecsr collectionsr functoolsrtyping TypeCounterconstantrr r r r mdr modelsrutilsrrrrrr)r2r9rErJrbr}rrrrrRr3r(rs"%)1$"J  2 2&  ( +, 2- 2"5: I I-1 I IFM>M>'0M> M>`$!N;86 4NR00&+0AK000r3