"""
A module for language identification using the TextCat algorithm.
An implementation of the text categorization algorithm
presented in Cavnar, W. B. and J. M. Trenkle,
"N-Gram-Based Text Categorization".

The algorithm takes advantage of Zipf's law and uses
n-gram frequencies to profile languages and text-yet to
be identified-then compares using a distance measure.

Language n-grams are provided by the "An Crubadan"
project. A corpus reader was created separately to read
those files.

For details regarding the algorithm, see:
https://www.let.rug.nl/~vannoord/TextCat/textcat.pdf

For details about An Crubadan, see:
https://borel.slu.edu/crubadan/index.html
"""

from sys import maxsize

from nltk.util import trigrams

try:
    import regex as re
except ImportError:
    re = None


class TextCat:
    _corpus = None
    fingerprints = {}
    _START_CHAR = "<"
    _END_CHAR = ">"

    last_distances = {}

    def __init__(self):
        if not re:
            raise OSError(
                "classify.textcat requires the regex module that "
                "supports unicode. Try '$ pip install regex' and "
                "see https://pypi.python.org/pypi/regex for "
                "further details."
            )

        from nltk.corpus import crubadan

        self._corpus = crubadan
        for lang in self._corpus.langs():
            self._corpus.lang_freq(lang)

    def remove_punctuation(self, text):
        """Get rid of punctuation except apostrophes"""
        return re.sub(r"[^\P{P}\']+", "", text)

    def profile(self, text):
        """Create FreqDist of trigrams within text"""
        from nltk import FreqDist, word_tokenize

        clean_text = self.remove_punctuation(text)
        tokens = word_tokenize(clean_text)

        fingerprint = FreqDist()
        for t in tokens:
            token_trigram_tuples = trigrams(self._START_CHAR + t + self._END_CHAR)
            token_trigrams = ["".join(tri) for tri in token_trigram_tuples]

            for cur_trigram in token_trigrams:
                if cur_trigram in fingerprint:
                    fingerprint[cur_trigram] += 1
                else:
                    fingerprint[cur_trigram] = 1

        return fingerprint

    def calc_dist(self, lang, trigram, text_profile):
        """Calculate the "out-of-place" measure between the
        text and language profile for a single trigram"""
        lang_fd = self._corpus.lang_freq(lang)
        dist = 0

        if trigram in lang_fd:
            idx_lang_profile = list(lang_fd.keys()).index(trigram)
            idx_text = list(text_profile.keys()).index(trigram)

            dist = abs(idx_lang_profile - idx_text)
        else:
            dist = maxsize

        return dist

    def lang_dists(self, text):
        """Calculate the "out-of-place" measure between
        the text and all languages"""
        distances = {}
        profile = self.profile(text)
        for lang in self._corpus._all_lang_freq.keys():
            lang_dist = 0
            for trigram in profile:
                lang_dist += self.calc_dist(lang, trigram, profile)

            distances[lang] = lang_dist

        return distances

    def guess_language(self, text):
        """Find the language with the min distance
        to the text and return its ISO 639-3 code"""
        self.last_distances = self.lang_dists(text)

        return min(self.last_distances, key=self.last_distances.get)


def demo():
    from nltk.corpus import udhr

    langs = [
        "Kurdish-UTF8",
        "Abkhaz-UTF8",
        "Farsi_Persian-UTF8",
        "Hindi-UTF8",
        "Hawaiian-UTF8",
        "Russian-UTF8",
        "Vietnamese-UTF8",
        "Serbian_Srpski-UTF8",
        "Esperanto-UTF8",
    ]

    friendly = {
        "kmr": "Northern Kurdish",
        "abk": "Abkhazian",
        "pes": "Iranian Persian",
        "hin": "Hindi",
        "haw": "Hawaiian",
        "rus": "Russian",
        "vie": "Vietnamese",
        "srp": "Serbian",
        "epo": "Esperanto",
    }

    tc = TextCat()

    for cur_lang in langs:
        raw_sentences = udhr.sents(cur_lang)
        rows = len(raw_sentences) - 1
        cols = list(map(len, raw_sentences))

        sample = ""

        for i in range(0, rows):
            cur_sent = " " + " ".join([raw_sentences[i][j] for j in range(0, cols[i])])
            sample += cur_sent

        print("Language snippet: " + sample[0:140] + "...")
        guess = tc.guess_language(sample)
        print(f"Language detection: {guess} ({friendly[guess]})")
        print("#" * 140)


if __name__ == "__main__":
    demo()