"""
Unit tests for nltk.tokenize.
See also nltk/test/tokenize.doctest
"""

from typing import List, Tuple

import pytest

from nltk.tokenize import (
    LegalitySyllableTokenizer,
    StanfordSegmenter,
    SyllableTokenizer,
    TreebankWordTokenizer,
    TweetTokenizer,
    punkt,
    sent_tokenize,
    word_tokenize,
)
from nltk.tokenize.simple import CharTokenizer


def load_stanford_segmenter():
    try:
        seg = StanfordSegmenter()
        seg.default_config("ar")
        seg.default_config("zh")
        return True
    except LookupError:
        return False


check_stanford_segmenter = pytest.mark.skipif(
    not load_stanford_segmenter(),
    reason="NLTK was unable to find stanford-segmenter.jar.",
)


class TestTokenize:
    def test_tweet_tokenizer(self):
        """
        Test TweetTokenizer using words with special and accented characters.
        """
        tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
        s9 = "@myke: Let's test these words: resumé España München français"
        tokens = tokenizer.tokenize(s9)
        expected = [
            ":",
            "Let's",
            "test",
            "these",
            "words",
            ":",
            "resumé",
            "España",
            "München",
            "français",
        ]
        assert tokens == expected

    @pytest.mark.parametrize(
        "test_input, expecteds",
        [
            (
                "My text 0106404243030 is great text",
                (
                    ["My", "text", "0106404243030", "is", "great", "text"],
                    ["My", "text", "0106404243030", "is", "great", "text"],
                ),
            ),
            (
                "My ticket id is 1234543124123",
                (
                    ["My", "ticket", "id", "is", "1234543124123"],
                    ["My", "ticket", "id", "is", "1234543124123"],
                ),
            ),
            (
                "@remy: This is waaaaayyyy too much for you!!!!!! 01064042430",
                (
                    [":", "This", "is", "waaayyy", "too", "much", "for", "you", "!", "!", "!", "01064042430"],
                    [":", "This", "is", "waaayyy", "too", "much", "for", "you", "!", "!", "!", "01064042430"],
                ),
            ),
            (
                "My number is 06-46124080, except it's not.",
                (
                    ["My", "number", "is", "06-46124080", ",", "except", "it's", "not", "."],
                    ["My", "number", "is", "06-46124080", ",", "except", "it's", "not", "."],
                ),
            ),
            (
                "My number is 601-984-4813, except it's not.",
                (
                    ["My", "number", "is", "601-984-4813", ",", "except", "it's", "not", "."],
                    ["My", "number", "is", "601-984-4813", ",", "except", "it's", "not", "."],
                ),
            ),
            (
                "My number is (393) 928 -3010, except it's not.",
                (
                    ["My", "number", "is", "(393) 928 -3010", ",", "except", "it's", "not", "."],
                    ["My", "number", "is", "(", "393", ")", "928", "-3010", ",", "except", "it's", "not", "."],
                ),
            ),
            (
                "The product identification number is 48103284512.",
                (
                    ["The", "product", "identification", "number", "is", "48103284512", "."],
                    ["The", "product", "identification", "number", "is", "48103284512", "."],
                ),
            ),
            (
                "My favourite substraction is 240 - 1353.",
                (
                    ["My", "favourite", "substraction", "is", "240 - 1353", "."],
                    ["My", "favourite", "substraction", "is", "240", "-", "1353", "."],
                ),
            ),
        ],
    )
    def test_tweet_tokenizer_expanded(
        self, test_input: str, expecteds: Tuple[List[str], List[str]]
    ):
        """
        Test `match_phone_numbers` in TweetTokenizer.

        Note that TweetTokenizer is also passed the following for these tests:
            * strip_handles=True
            * reduce_len=True

        :param test_input: The input string to tokenize using TweetTokenizer.
        :type test_input: str
        :param expecteds: A 2-tuple of tokenized sentences. The first of the two
            tokenized is the expected output of tokenization with
            `match_phone_numbers=True`. The second of the two tokenized lists
            is the expected output of tokenization with
            `match_phone_numbers=False`.
        :type expecteds: Tuple[List[str], List[str]]
        """
        for match_phone_numbers, expected in zip([True, False], expecteds):
            tokenizer = TweetTokenizer(
                strip_handles=True,
                reduce_len=True,
                match_phone_numbers=match_phone_numbers,
            )
            predicted = tokenizer.tokenize(test_input)
            assert predicted == expected

    def test_sonority_sequencing_syllable_tokenizer(self):
        """
        Test SyllableTokenizer tokenizer.
        """
        tokenizer = SyllableTokenizer()
        tokens = tokenizer.tokenize("justification")
        assert tokens == ["jus", "ti", "fi", "ca", "tion"]

    def test_syllable_tokenizer_numbers(self):
        """
        Test SyllableTokenizer tokenizer.
        """
        tokenizer = SyllableTokenizer()
        text = "9" * 10000
        tokens = tokenizer.tokenize(text)
        assert tokens == [text]

    def test_legality_principle_syllable_tokenizer(self):
        """
        Test LegalitySyllableTokenizer tokenizer.
        """
        from nltk.corpus import words

        test_word = "wonderful"
        tokenizer = LegalitySyllableTokenizer(words.words())
        tokens = tokenizer.tokenize(test_word)
        assert tokens == ["won", "der", "ful"]

    @check_stanford_segmenter
    def test_stanford_segmenter_arabic(self):
        """
        Test the Stanford Word Segmenter for Arabic (default config)
        """
        seg = StanfordSegmenter()
        seg.default_config("ar")
        sent = "يبحث علم الحاسوب استخدام الحوسبة بجميع اشكالها لحل المشكلات"
        segmented_sent = seg.segment(sent.split())
        assert segmented_sent.split() == [
            "يبحث",
            "علم",
            "الحاسوب",
            "استخدام",
            "الحوسبة",
            "ب",
            "جميع",
            "اشكال",
            "ها",
            "ل",
            "حل",
            "المشكلات",
        ]

    @check_stanford_segmenter
    def test_stanford_segmenter_chinese(self):
        """
        Test the Stanford Word Segmenter for Chinese (default config)
        """
        seg = StanfordSegmenter()
        seg.default_config("zh")
        sent = "这是斯坦福中文分词器测试"
        segmented_sent = seg.segment(sent.split())
        assert segmented_sent.split() == ["这", "是", "斯坦福", "中文", "分词器", "测试"]

    def test_phone_tokenizer(self):
        """
        Test a string that resembles a phone number but contains a newline
        """
        tokenizer = TweetTokenizer()

        # Should be recognized as a single phone number despite the spacing.
        test1 = "(393) 928 -3010"
        expected = ["(393) 928 -3010"]
        result = tokenizer.tokenize(test1)
        assert result == expected

        # The newline breaks the phone-number match; only the trailing part
        # still matches.
        test2 = "(393)\n928 -3010"
        expected = ["(", "393", ")", "928 -3010"]
        result = tokenizer.tokenize(test2)
        assert result == expected

    def test_emoji_tokenizer(self):
        """
        Test a string that contains Emoji ZWJ Sequences and skin tone modifier
        """
        tokenizer = TweetTokenizer()

        # An Emoji ZWJ Sequence builds a single emoji and should not be split.
        test1 = "👨‍👩‍👧‍👧"
        expected = ["👨‍👩‍👧‍👧"]
        result = tokenizer.tokenize(test1)
        assert result == expected

        # An emoji with a skin tone modifier should stay a single token.
        test2 = "👨🏿"
        expected = ["👨🏿"]
        result = tokenizer.tokenize(test2)
        assert result == expected

        # A string containing both skin tone modifiers and ZWJ sequences.
        test3 = "🤔 🙈 me así, se😌 ds 💕👭👙 hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"
        expected = [
            "🤔",
            "🙈",
            "me",
            "así",
            ",",
            "se",
            "😌",
            "ds",
            "💕",
            "👭",
            "👙",
            "hello",
            "👩🏾‍🎓",
            "emoji",
            "hello",
            "👨‍👩‍👦‍👦",
            "how",
            "are",
            "😊",
            "you",
            "today",
            "🙅🏽",
            "🙅🏽",
        ]
        result = tokenizer.tokenize(test3)
        assert result == expected

        # Regional indicator symbols pair up into flag sequences.
        test4 = "🇦🇵🇵🇱🇪"
        expected = ["🇦🇵", "🇵🇱", "🇪"]
        result = tokenizer.tokenize(test4)
        assert result == expected

        test5 = "Hi 🇨🇦, 😍!!"
        expected = ["Hi", "🇨🇦", ",", "😍", "!", "!"]
        result = tokenizer.tokenize(test5)
        assert result == expected

        test6 = "<3 🇨🇦 🤝 🇵🇱 <3"
        expected = ["<3", "🇨🇦", "🤝", "🇵🇱", "<3"]
        result = tokenizer.tokenize(test6)
        assert result == expected

    def test_pad_asterisk(self):
        """
        Test padding of asterisk for word tokenization.
        """
        text = "This is a, *weird sentence with *asterisks in it."
        expected = [
            "This",
            "is",
            "a",
            ",",
            "*",
            "weird",
            "sentence",
            "with",
            "*",
            "asterisks",
            "in",
            "it",
            ".",
        ]
        assert word_tokenize(text) == expected

    def test_pad_dotdot(self):
        """
        Test padding of dotdot* for word tokenization.
        """
        text = "Why did dotdot.. not get tokenized but dotdotdot... did? How about manydots....."
        expected = [
            "Why",
            "did",
            "dotdot",
            "..",
            "not",
            "get",
            "tokenized",
            "but",
            "dotdotdot",
            "...",
            "did",
            "?",
            "How",
            "about",
            "manydots",
            ".....",
        ]
        assert word_tokenize(text) == expected

    def test_remove_handle(self):
        """
        Test remove_handle() from casual.py with specially crafted edge cases
        """
        tokenizer = TweetTokenizer(strip_handles=True)

        # Simple example; handles made up of digits only are stripped too.
        test1 = "@twitter hello @twi_tter_. hi @12345 @123news"
        expected = ["hello", ".", "hi"]
        result = tokenizer.tokenize(test1)
        assert result == expected

        # A handle is stripped when preceded by any of these characters.
        test2 = "@n`@n~@n(@n)@n-@n=@n+@n\\@n|@n[@n]@n{@n}@n;@n:@n'@n\"@n/@n?@n.@n,@n<@n>@n @n\n@n ñ@n.ü@n.ç@n."
        expected = [
            "`", "~", "(", ")", "-", "=", "+", "\\", "|", "[", "]", "{", "}",
            ";", ":", "'", '"', "/", "?", ".", ",", "<", ">",
            "ñ", ".", "ü", ".", "ç", ".",
        ]
        result = tokenizer.tokenize(test2)
        assert result == expected

        # A handle is NOT stripped when preceded by alphanumerics, "_" or
        # handle-like characters.
        test3 = "a@n j@n z@n A@n L@n Z@n 1@n 4@n 7@n 9@n 0@n _@n !@n @@n #@n $@n %@n &@n *@n"
        expected = [
            "a", "@n", "j", "@n", "z", "@n", "A", "@n", "L", "@n", "Z", "@n",
            "1", "@n", "4", "@n", "7", "@n", "9", "@n", "0", "@n", "_", "@n",
            "!", "@n", "@", "@n", "#", "@n", "$", "@n", "%", "@n", "&", "@n",
            "*", "@n",
        ]
        result = tokenizer.tokenize(test3)
        assert result == expected

        # A handle IS stripped even when directly followed by punctuation.
        test4 = "@n!a @n#a @n$a @n%a @n&a @n*a"
        expected = ["!", "a", "#", "a", "$", "a", "%", "a", "&", "a", "*", "a"]
        result = tokenizer.tokenize(test4)
        assert result == expected

        # A handle directly followed by another "@" is kept (the negative
        # lookahead in the handle regex refuses to strip it).
        test5 = "@n!@n @n#@n @n$@n @n%@n @n&@n @n*@n @n@n @@n @n@@n @n_@n @n7@n @nj@n"
        expected = [
            "!", "@n", "#", "@n", "$", "@n", "%", "@n", "&", "@n", "*", "@n",
            "@n", "@n", "@", "@n", "@n", "@", "@n",
            "@n_", "@n", "@n7", "@n", "@nj", "@n",
        ]
        result = tokenizer.tokenize(test5)
        assert result == expected

        # A handle is at most 15 characters long; the overflow stays behind.
        test6 = "@abcdefghijklmnopqrstuvwxyz @abcdefghijklmno1234 @abcdefghijklmno_ @abcdefghijklmnoendofhandle"
        expected = ["pqrstuvwxyz", "1234", "_", "endofhandle"]
        result = tokenizer.tokenize(test6)
        assert result == expected

        # A handle directly followed by another handle keeps the second one.
        test7 = "@abcdefghijklmnop@abcde @abcdefghijklmno@abcde @abcdefghijklmno_@abcde @abcdefghijklmno5@abcde"
        expected = [
            "p", "@abcde", "@abcdefghijklmno", "@abcde",
            "_", "@abcde", "5", "@abcde",
        ]
        result = tokenizer.tokenize(test7)
        assert result == expected

    def test_treebank_span_tokenizer(self):
        """
        Test TreebankWordTokenizer.span_tokenize function
        """
        tokenizer = TreebankWordTokenizer()

        # Test case in the docstring
        test1 = "Good muffins cost $3.88\nin New (York).  Please (buy) me\ntwo of them.\n(Thanks)."
        expected = [
            (0, 4), (5, 12), (13, 17), (18, 19), (19, 23), (24, 26),
            (27, 30), (31, 32), (32, 36), (36, 37), (37, 38), (40, 46),
            (47, 48), (48, 51), (51, 52), (53, 55), (56, 59), (60, 62),
            (63, 68), (69, 70), (70, 76), (76, 77), (77, 78),
        ]
        result = list(tokenizer.span_tokenize(test1))
        assert result == expected

        # Test case with straight double quotation marks
        test2 = 'The DUP is similar to the "religious right" in the United States and takes a hardline stance on social issues'
        expected = [
            (0, 3), (4, 7), (8, 10), (11, 18), (19, 21), (22, 25), (26, 27),
            (27, 36), (37, 42), (42, 43), (44, 46), (47, 50), (51, 57),
            (58, 64), (65, 68), (69, 74), (75, 76), (77, 85), (86, 92),
            (93, 95), (96, 102), (103, 109),
        ]
        result = list(tokenizer.span_tokenize(test2))
        assert result == expected

        # Test case with both straight and already-converted quotations
        test3 = "The DUP is similar to the \"religious right\" in the United States and takes a ``hardline'' stance on social issues"
        expected = [
            (0, 3), (4, 7), (8, 10), (11, 18), (19, 21), (22, 25), (26, 27),
            (27, 36), (37, 42), (42, 43), (44, 46), (47, 50), (51, 57),
            (58, 64), (65, 68), (69, 74), (75, 76), (77, 79), (79, 87),
            (87, 89), (90, 96), (97, 99), (100, 106), (107, 113),
        ]
        result = list(tokenizer.span_tokenize(test3))
        assert result == expected

    def test_word_tokenize(self):
        """
        Test word_tokenize function
        """
        sentence = "The 'v', I've been fooled but I'll seek revenge."
        expected = [
            "The", "'", "v", "'", ",", "I", "'ve", "been", "fooled",
            "but", "I", "'ll", "seek", "revenge", ".",
        ]
        assert word_tokenize(sentence) == expected

        sentence = "'v' 're'"
        expected = ["'", "v", "'", "'re", "'"]
        assert word_tokenize(sentence) == expected

    def test_punkt_pair_iter(self):
        test_cases = [
            ([1, 2], [(1, 2), (2, None)]),
            ([1, 2, 3], [(1, 2), (2, 3), (3, None)]),
            ([], []),
        ]
        for test_input, expected_output in test_cases:
            actual_output = [x for x in punkt._pair_iter(test_input)]
            assert actual_output == expected_output

    def test_punkt_pair_iter_handles_stop_iteration_exception(self):
        # An exhausted iterator must not leak a StopIteration out of the
        # generator.
        it = iter([])
        gen = punkt._pair_iter(it)
        list(gen)

    def test_punkt_tokenize_words_handles_stop_iteration_exception(self):
        obj = punkt.PunktBaseClass()

        class TestPunktTokenizeWordsMock:
            def word_tokenize(self, s):
                return iter([])

        obj._lang_vars = TestPunktTokenizeWordsMock()
        # Unpacking the generator should not raise StopIteration.
        list(obj._tokenize_words("test"))

    def test_punkt_tokenize_custom_lang_vars(self):
        # Set up a custom sentence tokenizer that supports the Bengali
        # "danda" (।) as a sentence terminator.
        class BengaliLanguageVars(punkt.PunktLanguageVars):
            sent_end_chars = (".", "?", "!", "\u0964")

        obj = punkt.PunktSentenceTokenizer(lang_vars=BengaliLanguageVars())

        # Notice the danda signs.
        sentences = (
            "উপরাষ্ট্রপতি শ্রী এম ভেঙ্কাইয়া নাইডু সোমবার আই আই টি দিল্লির "
            "হীরক জয়ন্তী উদযাপনের উদ্বোধন করেছেন। অনলাইনের মাধ্যমে এই "
            "অনুষ্ঠানে কেন্দ্রীয় মানব সম্পদ উন্নয়নমন্ত্রী শ্রী রমেশ "
            "পোখরিয়াল ‘নিশাঙ্ক’ উপস্থিত ছিলেন। এই উপলক্ষ্যে উপরাষ্ট্রপতি "
            "হীরকজয়ন্তীর লোগো এবং ২০৩০-এর জন্য প্রতিষ্ঠানের লক্ষ্য ও "
            "পরিকল্পনার নথি প্রকাশ করেছেন।"
        )
        expected = [
            "উপরাষ্ট্রপতি শ্রী এম ভেঙ্কাইয়া নাইডু সোমবার আই আই টি দিল্লির "
            "হীরক জয়ন্তী উদযাপনের উদ্বোধন করেছেন।",
            "অনলাইনের মাধ্যমে এই অনুষ্ঠানে কেন্দ্রীয় মানব সম্পদ "
            "উন্নয়নমন্ত্রী শ্রী রমেশ পোখরিয়াল ‘নিশাঙ্ক’ উপস্থিত ছিলেন।",
            "এই উপলক্ষ্যে উপরাষ্ট্রপতি হীরকজয়ন্তীর লোগো এবং ২০৩০-এর জন্য "
            "প্রতিষ্ঠানের লক্ষ্য ও পরিকল্পনার নথি প্রকাশ করেছেন।",
        ]
        assert obj.tokenize(sentences) == expected

    def test_punkt_tokenize_no_custom_lang_vars(self):
        obj = punkt.PunktSentenceTokenizer()

        # The default language variables do not treat the danda (।) as a
        # sentence terminator, so the whole text stays a single sentence.
        sentences = (
            "উপরাষ্ট্রপতি শ্রী এম ভেঙ্কাইয়া নাইডু সোমবার আই আই টি দিল্লির "
            "হীরক জয়ন্তী উদযাপনের উদ্বোধন করেছেন। অনলাইনের মাধ্যমে এই "
            "অনুষ্ঠানে কেন্দ্রীয় মানব সম্পদ উন্নয়নমন্ত্রী শ্রী রমেশ "
            "পোখরিয়াল ‘নিশাঙ্ক’ উপস্থিত ছিলেন। এই উপলক্ষ্যে উপরাষ্ট্রপতি "
            "হীরকজয়ন্তীর লোগো এবং ২০৩০-এর জন্য প্রতিষ্ঠানের লক্ষ্য ও "
            "পরিকল্পনার নথি প্রকাশ করেছেন।"
        )
        assert obj.tokenize(sentences) == [sentences]

    @pytest.mark.parametrize(
        "input_text,n_sents,n_splits,lang_vars",
        [
            # Test debug_decisions on a text with two sentences, split by a dot.
            ("Subject: Some subject. Attachments: Some attachments", 2, 1, None),
            # The exclamation mark also ends a sentence.
            ("Subject: Some subject! Attachments: Some attachments", 2, 1, None),
            # A normal sentence: no split decision to report.
            ("This is just a normal sentence, just like any other.", 1, 0, None),
        ],
    )
    def punkt_debug_decisions(self, input_text, n_sents, n_splits, lang_vars=None):
        tokenizer = punkt.PunktSentenceTokenizer()
        if lang_vars is not None:
            tokenizer._lang_vars = lang_vars
        assert len(tokenizer.tokenize(input_text)) == n_sents
        assert len(list(tokenizer.debug_decisions(input_text))) == n_splits

    def test_punkt_debug_decisions_custom_end(self):
        # Test debug_decisions with a custom sentence-end character ("^").
        class ExtLangVars(punkt.PunktLanguageVars):
            sent_end_chars = (".", "?", "!", "^")

        self.punkt_debug_decisions(
            "Subject: Some subject^ Attachments: Some attachments",
            n_sents=2,
            n_splits=1,
            lang_vars=ExtLangVars(),
        )

    @pytest.mark.parametrize(
        "sentences, expected",
        [
            ("this is a test. . new sentence.", ["this is a test.", ".", "new sentence."]),
            ("This. . . That", ["This.", ".", ".", "That"]),
            ("This..... That", ["This..... That"]),
            ("This... That", ["This... That"]),
            ("This.. . That", ["This.. .", "That"]),
            ("This. .. That", ["This.", ".. That"]),
            ("This. ,. That", ["This.", ",.", "That"]),
            ("This!!! That", ["This!!!", "That"]),
            ("This! That", ["This!", "That"]),
            (
                "1. This is R .\n2. This is A .\n3. That's all",
                ["1.", "This is R .", "2.", "This is A .", "3.", "That's all"],
            ),
            (
                "1. This is R .\t2. This is A .\t3. That's all",
                ["1.", "This is R .", "2.", "This is A .", "3.", "That's all"],
            ),
            ("Hello.\tThere", ["Hello.", "There"]),
        ],
    )
    def test_sent_tokenize(self, sentences: str, expected: List[str]):
        assert sent_tokenize(sentences) == expected

    def test_string_tokenizer(self):
        sentence = "Hello there"
        tokenizer = CharTokenizer()
        assert tokenizer.tokenize(sentence) == list(sentence)
        assert list(tokenizer.span_tokenize(sentence)) == [
            (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
            (6, 7), (7, 8), (8, 9), (9, 10), (10, 11),
        ]


class TestPunktTrainer:
    def test_punkt_train(self) -> None:
        trainer = punkt.PunktTrainer()
        trainer.train("This is a test.")

    def test_punkt_train_single_word(self) -> None:
        trainer = punkt.PunktTrainer()
        trainer.train("This.")

    def test_punkt_train_no_punc(self) -> None:
        trainer = punkt.PunktTrainer()
        trainer.train("This is a test")