JL i dZddlZddlZddlZddlmZddlmZmZm Z m Z m Z m Z m Z mZddlmZddlmZdZ dZ d Z d Z d Z d Z eezezZ eezezZ eeeeeed Z dZdZdZdZdZ dZ!dZ"GddZ#ejHdejJZ& dZ'GddZ(GddZ)GddZ*Gdde*Z+Gd d!e*eZ,Gd"d#e,Z-d$Z.d)d%Z/d&Z0d'Z1e,e+fd(Z2y)*a} Punkt Sentence Tokenizer This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used. The NLTK data package includes a pre-trained Punkt tokenizer for English. >>> from nltk.tokenize import PunktTokenizer >>> text = ''' ... Punkt knows that the periods in Mr. Smith and Johann S. Bach ... do not mark sentence boundaries. And sometimes sentences ... can start with non-capitalized words. i is a good variable ... name. ... ''' >>> sent_detector = PunktTokenizer() >>> print('\n-----\n'.join(sent_detector.tokenize(text.strip()))) Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries. ----- And sometimes sentences can start with non-capitalized words. ----- i is a good variable name. (Note that whitespace from the original text, including newlines, is retained in the output.) Punctuation following sentences is also included by default (from NLTK 3.0 onwards). It can be excluded with the realign_boundaries flag. >>> text = ''' ... (How does it deal with this parenthesis?) "It should be part of the ... previous sentence." "(And the same with this one.)" ('And this one!') ... "('(And (this)) '?)" [(and this. )] ... ''' >>> print('\n-----\n'.join( ... sent_detector.tokenize(text.strip()))) (How does it deal with this parenthesis?) ----- "It should be part of the previous sentence." ----- "(And the same with this one.)" ----- ('And this one!') ----- "('(And (this)) '?)" ----- [(and this. )] >>> print('\n-----\n'.join( ... sent_detector.tokenize(text.strip(), realign_boundaries=False))) (How does it deal with this parenthesis? ----- ) "It should be part of the previous sentence. ----- " "(And the same with this one. ----- )" ('And this one! ----- ') "('(And (this)) '? ----- )" [(and this. ----- )] However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use ``PunktSentenceTokenizer(text)`` to learn parameters from the given text. :class:`.PunktTrainer` learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a ``PunktTrainer`` directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc. The algorithm for this tokenizer is described in:: Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525. N) defaultdict)AnyDictIteratorListMatchOptionalTupleUnionFreqDist) TokenizerI @))initialupper)internalr)unknownr)rlower)rr)rrzdefault decisionzknown collocation (both words)z%abbreviation + orthographic heuristicz(abbreviation + frequent sentence starterz initial + orthographic heuristicz(initial + special orthographic heuristicceZdZdZdZdZdZdZ edZ dZ e jde jZ d Z ed Z d Z d Z d ZdZdZ dZy)PunktLanguageVarsaX Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to PunktSentenceTokenizer and PunktTrainer constructors. )_re_period_context_re_word_tokenizercyNselfs Y/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/nltk/tokenize/punkt.py __getstate__zPunktLanguageVars.__getstate__scyrr!)r#states r$ __setstate__zPunktLanguageVars.__setstate__sr&).?!cddtjdj|jzS)Nz[%s])reescapejoinsent_end_charsr"s r$_re_sent_end_charsz$PunktLanguageVars._re_sent_end_charss% "''$*=*=">???r&z,:;z["\')\]}]+?(?:\s+|(?=--)|$)z[^\(\"\`{\[:;&\#\*@\)}\]\-,]c~dtjdjt|jdhz zS)Nz(?:[)\";}\]\*:@\'\({\[%s])r.r*)r/r0r1setr2r"s r$_re_non_word_charsz$PunktLanguageVars._re_non_word_charss8,ryy GGC++,u4 50   r&z (?:\-{2,}|\.{2,}|(?:\.\s){2,}\.)a( %(MultiChar)s | (?=%(WordStart)s)\S+? # Accept word characters until end is found (?= # Sequences marking a word's end \s| # White-space $| # End-of-string %(NonWord)s|%(MultiChar)s| # Punctuation ,(?=$|\s|%(NonWord)s|%(MultiChar)s) # Comma if at end of word ) | \S )c$ |jS#t$rxtj|j|j |j |jdztjtjz|_|jcYSwxYw)z?Compiles and returns a regular expression for word tokenization)NonWord MultiChar WordStart) rAttributeErrorr/compile_word_tokenize_fmtr6_re_multi_char_punct_re_word_startUNICODEVERBOSEr"s r$_word_tokenizer_rez$PunktLanguageVars._word_tokenizer_res +** * +&(jj''#66!%!:!:!%!4!4  RZZ''D #** * +s A>BBc@|jj|S)z=Tokenize a string to split off punctuation other than periods)rBfindall)r#ss r$ word_tokenizezPunktLanguageVars.word_tokenize s&&(0033r&a %(SentEndChars)s # a potential sentence ending (?=(?P %(NonWord)s # either other punctuation | \s+(?P\S+) # or whitespace and some other token ))c |jS#tj|j|j|j dztj tjz|_|jcYSxYw)zjCompiles and returns a regular expression to find contexts including possible sentence boundaries.)r8 SentEndChars)rr/r<_period_context_fmtr6r3r@rAr"s r$period_context_rez#PunktLanguageVars.period_context_resp +** * +&(jj((#66$($;$;  RZZ' 'D #** *s A,A<N)__name__ __module__ __qualname____doc__ __slots__r%r)r2propertyr3internal_punctuationr/r< MULTILINEre_boundary_realignmentr?r6r>r=rBrFrIrJr!r&r$rrs=I %NA @@!))bjj)GV15N<   5>=  + 4L+r&rz[^\W\d]c#Kt|} t|}|D] }||f|} |dfy#t$rYywxYww)z Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple. The last pair will have None as its second element. N)iternext StopIteration)iteratorprevels r$ _pair_iterr[9s^ H~HH~Rj , s ? 0? <?<?c:eZdZdZdZdZdZdZdZdZ dZ y ) PunktParameterszCStores data used to perform sentence boundary detection with Punkt.ct|_ t|_ t|_ t t |_yN)r5 abbrev_types collocations sent_startersrint ortho_contextr"s r$__init__zPunktParameters.__init__RsCE:E  !U #)- 5r&c"t|_yr_)r5r`r"s r$ clear_abbrevszPunktParameters.clear_abbrevsf Er&c"t|_yr_)r5rar"s r$clear_collocationsz"PunktParameters.clear_collocationsirhr&c"t|_yr_)r5rbr"s r$clear_sent_startersz#PunktParameters.clear_sent_startersls  Ur&c,tt|_yr_)rrcrdr"s r$clear_ortho_contextz#PunktParameters.clear_ortho_contextos(-r&c2|j|xx|zcc<yr_)rd)r#typflags r$add_ortho_contextz!PunktParameters.add_ortho_contextrs 34'r&c#K|j|}|tzrd|tzrd|tzrd|tzrd|t zrd|t zrdyyw)NzBEG-UCzMID-UCzUNK-UCzBEG-LCzMID-LCzUNK-LC)rd _ORTHO_BEG_UC _ORTHO_MID_UC _ORTHO_UNK_UC _ORTHO_BEG_LC _ORTHO_MID_LC _ORTHO_UNK_LC)r#rpcontexts r$_debug_ortho_contextz$PunktParameters._debug_ortho_contextusi$$S) ] "N ] "N ] "N ] "N ] "N ] "N #sA A"N) rKrLrMrNrergrjrlrnrrr{r!r&r$r]r]Os(M5(""#.( r&r]ceZdZdZgdZgdezZdZejdZ ejdZ ejdejZ ejdejZ d Zed Zed Zed Zed ZedZedZedZedZedZedZdZdZy) PunktTokenzXStores a token of text with annotations produced during sentence boundary detection.) parastart linestart sentbreakabbrellipsis)toktype period_finalc ||_|j||_|jd|_|j D]}t ||d|D]}t ||||y)Nr*)r _get_typerendswithr _propertiessetattr)r#rparamspropks r$rezPunktToken.__init__sjNN3' LL-$$ &D D$ % & (A D!VAY ' (r&z\.\.+$z^-?[\.,]?\d[\d,\.-]*\.?$z [^\W\d]\.$z [^\W\d]+$cV|jjd|jS)z6Returns a case-normalized representation of the token. ##number##) _RE_NUMERICsubr)r#rs r$rzPunktToken._get_types!##L#))+>>r&ct|jdkDr!|jddk(r|jddS|jS)zG The type with its final period removed if it has one. r r*N)lenrr"s r$type_no_periodzPunktToken.type_no_periods= tyy>A $))B-3"699Sb> !yyr&cJ|jr |jS|jS)ze The type with its final period removed if it is marked as a sentence break. )rrrr"s r$type_no_sentperiodzPunktToken.type_no_sentperiods! >>&& &yyr&c<|jdjS)z1True if the token's first character is uppercase.r)risupperr"s r$ first_upperzPunktToken.first_upperxx{""$$r&c<|jdjS)z1True if the token's first character is lowercase.r)rislowerr"s r$ first_lowerzPunktToken.first_lowerrr&c8|jry|jryy)Nrrnone)rrr"s r$ first_casezPunktToken.first_cases      r&cL|jj|jS)z.True if the token text is that of an ellipsis.) _RE_ELLIPSISmatchrr"s r$ is_ellipsiszPunktToken.is_ellipsiss  &&txx00r&c8|jjdS)z+True if the token text is that of a number.r)r startswithr"s r$ is_numberzPunktToken.is_numbersyy##L11r&cL|jj|jS)z-True if the token text is that of an initial.) _RE_INITIALrrr"s r$ is_initialzPunktToken.is_initials%%dhh//r&cL|jj|jS)z)True if the token text is all alphabetic.) _RE_ALPHArrr"s r$is_alphazPunktToken.is_alphas~~##DHH--r&c@tj|jS)z6True if the token is either a number or is alphabetic.) _re_non_punctsearchrr"s r$ is_non_punctzPunktToken.is_non_puncts##DII..r&c(jjk7rdtjznd}djfdjD}dj j jtj||S)z A string representation of the token that can reproduce it with eval(), which lists all the token's non-default annotations. z type=%s,r.z, c 3jK|]*}t|r|dtt|,yw)=N)getattrrepr).0pr#s r$ z&PunktToken.__repr__..s; tQc4a()* + s03z {}({},{} {}))rrrr1rformat __class__rK)r#typestrpropvalss` r$__repr__zPunktToken.__repr__s| 48993H+TYY/b99 %%   $$ NN # # N     r&c|j}|jr|dz }|jr|dz }|jr|dz }|S)zO A string representation akin to that used by Kiss and Strunk. zzz)rrrr)r#ress r$__str__zPunktToken.__str__sBhh 99 5LC == 5LC >> 5LC r&N)rKrLrMrNrrOrer/r<rrr@rrrrPrrrrrrrrrrrrr!r&r$r}r}sD$NK/+=I(2::i(L"**89K"**]BJJ7K <4I ?%%%%112200..// * r&r}cPeZdZdZdedfdZdZdeedeefdZdeddfd Z y) PunktBaseClasszP Includes common components of PunktTrainer and PunktSentenceTokenizer. Nc^| t}| t}||_||_||_yr_)rr]_params _lang_vars_Token)r# lang_vars token_clsrs r$rezPunktBaseClass.__init__s7  )+I >$&F #  #r&c#DKd}|jdD]w}|jrct|jj |} t |}|j||dd}|D]}|j|vd}yy#t $rYwxYww)aB Divide the given text into tokens, using the punkt word segmentation regular expression, and generate the resulting list of tokens augmented as three-tuples with two boolean values for whether the given token occurs at the start of a paragraph or a new line, respectively. F T)r~rN)splitstriprUrrFrVrWr)r# plaintextr~line line_toksrs r$_tokenize_wordszPunktBaseClass._tokenize_words*s OOD) !Dzz| !>!>t!DE y/Ckk#dkKK! $+C++c**+!  ! %s*A B  B8B  BB BB tokensreturnc#DK|D]}|j||yw)a Perform the first pass of annotation, which makes decisions based purely based on the word type of each word: - '?', '!', and '.' are marked as sentence breaks. - sequences of two or more periods are marked as ellipsis. - any word ending in '.' that's a known abbreviation is marked as an abbreviation. - any other word ending in '.' is marked as a sentence break. Return these annotations as a tuple of three sets: - sentbreak_toks: The indices of all sentence breaks. - abbrev_toks: The indices of all abbreviations. - ellipsis_toks: The indices of all ellipsis marks. N)_first_pass_annotationr#raug_toks r$_annotate_first_passz#PunktBaseClass._annotate_first_passHs*& G  ' ' 0M s rc|j}||jjvrd|_y|jrd|_y|j r|jdss|ddj|jjvs;|ddjjdd|jjvrd|_ yd|_y)zC Performs type-based annotation on a single token. Tz..Nr-) rrr2rrrrrrrr`rr)r#rrs r$rz%PunktBaseClass._first_pass_annotation_s kk $//00 0 $G   #G   ! !#,,t*<CR DLL$=$==s8>>#))#.r2dll6O6OO#  %)!r&) rKrLrMrNr}rerrrrr!r&r$rrsL"&D #!<z* * .jTr&rceZdZdZdddefdZdZdZ dZ dZ dZ d Z dZ dZ d Z dd Zdd Zd ZdZddZ ddZdZdZdZdZdZedZedZdZdZdZdZ dZ!y) PunktTrainerz.s;qDKKN;N)rr)r#rrrs` r$ train_tokenszPunktTrainer.train_tokenss1 ;F;WE   " "7 + r&cd|_t|}|D]E}|j|jxxdz cc<|js1|xj dz c_G|j |}|j|D]\}}}||jk\r>|s|jjj||sAtd|dd|T|rW|jjj||std|dd|t|j|}|j||xj |j#|z c_t%|D]\}} |jr| s|j'|| rI|jjj|j(|rtd|jz|j+| |r!|j,| jxxdz cc<|j/|| s|j0|j(| j2fxxdz cc<y)NFr z Abbreviation: [6.4f] z Removed abbreviation: [z Rare Abbrev: %s)rlistrrrr _unique_types_reclassify_abbrev_typesABBREVrr`addprintremover_get_orthography_datar_get_sentbreak_countr[_is_rare_abbrev_typer_is_potential_sent_starterr_is_potential_collocationrr) r#rrr unique_typesrscoreis_addaug_tok1aug_tok2s r$rzPunktTrainer._train_tokenss&f  +G   W\\ *a / *##%%*% + ))&1 #'#@#@#N P D% #LL--11$7 1%RvFGLL--44T: 9%RvNO Pd//78 ""6* !:!:6!BB#-V"4  Hh((((8< ))--h.E.EF- =>..xB((71<7--hA'',,h.I.IJ! r&c@|Dchc]}|jc}Scc}wr_)rrs r$rzPunktTrainer._unique_types-s,23 333sc |jj|jD]?\}}|jjj ||s.t d|dd|A|jj |jD]G\\}}}|jjj ||f|s3t d|dd|d|Id|_ y)z~ Uses data that has been gathered in training to determine likely collocations and sentence starters. z Sent Starter: [rrz Collocation: [+TN) rrl_find_sent_startersrbrrrj_find_collocationsrar)r#rrplog_likelihoodtyp1typ2s r$rzPunktTrainer.finalize_training0s ((*#'#;#;#= J C LL & & * *3 /).)>bHI J ''),0,C,C,E S (LT4. LL % % ) )4, 7((=RxqQR S r&c|dkDrr|jj}|jj|jD]3}|j|}||k\s|||jj|<5|j |j||_|j |j ||_|j |j ||_y)a Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training. Entries occurring above the given thresholds will be retained. r N)rrdrnr_freq_thresholdrr)r# ortho_thresh type_thresh colloc_thressentstart_threshold_ocrcounts r$freq_thresholdzPunktTrainer.freq_thresholdGs ! \\//F LL , , .'' B((-L(6 count_removed). rr Nr )r#fdist thresholdr num_removedrrs r$rzPunktTrainer._freq_threshold`s]j  "C#JEy q CE!  " D [  r&cd}t|}|D]}|jr|dk7rd}|jr|dk(rd}|j}tj ||j fd}|r|jj|||jr|js|jsd}d}|js |jrd}d}y)z Collect information about whether each token type occurs with different case patterns (i) overall, (ii) at sentence-initial positions, and (iii) at sentence-internal positions. rrrrN)rr~rr _ORTHO_MAPgetrrrrrrrrr)r#rrzrrprqs r$rz"PunktTrainer._get_orthography_datavsf %G   W %9#  W %:#,,C>>7G,>,>"?CD ..sD9  ))W-?-?'G'G!!W\\#$? %r&c#K|D]B}tj|r|dk(r|jdr!||jjvrI|dd}d}n||jjvrjd}|j ddz}t ||z dz}|j|dz}|j|}|j||z|j||jj}tj| } |} t|jxstj|| } || z| z| z} || |fEyw)a (Re)classifies each given token if - it is period-final and not a known abbreviation; or - it is not period-final and is otherwise a known abbreviation by checking whether its previous classification still holds according to the heuristics of section 3. Yields triples (abbr, score, is_add) where abbr is the type in question, score is its log-likelihood with penalties applied, and is_add specifies whether the present type is a candidate for inclusion or exclusion as an abbreviation, such that: - (is_add and score >= 0.3) suggests a new abbreviation; and - (not is_add and score < 0.3) suggests excluding an abbreviation. rr*NrTFr )rrrrr`rrr_dunning_log_likelihoodrNmathexprcIGNORE_ABBREV_PENALTYpow) r#typesrpr  num_periodsnum_nonperiodscount_with_periodcount_without_periodrf_length f_periods f_penaltyr s r$rz%PunktTrainer._reclassify_abbrev_typessf$/ %C!'',|0C||C $,,333#2hdll777))C.1,K X 3a7N!% 0 0s ; #'#3#3C#8 !99!$88%%!  ""$ Nxx0H#ID667488!5 5<I#X- 9IEEuf$ $_/ %sE E c|jjd|jD}|j|D];\}}}||jk\s|jj j |=y)z Recalculates abbreviations given type frequencies, despite no prior determination of abbreviations. This fails to include abbreviations otherwise found as "rare". c3JK|]}|s|jds|yw)r*N)r)rrps r$rz1PunktTrainer.find_abbrev_types..sO#SS\\#=N#Os ###N)rrgrrrr`r)r#rrr _is_adds r$find_abbrev_typeszPunktTrainer.find_abbrev_typessi ""$O!1!1O$($A$A&$I 4 D% # ))--d3 4r&c|js |jsy|j}|j||j|ddz}||jj vs||j k\ry|jdd|jjvry|jr:|j}|jj|}|tzr |tzsyyyy)a A word type is counted as a rare abbreviation if... - it's not already marked as an abbreviation - it occurs fewer than ABBREV_BACKOFF times - either it is followed by a sentence-internal punctuation mark, *or* it is followed by a lower-case word that sometimes appears with upper case, but never occurs with lower case at the beginning of sentences. FNrr T)rrrrrr`ABBREV_BACKOFFrrrQrrdrtru)r#cur_toknext_tokrprrtyp2ortho_contexts r$rz!PunktTrainer._is_rare_abbrev_types <?)r)log) count_acount_bcount_abr(p1p2 null_hypoalt_hypo likelihoods r$r'z$PunktTrainer._dunning_log_likelihood(sq[ txxT 22g6HDHH "HtOM 6  dhhrl*g.@DHHSSUXDV-VV) j  r&c||z }||z } ||z ||z z } |tj|z||z tjd|z zz} ||z tj|z||z |z |ztjd|z zz}||k(s |dks|dk\rd} n7|tj|z||z tjd|z zz} ||k(s |dks|dk\rd} n@||z tj|z||z |z |ztjd|z zz} ||z| z | z } d| zS#t$rd}Y2wxYw#t$rd}Y wxYw#t$rd}YwxYw)a= A function that will just compute log-likelihood estimate, in the original paper it's described in algorithm 6 and 7. This *should* be the original Dunning log-likelihood values, unlike the previous log_l function where it used modified Dunning log-likelihood values r r?rr@)ZeroDivisionErrorr)rA ValueError) rBrCrDr(rrErFsummand1summand2summand3summand4rIs r$_col_log_likelihoodz PunktTrainer._col_log_likelihood;s aK   H$W5B $((1+-81CtxxPSVWPWGX0XXH (*dhhqk9G g%0q!=""H h "'R1WH$((2,.'H2DbI2H h "'R1WH(*dhhrl:G g%0r">##H(83h> j  ?! B   H  H s5 D;7E AE; E  E  EE E-,E-c|jxsD|jxr |jxs(|jxr|jxs |j xr|j xr |j S)zt Returns True if the pair of tokens may form a collocation given log-likelihood statistics. )INCLUDE_ALL_COLLOCSINCLUDE_ABBREV_COLLOCSrrrrr)r#r r s r$rz&PunktTrainer._is_potential_collocationnso((X//AHMMX&&VH,>,>,U(BUBU & %%  & %% r&c#fK|jD] } |\}}||jjvr#|j|}|j||j|dzz}|j||j|dzz}|dkDs||dkDs|j |cxkrt ||ksn|j||||jj}||jk\s|jj|z ||z kDs||f|fy#t$rYwxYww)zI Generates likely collocations and their log-likelihood. r*r N) r TypeErrorrrbrMIN_COLLOC_FREQminrQr( COLLOCATION)r#r-rr col_count typ1_count typ2_countrs r$rzPunktTrainer._find_collocations}s=,, 7E " dt||111//6I))$/$2B2B4#:2NNJ))$/$2B2B4#:2NNJQN((9SJ 8SS!%!9!9 It7G7G7I7I7K""T%5%55$$&&(:5 Y8NN,661 7  sFD1D!A1D1 D1D11>D10#D1 D1! D.*D1-D..D1cp|jxr)|jxs |j xr |jS)z Returns True given a token and the token that precedes it if it seems clear that the token is beginning a sentence. )rrrr)r#r;prev_toks r$rz'PunktTrainer._is_potential_sent_starters<    !''>8+>+>? !   r&c#K|jD]}|s|j|}|j||j|dzz}||kr=|j|j|||jj }||j k\s|jj |jz ||z kDs||fyw)z~ Uses collocation heuristics for each candidate token to determine if it frequently starts sentences. r*N)rrrQrr( SENT_STARTER)r#rptyp_at_break_count typ_countrs r$rz PunktTrainer._find_sent_starterss ++ *C!%!9!9#!> ((-0@0@s0KKI--!55%%"  ""$ N$"3"33$$&&(4+@+@@001>))- *sBC -C  C c&td|DS)zj Returns the number of sentence breaks marked in a given set of augmented tokens. c3:K|]}|jsdyw)r N)r)rrs r$rz4PunktTrainer._get_sentbreak_count..s@g.?.?1@s)sumr#rs r$rz!PunktTrainer._get_sentbreak_counts @F@@@r&)FTF)rrrr)"rKrLrMrNr}rerrr+r:rYr`rSrTrWrrrrrrrrrr8r staticmethodr'rQrrrrrr!r&r$rr{sFu &;PF<!NLKL& L#$  OO ,,;z40OP 2,*%`A%F 4)`!!$,!,!d  7D  *:Ar&rc eZdZdZdddefdZd dZd!dedede efd Z dede e ee ffd Z d!dedede eeeffd Z d!dedede efd Zdedefd Zdede eeeffdZdede efdZdede ede efdZdedefdZdede efdZde ede efdZde ede efdZdede ede efdZde eddfdZedZ de ede efdZ!dede"ede"efdZ#dede$eeffdZ%y)"PunktSentenceTokenizera' A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages. NFchtj||||r|j|||_yy)z train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object. rN)rrerrrs r$rezPunktSentenceTokenizer.__init__s3  YO ::j':DL r&ct|ts|St||j|jj S)z Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance. r) isinstancestrrrrr)r#rrs r$rzPunktSentenceTokenizer.trains7 *c*  $//T[[ *, r&rrealign_boundariesrc8t|j||S)zM Given a text, returns a list of the sentences in that text. )rsentences_from_text)r#rros r$tokenizezPunktSentenceTokenizer.tokenizesD,,T3EFGGr&c#K|j|D]\}}|j|}t|j|}|rx|djj |j jsF|jd|r3|djj |j jsF|jdz ||dj|djt|djt|dj|dj|jj v|j#|dt%|jj'|dj|dj|djf|jj(v|j+|d|dxst,|dj.d yw)z Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made. See format_debug_decision() to help make this output readable. rr ) period_indexrtype1type2type1_in_abbrstype1_is_initialtype2_is_sent_startertype2_ortho_heuristictype2_ortho_contexts collocationreasonbreak_decisionN)_match_potential_end_contextsrrrrrrr2popendrboolrrrrrb_ortho_heuristicr5r{ra_second_pass_annotationREASON_DEFAULT_DECISIONr)r#rr decision_textrs r$debug_decisionsz&PunktSentenceTokenizer.debug_decisionss%)$F$Ft$L  E=))-8F$33F;)/)E)E<<--*.)-)>)>vay)I(+LL55fQi6R6RS)1I001I00 <<,,  - 66vay&)L+*"()"5"5)   s B=G-D-G-c#K|j|}|r|j||}|D]}|j|jfyw)z^ Given a text, generates (start, end) spans of sentences in the text. N)_slices_from_text_realign_boundariesstartstop)r#rroslicessentences r$ span_tokenizez$PunktSentenceTokenizer.span_tokenize&sO''- --dF;F 2H>>8==1 1 2sAA c^|j||Dcgc] \}}||| c}}Scc}}w)z Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, includes in the sentence closing punctuation that follows the period. )r)r#rrorEes r$rqz*PunktSentenceTokenizer.sentences_from_text3s0'+&8&8?Q&RSdaQq SSSs)crtt|dz ddD]}||tjvs|cSy)z Given a text, find the index of the *last* occurrence of *any* whitespace character, i.e. " ", " ", " ", " ", etc. If none is found, return 0. r rr)rangerstring whitespace)r#ris r$_get_last_whitespace_indexz1PunktSentenceTokenizer._get_last_whitespace_index>s@ s4y1}b"- AAw&+++ r&c#JKtdd}d}|jjj|D]}||j|j }|j |}|r||jdzz }n |j }t||j }|rE|j|j kr,||||jz|jdzf|}|}|r-||||jz|jdzfyyw)a Given a text, find the matches of potential sentence breaks, alongside the contexts surrounding these sentence breaks. Since the fix for the ReDOS discovered in issue #2866, we no longer match the word before a potential end of sentence token. Instead, we use a separate regex for this. As a consequence, `finditer`'s desire to find non-overlapping matches no longer aids us in finding the single longest match. Where previously, we could use:: >>> pst = PunktSentenceTokenizer() >>> text = "Very bad acting!!! I promise." >>> list(pst._lang_vars.period_context_re().finditer(text)) # doctest: +SKIP [] Now we have to find the word before (i.e. 'acting') separately, and `finditer` returns:: >>> pst = PunktSentenceTokenizer() >>> text = "Very bad acting!!! I promise." >>> list(pst._lang_vars.period_context_re().finditer(text)) # doctest: +NORMALIZE_WHITESPACE [, , ] So, we need to find the word before the match from right to left, and then manually remove the overlaps. That is what this method does:: >>> pst = PunktSentenceTokenizer() >>> text = "Very bad acting!!! I promise." >>> list(pst._match_potential_end_contexts(text)) [(, 'acting!!! I')] :param text: String of one or more sentences :type text: str :return: Generator of match-context tuples. :rtype: Iterator[Tuple[Match, str]] rNr after_tok)slicerrJfinditerrrrgroup)r#rprevious_sliceprevious_matchr before_textindex_after_last_spaceprev_word_slices r$rz4PunktSentenceTokenizer._match_potential_end_contextsIsANq!__668AA$G -E~22U[[]CK%)%D%D[%Q "%&.*=*=*AA&)7)=)=&#$:EKKMJO ."5"59N9N"N"($**,-$**;78 #N,N- -2 ^$ &&() &&{34  sD!D#c#NKd}|j|D]f\}}|j|st||j|j dr|j d}W|j}ht|t |jyw)Nrr<)rtext_contains_sentbreakrrrrrrstrip)r#r last_breakrrzs r$rz(PunktSentenceTokenizer._slices_from_texts "@@F -NE7++G4J 44;;z*!&Z!8J"'J -JDKKM 233s +B%A7B%rc #Kd}t|D]\}}t|j|z|j}|s ||r|5|jj j ||}|r\t|j|jt|jdjz|j}d}||s|yw)a@ Attempts to realign punctuation that falls after the period but should otherwise be included in the same sentence. For example: "(Sent1.) Sent2." will otherwise be split as:: ["(Sent1.", ") Sent1."]. This method will produce:: ["(Sent1.)", "Sent2."]. rN) r[rrrrrSrrrrr)r#rrrealign sentence1 sentence2ms r$rz*PunktSentenceTokenizer._realign_boundariess$.v$6 $ Iyioo7HI ?#O77==d9oNAIOOY__s1771:CTCTCV?W-WXX%%' ?#O $s CCCcxd}|j|j|D]}|ry|jsd}y)zK Returns True if the given text includes a sentence break. FT)_annotate_tokensrr)r#rfoundrs r$rz.PunktSentenceTokenizer.text_contains_sentbreaksE(()=)=d)CD C}}   r&cf|j|j|}|j||S)z Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as ``sentences_from_text``. )rr_build_sentence_list)r#rrs r$sentences_from_text_legacyz1PunktSentenceTokenizer.sentences_from_text_legacys2 &&t';';D'AB((v66r&rc#Ktjfd|D}g}|D]0}|j|j|js+|g}2|r|yyw)zw Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence. c3@K|]}j|ywr_rrs r$rz?PunktSentenceTokenizer.sentences_from_tokens..s+KqDKKN+KrN)rUrappendrr)r#rrrs` r$sentences_from_tokensz,PunktSentenceTokenizer.sentences_from_tokenssjd+++KF+KKL G OOGKK (     N s AA&A&cJ|j|}|j|}|S)z Given a set of tokens augmented with markers for line-start and paragraph-start, returns an iterator through those tokens with full annotation including predicted sentence breaks. )r_annotate_second_passrfs r$rz'PunktSentenceTokenizer._annotate_tokenss-**62 ++F3  r&c#Kd}tjd}d}|D]}|j}|j||j }|t |z }|||t |z|k7rOdj d|D} tj| j||} | r| j }|||t |z|k(sJ|t |z }|r||z }||z }|js|d}|r|yyw)z Given the original text and the list of augmented word tokens, construct and return a tokenized list of sentence strings. rz\s*r.c3FK|]}tj|ywr_)r/r0)rcs r$rz>PunktSentenceTokenizer._build_sentence_list..s!<1"))A,!> ##DLL4N4N(N%)"88 SL0#33H=O%'%*" $ !EE@@  9,"((33H= I%*" $ IIr&rc|j|jvry|jj|j}|j r|t zr |tzsy|jr|tzs |tzsyy)zR Decide whether the given token is the first token in a sentence. FTr) r PUNCTUATIONrrdrrrrur _ORTHO_UCrw)r#rrds r$rz'PunktSentenceTokenizer._ortho_heuristicss ;;$** * 2273M3MN   *"]2    Y & 0Mr&rg)T)&rKrLrMrNr}rerrnrrrrrrrrr rcrrqrrrrrrrrrrrrtuplerrr rr rr!r&r$rjrjsDu  ;  HSHdHd3iH "C"HT#s(^,D"J59 2 2-1 2 %S/ " 259 T T-1 T c T s s H#H(5PSCT:UHT 4c 4huo 4$$!)%$ %$@ C D 7s7x}7z* * "x ';@T*66!)*!56 #6r ,8J/ ,D ,"/K z*  *  N"N.6z.BN #N` uT3Y7Gr&rjc&eZdZdZddZddZdZy)PunktTokenizerzU Punkt Sentence Tokenizer that loads/saves its parameters from/to data files cPtj||j|yr_)rjre load_lang)r#langs r$rezPunktTokenizer.__init__s''- tr&cVddlm}|d|d}t||_||_y)Nr)findztokenizers/punkt_tab//) nltk.datarload_punkt_paramsr_lang)r#rrlang_dirs r$rzPunktTokenizer.load_langs,"/vQ78(2  r&cLt|jd|jy)Nz/tmp/)dir)save_punkt_paramsrrr"s r$ save_paramszPunktTokenizer.save_paramss$,,eDJJ<,@Ar&N)english)rKrLrMrNrerrr!r&r$rrsBr&rc ddlm}|}t}t|dd5}t |j ||_dddt|dd5}|j||_dddt|dd5}|j||_ dddt|dd5}|j||_ ddd|S#1swYxYw#1swYuxYw#1swYRxYw#1swY|SxYw) Nr) PunktDecoder/collocations.tabzutf-8)encoding/sent_starters.txt/abbrev_types.txt/ortho_context.tab) nltk.tabdatarr]rr5tab2tupsratxt2setrbr` tab2intdictrd)rrpdecrfs r$rrs) >D  F  +,w ?41!$--"234  ,- @/A#||A/  +,w ?.1"ll1o.  ,- @3A#//23 M44//..3 Ms/ C C+C7>DC(+C47DD cddlm}ddlm}ddlm}||s|||}t |dd5}|j|j|jdddt |dd5}|j|j|jdddt |dd5}|j|j|jdddt |d d5}|j|j|jdddy#1swYxYw#1swYxYw#1swYexYw#1swYyxYw) Nr)mkdir)isdir) TabEncoderrrrrr)osros.pathrrrrrtups2tabraset2txtrbr` ivdict2tabrd)rrrrrtencrs r$rrsA' : c zdemo..s0"**^R\\:>>r1EMMdTWXr&TN)rSrrrqr)rtok_cls train_clscleanuptrainersbdrs r$demor sd Y kG"&G MM$ '$$& 'C++D1! gh !r&)z/tmp/punkt_tab)3rNr)r/r collectionsrtypingrrrrrr r r nltk.probabilityr nltk.tokenize.apirrtrurvrwrxryrrr$rrrrrrrrr<r@rr[r]r}rrrjrrrrrr r!r&r$rsvX| #KKK%( D A K D A K M )M 9 3 M )M 9 3((''('  *-;*Q'$N!-O*,N).3o+o+d :rzz2 F,33vDDX]]JT A>T Axo^ZodB+B("<2 *. !r&