JL i?l<dZdZddlZddlmZGddeZdZy)a Porter Stemmer This is the Porter stemming algorithm. It follows the algorithm presented in Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137. with some optional deviations that can be turned on or off with the `mode` argument to the constructor. Martin Porter, the algorithm's inventor, maintains a web page about the algorithm at https://www.tartarus.org/~martin/PorterStemmer/ which includes another Python implementation and other implementations in many languages. plaintextN)StemmerIceZdZdZdZdZdZefdZdZdZ dZ d Z d Z d Z d Zd ZdZdZdZdZdZdZdZdZddZdZy) PorterStemmeraY A word stemmer based on the Porter stemming algorithm. Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137. See https://www.tartarus.org/~martin/PorterStemmer/ for the homepage of the algorithm. Martin Porter has endorsed several modifications to the Porter algorithm since writing his original paper, and those extensions are included in the implementations on his website. Additionally, others have proposed further improvements to the algorithm, including NLTK contributors. There are thus three modes that can be selected by passing the appropriate constant to the class constructor's `mode` attribute: - PorterStemmer.ORIGINAL_ALGORITHM An implementation that is faithful to the original paper. Note that Martin Porter has deprecated this version of the algorithm. Martin distributes implementations of the Porter Stemmer in many languages, hosted at: https://www.tartarus.org/~martin/PorterStemmer/ and all of these implementations include his extensions. He strongly recommends against using the original, published version of the algorithm; only use this mode if you clearly understand why you are choosing to do so. - PorterStemmer.MARTIN_EXTENSIONS An implementation that only uses the modifications to the algorithm that are included in the implementations on Martin Porter's website. 
He has declared Porter frozen, so the behaviour of those implementations should never change. - PorterStemmer.NLTK_EXTENSIONS (default) An implementation that includes further improvements devised by NLTK contributors or taken from other modified implementations found on the web. For the best stemming, you should use the default NLTK_EXTENSIONS version. However, if you need to get the same results as either the original algorithm or one of Martin Porter's hosted versions for compatibility with an existing implementation or dataset, you can use one of the other modes instead. NLTK_EXTENSIONSMARTIN_EXTENSIONSORIGINAL_ALGORITHMc T||j|j|jfvr td||_|j|jk(rFddgdgdgdgdgdd gd d gd d gdgdgdgdgd }i|_|D]}||D]}||j |<t gd|_y)NzwMode must be one of PorterStemmer.NLTK_EXTENSIONS, PorterStemmer.MARTIN_EXTENSIONS, or PorterStemmer.ORIGINAL_ALGORITHMskyskiesdyinglyingtyingnewsinningsinningoutingsoutingcanningscanninghoweproceedexceedsucceed) r dielietierrrrrrrr)aeiou)rrr ValueErrormodepool frozensetvowels)selfr$irregular_formskeyvals V/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/nltk/stem/porter.py__init__zPorterStemmer.__init__Vs    " "  # #  3   99,, , w'yyy$h/$h/& 2%;#*%; ODI& )*3/)C%(DIIcN) ) 9: cp|||jvry||dk(r|dk(ry|j||dz  Sy)aReturns True if word[i] is a consonant, False otherwise A consonant is defined in the paper as follows: A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term `consonant' is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel. FyrT)r' _is_consonant)r(wordr s r,r2zPorterStemmer._is_consonant~sI 7dkk ! 7c>Av--dAE:::r.cd}tt|D]}|j||r|dz }|dz }!|jdS)aReturns the 'measure' of stem, per definition in the paper From the paper: A consonant will be denoted by c, a vowel by v. A list ccc... 
of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms: CVCV ... C CVCV ... V VCVC ... C VCVC ... V These may all be represented by the single form [C]VCVC ... [V] where the square brackets denote arbitrary presence of their contents. Using (VC){m} to denote VC repeated m times, this may again be written as [C](VC){m}[V]. m will be called the \measure\ of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples: m=0 TR, EE, TREE, Y, BY. m=1 TROUBLE, OATS, TREES, IVY. m=2 TROUBLES, PRIVATE, OATEN, ORRERY. cvvc)rangelenr2count)r(stem cv_sequencer s r,_measurezPorterStemmer._measures[D s4y! #A!!$*s" s"  #  &&r.c*|j|dkDS)Nrr>)r(r<s r,_has_positive_measurez#PorterStemmer._has_positive_measures}}T"Q&&r.c^tt|D]}|j||ryy)z1Returns True if stem contains a vowel, else FalseTF)r9r:r2)r(r<r s r,_contains_vowelzPorterStemmer._contains_vowels2s4y! A%%dA. r.cxt|dk\xr+|d|dk(xr|j|t|dz S)zjImplements condition *d from the paper Returns True if word ends with a double consonant r1r:r2r(r3s r,_ends_double_consonantz$PorterStemmer._ends_double_consonantsF IN 8RDH$ 8""4TQ7 r.ct|dk\xrh|j|t|dz xrH|j|t|dz  xr'|j|t|dz xr|ddvxsR|j|jk(xr7t|dk(xr'|j|d xr|j|dS)zImplements condition *o from the paper From the paper: *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP). 
rEr1rF)wxr0r)r:r2r$rrIs r, _ends_cvczPorterStemmer._ends_cvcs IN 0""4TQ7 0&&tSY];; 0""4TQ7 0R/   II-- - ,D Q ,&&tQ// ,""4+ r.cj|j|sJd|dk(r||zS|dt| |zS)z-Replaces `suffix` of `word` with `replacementz(Given word doesn't end with given suffixr5N)endswithr:)r(r3suffix replacements r,_replace_suffixzPorterStemmer._replace_suffixsD}}V$P&PP$ R<+% %3v;,'+5 5r.c|D]q}|\}}}|dk(r+|j|r|dd}|||r||zcS|cS|j|sK|j||d}|||r||zcS|cS|S)aApplies the first applicable suffix-removal rule to the word Takes a word and a list of suffix-removal rules represented as 3-tuples, with the first element being the suffix to remove, the second element being the string to replace it with, and the final element being the condition for the rule to be applicable, or None if the rule is unconditional. *dNrGr5)rJrQrT)r(r3rulesrulerRrS conditionr<s r,_apply_rule_listzPorterStemmer._apply_rule_lists D-1 *FK~$"="=d"CCRy$ $+-- K}}V$++D&"=$ $+-- K " r.c|j|jk(r2|jdr!t|dk(r|j |ddS|j |gdS)aImplements Step 1a from "An algorithm for suffix stripping" From the paper: SSES -> SS caresses -> caress IES -> I ponies -> poni ties -> ti SS -> SS caress -> caress S -> cats -> cat iesie))ssesssN)r\r N)r`r`N)sr5N)r$rrQr:rTrZrIs r,_step1azPorterStemmer._step1as\ 99,, ,}}U#D Q++D%>>$$    r.c jjk(rE|jdr4t|dk(rj |ddSj |ddS|jdr.j |dd}j |dkDr|dzS|Sd }d D]<}|j|sj ||dj s:d }n|s|Sjd d dddfdfddfdfgS)a=Implements Step 1b from "An algorithm for suffix stripping" From the paper: (m>0) EED -> EE feed -> feed agreed -> agree (*v*) ED -> plastered -> plaster bled -> bled (*v*) ING -> motoring -> motor sing -> sing If the second or third of the rules in Step 1b is successful, the following is done: AT -> ATE conflat(ed) -> conflate BL -> BLE troubl(ed) -> trouble IZ -> IZE siz(ed) -> size (*d and not (*L or *S or *Z)) -> single letter hopp(ing) -> hop tann(ed) -> tan fall(ing) -> fall hiss(ing) -> hiss fizz(ed) -> fizz (m=1 and *o) -> E fail(ing) -> fail fil(ing) -> file The rule to map to a single letter causes the removal of one of the 
double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4. iedr]r^r eedr5reeF)edingT)atateN)blbleN)izizeNrVrFcddvS)NrF)lraz)r<intermediate_stems r,z'PorterStemmer._step1b..xs!22!6o!Mr.rcRj|dk(xrj|SNr1)r>rOr<r(s r,rtz'PorterStemmer._step1b..~s#$--"5":"St~~d?Sr.)r$rrQr:rTr>rCrZ)r(r3r<rule_2_or_3_succeededrRrss` @r,_step1bzPorterStemmer._step1b/s<F 99,, ,}}U#t9>//eTBB//eSAA == ''eR8D}}T"Q&d{" %# F}}V$$($8$8vr$J!''(9:,0)  %K$$ ###%b)MT   r.cfd}fd}j|ddjjk(r|fgS|fgS)zImplements Step 1c from "An algorithm for suffix stripping" From the paper: Step 1c (*v*) Y -> I happy -> happi sky -> sky c`t|dkDxrj|t|dz S)a This has been modified from the original Porter algorithm so that y->i is only done when y is preceded by a consonant, but not if the stem is only a single consonant, i.e. (*c and not c) Y -> I So 'happy' -> 'happi', but 'enjoy' -> 'enjoy' etc This is a much better rule. Formerly 'enjoy'->'enjoi' and 'enjoyment'->'enjoy'. Step 1c is perhaps done too soon; but with this modification that no longer really matters. Also, the removal of the contains_vowel(z) condition means that 'spy', 'fly', 'try' ... stem to 'spi', 'fli', 'tri' and conflate with 'spied', 'tried', 'flies' ... 
r1rHrws r,nltk_conditionz-PorterStemmer._step1c..nltk_conditions-&t9q=LT%7%7c$i!m%L Lr.c&j|S)N)rCrws r,original_conditionz1PorterStemmer._step1c..original_conditions''- -r.r0r )rZr$r)r(r3r|r~s` r,_step1czPorterStemmer._step1csj M* .$$  99(<(<<'    0    r.cjjk(rUjdrDjj ddr"j j ddSddjf}ddjf}dd jfd d jfd d jfddjfddjfjj k(r|n|ddjfddjfddjfddjfddjfdd jfdd jfddjfddjfddjfd djfd!djfd"djfd#djfg}jjk(r6|jd$djf|jd%d&fd'fjjk(r|jd%d&jfj|S)(aImplements Step 2 from "An algorithm for suffix stripping" From the paper: Step 2 (m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational -> rational (m>0) ENCI -> ENCE valenci -> valence (m>0) ANCI -> ANCE hesitanci -> hesitance (m>0) IZER -> IZE digitizer -> digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli -> radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli - > vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator -> operate (m>0) ALISM -> AL feudalism -> feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti -> formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensible allir5alblirlabliableationalrjtionaltionencienceancianceizerrnentlientelirousliousizationationatoralismivenessivefulnessfulousnessalitiivitibilitifullilogilogc,jddS)N)rAr<r(r3s r,rtz&PorterStemmer._step2..sT-G-GSb -Rr.) r$rrQrArT_step2r appendrrZ)r(r3bli_rule abli_rulerWs`` r,rzPorterStemmer._step2s]< 99,, ,}}V$)C)C$$T626*{{4#7#7fd#KLL5$"<"<=VT%?%?@ t99 : vt99 : VT77 8 VT77 8 UD66 7d&=&==I8 T455 6 eT77 8 C33 4 eT77 8 t99 : eT77 8 UD66 7 dD66 7 t99 : t99 : t99 : dD66 7 eT77 8 ud88 9) . 99,, , LL'5$*D*DE F LL RS  99.. . 
LL&%)C)CD E$$T511r.c |j|dd|jfdd|jfdd|jfdd|jfdd|jfd d|jfd d|jfgS) aVImplements Step 3 from "An algorithm for suffix stripping" From the paper: Step 3 (m>0) ICATE -> IC triplicate -> triplic (m>0) ATIVE -> formative -> form (m>0) ALIZE -> AL formalize -> formal (m>0) ICITI -> IC electriciti -> electric (m>0) ICAL -> IC electrical -> electric (m>0) FUL -> hopeful -> hope (m>0) NESS -> goodness -> good icateicativer5alizericitiicalrness)rZrArIs r,_step3zPorterStemmer._step3s$$ $ : :;"d889$ : :;$ : :;t99:D667T778   r.cfd}j|dd|fdd|fdd|fdd|fdd|fdd|fd d|fd d|fd d|fd d|fd d|fddfdfdd|fdd|fdd|fdd|fdd|fdd|fdd|fgS)aImplements Step 4 from "An algorithm for suffix stripping" Step 4 (m>1) AL -> revival -> reviv (m>1) ANCE -> allowance -> allow (m>1) ENCE -> inference -> infer (m>1) ER -> airliner -> airlin (m>1) IC -> gyroscopic -> gyroscop (m>1) ABLE -> adjustable -> adjust (m>1) IBLE -> defensible -> defens (m>1) ANT -> irritant -> irrit (m>1) EMENT -> replacement -> replac (m>1) MENT -> adjustment -> adjust (m>1) ENT -> dependent -> depend (m>1 and (*S or *T)) ION -> adoption -> adopt (m>1) OU -> homologou -> homolog (m>1) ISM -> communism -> commun (m>1) ATE -> activate -> activ (m>1) ITI -> angulariti -> angular (m>1) OUS -> homologous -> homolog (m>1) IVE -> effective -> effect (m>1) IZE -> bowdlerize -> bowdler The suffixes are now removed. All that remains is a little tidying up. 
c,j|dkDSrvr@rws r,rtz&PorterStemmer._step4..=sDMM$$7!$;r.rr5rrerrribleantementmentrionc>j|dkDxr|ddvS)Nr1rF)ratr@rws r,rtz&PorterStemmer._step4..Qs#t!4q!8!ST"X=Sr.ouismrjitirrrnrZ)r(r3 measure_gt_1s` r,_step4zPorterStemmer._step4!s8< $$ r<(\*\*r<(r<(\*\*L)"l+\*L)S r<(L)L)L)L)L)L)1   r.c|jdrP|j|dd}|j|dkDr|S|j|dk(r|j|s|S|S)a=Implements Step 5a from "An algorithm for suffix stripping" From the paper: Step 5a (m>1) E -> probate -> probat rate -> rate (m=1 and not *o) E -> cease -> ceas rr5r1)rQrTr>rO)r(r3r<s r,_step5azPorterStemmer._step5a]s`8 == ''c26D}}T"Q& }}T"a't0D  r.c:jddfdfgS)aImplements Step 5a from "An algorithm for suffix stripping" From the paper: Step 5b (m > 1 and *d and *L) -> single letter controll -> control roll -> roll llrpc2jdddkDS)NrFr1r@rs r,rtz'PorterStemmer._step5b..sDMM$s),Dq,Hr.rrIs``r,_step5bzPorterStemmer._step5bs($$ D#HIJ  r.c|r|jn|}|j|jk(r||jvr|j|S|j|jk7rt |dkr|S|j |}|j|}|j|}|j|}|j|}|j|}|j|}|j|}|S)zW :param to_lowercase: if `to_lowercase=True` the word always lowercase rE)lowerr$rr%r r:rbryrrrrrr)r(r3 to_lowercaser<s r,r<zPorterStemmer.stems ,tzz| 99,, ,1B99T? " 99// /CINK||D!||D!||D!{{4 {{4 {{4 ||D!||D! r.cy)Nzrr)r(s r,__repr__zPorterStemmer.__repr__s r.N)T)__name__ __module__ __qualname____doc__rrr r-r2r>rArCrJrOrTrZrbryrrrrrrr<rrrr.r,rrs2j(O+-+&;P*1'f'   *68 6R h0 dN2` 8: x"H  4!r.rcddlm}ddlm}|j }g}g}|j ddD]L}|j |D]6\}}|j||j|j|8Ndj|}tjdd|dzj}dj|} tjdd| dzj} td jd jdd jd dt| td jd jdd jd dt|tdy)z^ A demonstration of the porter stemmer on a sample from the Penn Treebank corpus. r)r<)treebankNrL z (.{,70})\sz\1\nz -Original-F*-z -Results-zF**********************************************************************)nltkr< nltk.corpusrrfileids tagged_wordsrjoinresubrstripprintcenterreplace) r<rstemmerorigstemmeditemr3tagresultsoriginals r,demorsN $  "G DG  "2A&/!..t4 /ID# KK  NN7<<- . // hhwGff]GWs];BBDGxx~HvvmWhn=DDFH ,  b ! ) )#s 3 ; ;C EF (O +  R ( (c 2 : :3 DE 'N (Or.)r __docformat__r nltk.stem.apirrrrrr.r,rs+( "O !HO !dr.
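The consonant definition and the measure m described in the docstrings above can be sketched as a small standalone example. This is a minimal illustration under stated assumptions, not the module's own code: the names `VOWELS`, `is_consonant`, and `measure` are hypothetical, the input is assumed to be lowercase already, and the functions stand apart from the `PorterStemmer` class.

```python
# Illustrative sketch only: these names are hypothetical and are not part of
# the PorterStemmer class. Input words are assumed to be lowercase.
VOWELS = frozenset("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant under the paper's definition."""
    if word[i] in VOWELS:
        return False
    if word[i] == "y":
        # 'y' is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Return m for a stem written in the form [C](VC){m}[V]."""
    # Map every letter to 'c' or 'v', e.g. "trouble" -> "ccvvccv".
    cv_sequence = "".join(
        "c" if is_consonant(stem, i) else "v" for i in range(len(stem))
    )
    # Each vowel-to-consonant boundary closes one VC group.
    return cv_sequence.count("vc")


if __name__ == "__main__":
    for example in ("tree", "by", "trouble", "ivy", "troubles", "orrery"):
        print(example, measure(example))
```

Counting `"vc"` boundaries works without first collapsing runs of identical letters, because a run of consonants or vowels contributes a boundary only where it meets a run of the other kind. The examples from the paper (m=0 for TREE and BY, m=1 for TROUBLE and IVY, m=2 for TROUBLES and ORRERY) can be checked against this sketch.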