"""Smoothing algorithms for language modeling.

According to Chen & Goodman 1995 these should work with both Backoff and
Interpolation.
"""

from operator import methodcaller

from nltk.lm.api import Smoothing
from nltk.probability import ConditionalFreqDist


def _count_values_gt_zero(distribution):
    """Count values that are greater than zero in a distribution.

    Assumes distribution is either a mapping with counts as values or
    an instance of `nltk.ConditionalFreqDist`.
    """
    as_count = (
        methodcaller("N")
        if isinstance(distribution, ConditionalFreqDist)
        else lambda count: count
    )
    return sum(
        1 for dist_or_count in distribution.values() if as_count(dist_or_count) > 0
    )


class WittenBell(Smoothing):
    """Witten-Bell smoothing."""

    def __init__(self, vocabulary, counter, **kwargs):
        super().__init__(vocabulary, counter, **kwargs)

    def alpha_gamma(self, word, context):
        gamma = self._gamma(context)
        return (1.0 - gamma) * self.counts[context].freq(word), gamma

    def _gamma(self, context):
        n_plus = _count_values_gt_zero(self.counts[context])
        return n_plus / (n_plus + self.counts[context].N())

    def unigram_score(self, word):
        return self.counts.unigrams.freq(word)


class AbsoluteDiscounting(Smoothing):
    """Smoothing with absolute discount."""

    def __init__(self, vocabulary, counter, discount=0.75, **kwargs):
        super().__init__(vocabulary, counter, **kwargs)
        self.discount = discount

    def alpha_gamma(self, word, context):
        alpha = (
            max(self.counts[context][word] - self.discount, 0)
            / self.counts[context].N()
        )
        gamma = self._gamma(context)
        return alpha, gamma

    def _gamma(self, context):
        n_plus = _count_values_gt_zero(self.counts[context])
        return (self.discount * n_plus) / self.counts[context].N()

    def unigram_score(self, word):
        return self.counts.unigrams.freq(word)


class KneserNey(Smoothing):
    """Kneser-Ney Smoothing.

    This is an extension of smoothing with a discount.

    Resources:
    - https://pages.ucsd.edu/~rlevy/lign256/winter2008/kneser_ney_mini_example.pdf
    - https://www.youtube.com/watch?v=ody1ysUTD7o
    - https://medium.com/@dennyc/a-simple-numerical-example-for-kneser-ney-smoothing-nlp-4600addf38b8
    - https://www.cl.uni-heidelberg.de/courses/ss15/smt/scribe6.pdf
    - https://www-i6.informatik.rwth-aachen.de/publications/download/951/Kneser-ICASSP-1995.pdf
    """

    def __init__(self, vocabulary, counter, order, discount=0.1, **kwargs):
        super().__init__(vocabulary, counter, **kwargs)
        self.discount = discount
        self._order = order

    def unigram_score(self, word):
        word_continuation_count, total_count = self._continuation_counts(word)
        return word_continuation_count / total_count

    def alpha_gamma(self, word, context):
        prefix_counts = self.counts[context]
        word_continuation_count, total_count = (
            (prefix_counts[word], prefix_counts.N())
            if len(context) + 1 == self._order
            else self._continuation_counts(word, context)
        )
        alpha = max(word_continuation_count - self.discount, 0.0) / total_count
        gamma = self.discount * _count_values_gt_zero(prefix_counts) / total_count
        return alpha, gamma

    def _continuation_counts(self, word, context=tuple()):
        """Count continuations that end with context and word.

        Continuations track unique ngram "types", regardless of
        how many instances were observed for each "type".
        This is different than raw ngram counts which track number
        of instances.
        """
        higher_order_ngrams_with_context = (
            counts
            for prefix_ngram, counts in self.counts[len(context) + 2].items()
            if prefix_ngram[1:] == context
        )
        higher_order_ngrams_with_word_count, total = 0, 0
        for counts in higher_order_ngrams_with_context:
            higher_order_ngrams_with_word_count += int(counts[word] > 0)
            total += _count_values_gt_zero(counts)
        return higher_order_ngrams_with_word_count, total
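The Witten-Bell interpolation weight computed by `WittenBell._gamma` is N1+ / (N1+ + N), where N1+ is the number of distinct word types seen after the context and N is the total token count. A minimal standalone sketch of that computation, using a plain `Counter` in place of NLTK's `ConditionalFreqDist` (the toy counts are invented for illustration, not taken from any corpus):

```python
from collections import Counter

# Toy counts of words following a single context (illustrative values only).
context_counts = Counter({"cat": 3, "dog": 1, "fish": 0})

# N1+: number of distinct word types actually observed after the context.
n_plus = sum(1 for c in context_counts.values() if c > 0)  # 2 types
# N: total number of tokens observed after the context.
total = sum(context_counts.values())  # 4 tokens

# Witten-Bell weight given to the lower-order distribution.
gamma = n_plus / (n_plus + total)
print(gamma)  # 2 / 6
```

The more distinct continuations a context has been seen with, the more probability mass Witten-Bell shifts to the lower-order model for that context.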
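`KneserNey._continuation_counts` scores a word by how many distinct higher-order ngram *types* it completes, not by raw frequency, as its docstring explains. The same idea in a self-contained sketch over hand-made bigram counts (a plain dict of `Counter`s standing in for `ConditionalFreqDist`; the corpus fragment is invented):

```python
from collections import Counter

# Invented bigram counts: 1-word context -> Counter of following words.
bigram_counts = {
    ("san",): Counter({"francisco": 5}),
    ("new",): Counter({"york": 3, "jersey": 2}),
    ("in",): Counter({"francisco": 1, "york": 2}),
}

def continuation_score(word):
    # How many distinct bigram types end in `word`...
    ending_in_word = sum(1 for c in bigram_counts.values() if c[word] > 0)
    # ...out of all distinct bigram types observed.
    total_types = sum(
        sum(1 for n in c.values() if n > 0) for c in bigram_counts.values()
    )
    return ending_in_word / total_types

# "francisco" is frequent (6 tokens) but completes only 2 of the 5
# observed bigram types, so its continuation score stays modest.
print(continuation_score("francisco"))  # 0.4
```

This is why Kneser-Ney handles words like "Francisco" well: they occur often but almost only after one context, so their unigram-level (continuation) probability is kept low.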