JL i8~dZddlZddlmZmZmZmZddlm Z m Z ddl m Z ddl mZGddZGd d eZGd d eZGd deZddZedk(rEddlZddlmZ edej.dzZ edej.dzZeeegdZy#e$rdZY0wxYw#e$rdZY$wxYw)a Tools to identify collocations --- words that often appear consecutively --- within corpora. They may also be used to find other associations between word occurrences. See Manning and Schutze ch. 5 at https://nlp.stanford.edu/fsnlp/promo/colloc.pdf and the Text::NSP Perl package at http://ngram.sourceforge.net Finding collocations requires first calculating the frequencies of words and their appearance in the context of other words. Often the collection of words will then requiring filtering to only retain useful content terms. Each ngram of words may then be scored according to some association measure, in order to determine the relative likelihood of each ngram being a collocation. The ``BigramCollocationFinder`` and ``TrigramCollocationFinder`` classes provide these functionalities, dependent on being provided a function which scores a ngram given appropriate frequency counts. A number of standard association measures are provided in bigram_measures and trigram_measures. N)BigramAssocMeasuresContingencyMeasuresQuadgramAssocMeasuresTrigramAssocMeasures)ranks_from_scoresspearman_correlation)FreqDist)ngramsceZdZdZdZe ddZedZedZ dfdZ d Z d Z d Z d Zd ZdZdZy)AbstractCollocationFindera An abstract base class for collocation finders whose purpose is to collect collocation candidate frequencies, filter and rank them. As a minimum, collocation finders require the frequencies of each word in a corpus, and the joint frequency of word tuples. This data should be provided through nltk.probability.FreqDist objects or an identical interface. cJ||_|j|_||_yN)word_fdNngram_fd)selfrrs W/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/nltk/collocations.py__init__z"AbstractCollocationFinder.__init__:s   Nc|f|dz z|r(tjjfd|DS|r(tjjfd|DSy)zU Pad the document with the place holder according to the window_size c3JK|]}tj|ywr _itertoolschain.0docpaddings r zAAbstractCollocationFinder._build_new_documents..Hs#236   g.2 #c3JK|]}tj|ywrrrs rr zAAbstractCollocationFinder._build_new_documents..Ls#236   #.2r!N)rr from_iterable)cls documents window_sizepad_left pad_right pad_symbolrs @r_build_new_documentsz.AbstractCollocationFinder._build_new_documents?sn-;?3 ##112:C2  ##112:C2  rc\|j|j||jdS)zConstructs a collocation finder given a collection of documents, each of which is a list (or iterable) of tokens. Tr() from_wordsr* default_ws)r$r%s rfrom_documentsz(AbstractCollocationFinder.from_documentsPs. ~~  $ $Y$ $ O  rcZtfdttdz DS)Nc3@K|]}t||zywr)tuple)rinwordss rr z.\s!OAeAA./Osr)r rangelen)r5r4s``r_ngram_freqdistz)AbstractCollocationFinder._ngram_freqdistZs!Os5zA~9NOOOrcy)NF)ngramfreqs rz"AbstractCollocationFinder.^srct}|jjD]\}}|||r|||<||_y)zGeneric filter removes ngrams from the frequency distribution if the function returns True when passed an ngram tuple. N)r ritems)rfn tmp_ngramr;r<s r _apply_filterz'AbstractCollocationFinder._apply_filter^sGJ ==..0 (KE4eT?#' %  (" rc.|jfdy)zARemoves candidate ngrams which have frequency less than min_freq.c|kSrr:)ngr<min_freqs rr=z=AbstractCollocationFinder.apply_freq_filter..js D8OrNrB)rrFs `rapply_freq_filterz+AbstractCollocationFinder.apply_freq_filterhs ;AbstractCollocationFinder.apply_ngram_filter..ps RrNrGrr@s `rapply_ngram_filterz,AbstractCollocationFinder.apply_ngram_filterls 01rc.|jfdy)zmRemoves candidate ngrams (w1, w2, ...) where any of (fn(w1), fn(w2), ...) evaluates to True. c,tfd|DS)Nc3.K|] }|ywrr:)rwr@s rr zPAbstractCollocationFinder.apply_word_filter....vs,?qRU,?s)anyrKs rr=z=AbstractCollocationFinder.apply_word_filter..vs,?B,?)?rNrGrMs `rapply_word_filterz+AbstractCollocationFinder.apply_word_filterrs ?@rc#fK|jD]}|j|g|}|||f yw)zbGenerates of (ngram, score) pairs as determined by the scoring function provided. N)r score_ngram)rscore_fntupscores r _score_ngramsz'AbstractCollocationFinder._score_ngramsxsB== !C$D$$X44E 5j  !s%1 1c<t|j|dS)zReturns a sequence of (ngram, score) pairs ordered from highest to lowest score, as determined by the scoring function provided. c|d |dfS)Nrrr:)ts rr=z8AbstractCollocationFinder.score_ngrams..sAaD5!A$-r)key)sortedrZ)rrWs r score_ngramsz&AbstractCollocationFinder.score_ngramssd((28OPPrc\|j|d|Dcgc]\}}| c}}Scc}}w)z;Returns the top n ngrams when scored by the given function.Nr`)rrWr4pss rnbestzAbstractCollocationFinder.nbests,"//9"1=>da>>>s (c#TK|j|D]\}}||kDr|yyw)z}Returns a sequence of ngrams, ordered by decreasing score, whose scores each exceed the given minimum score. Nrb)rrW min_scorer;rYs r above_scorez%AbstractCollocationFinder.above_scores6!--h7 LE5y    s&()FFN)__name__ __module__ __qualname____doc__r classmethodr*r/ staticmethodr8rBrHrNrTrZr`rerhr:rrr r /s|! QU   PP 9"=2 A !Q ?rr c4eZdZdZdZddZeddZdZy)BigramCollocationFinderzA tool for the finding and ranking of bigram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly. c@tj|||||_y)zConstruct a BigramCollocationFinder, given FreqDists for appearances of words and (possibly non-contiguous) bigrams. N)r rr&)rr bigram_fdr&s rrz BigramCollocationFinder.__init__s "**4)D&rct}t}|dkr tdt||dD]3}|d}| ||xxdz cc<|ddD]}||||fxxdz cc<5||||S) zConstruct a BigramCollocationFinder for all bigrams in the given sequence. When window_size > 2, count non-contiguous bigrams, in the style of Church and Hanks's (1990) association ratio. rqzSpecify window_size at least 2Tr,rNr)r&)r ValueErrorr )r$r5r&wfdbfdwindoww1w2s rr-z"BigramCollocationFinder.from_wordss jj ?=> >UK4@ 'FBz GqLGQRj '>RMQ&M '  '355rc|j}|j||f|jdz z }|sy|j|}|j|}||||f|S)zReturns the score for a given bigram using the given scoring function. Following Church and Hanks (1990), counts are scaled by a factor of 1/(window_size - 1). g?N)rrr&r)rrWryrzn_alln_iin_ixn_xis rrVz#BigramCollocationFinder.score_ngramsd }}b"X&$*:*:S*@A ||B||BtTlE22rN)rq rirjrkrlr.rrmr-rVr:rrrprps, J'66* 3rrpc8eZdZdZdZdZeddZdZdZ y) TrigramCollocationFinderzA tool for the finding and ranking of trigram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly. cNtj|||||_||_y)zConstruct a TrigramCollocationFinder, given FreqDists for appearances of words, bigrams, two words with any word between them, and trigrams. N)r r wildcard_fdrs)rrrsr trigram_fds rrz!TrigramCollocationFinder.__init__s% "**4*E&"rc|dkr tdt}t}t}t}t||dD]l}|d}| tj|dddD]F\} } ||xxdz cc<| ||| fxxdz cc<| (||| fxxdz cc<||| | fxxdz cc<Hn|||||S) z]Construct a TrigramCollocationFinder for all trigrams in the given sequence. rzSpecify window_size at least 3Tr,rNrrqrur r r combinations) r$r5r&rvwildfdrwtfdrxryrzw3s rr-z#TrigramCollocationFinder.from_wordss ?=> >jjjUK4@ 'FBz$11&*a@ 'BB1 :RH " :Bx A% RRL!Q&! ' '3VS))rcBt|j|jS)zConstructs a bigram collocation finder with the bigram and unigram data from this finder. Note that this does not include any filtering applied to this finder. )rprrs)rs r bigram_finderz&TrigramCollocationFinder.bigram_finders 't||T^^DDrc&|j}|j|||f}|sy|j||f}|j||f}|j||f} |j|} |j|} |j|} ||||| f| | | f|S)zXReturns the score for a given trigram using the given scoring function. N)rrrsrr) rrWryrzrr|n_iiin_iixn_ixin_xiin_ixxn_xixn_xxis rrVz$TrigramCollocationFinder.score_ngrams r2rl+ Bx(  "b*Bx( R  R  R ue4ueU6KUSSrN)r) rirjrkrlr.rrmr-rrVr:rrrrs3 J#**4ETrrc2eZdZdZdZdZeddZdZy)QuadgramCollocationFinderzA tool for the finding and ranking of quadgram collocations or other association measures. It is often useful to use from_words() rather than constructing an instance directly. c tj|||||_||_||_||_||_||_y)zConstruct a QuadgramCollocationFinder, given FreqDists for appearances of words, bigrams, trigrams, two words with one word and two words between them, three words with a word between them in both variations. N)r riiiiiixiixxiiixiixii) rr quadgram_fdrrrrrrs rrz"QuadgramCollocationFinder.__init__s@ "**4+F   rc b|dkr tdt}t}t}t}t}t}t} t} t||dD]} | d} |  tj| dddD]\} }}|| xxdz cc<| || | fxxdz cc<|)|| | |fxxdz cc<|| |fxxdz cc<|K|| | ||fxxdz cc<|| |fxxdz cc<| | ||fxxdz cc<| | | |fxxdz cc<|||||||| | S)NrzSpecify window_size at least 4Tr,rrrr)r$r5r&ixxxiiiirrrrrrrxryrzrw4s rr-z$QuadgramCollocationFinder.from_words!so ?=> >zz ZjjzzzUK4@ (FBz(55fQRj!D ( BRA :B8 ! :RRL!Q&!RH " :b"b"%&!+&b"X!#b"b\"a'"b"b\"a'" (  ((4r3T4>>rc N|j}|j||||f}|sy|j|||f}|j|||f} |j|||f} |j|||f} |j ||f} |j ||f} |j ||f}|j ||f}|j||f}|j ||f}|j|}|j|}|j|}|j|}|||| | | f| |||| |f||||f|Sr) rrrrrrrrr)rrWryrzrrr|n_iiiin_iiixn_xiiin_iixin_ixiin_iixxn_xxiin_xiixn_ixixn_ixxin_xixin_ixxxn_xixxn_xxixn_xxxis rrVz%QuadgramCollocationFinder.score_ngramDsWBB/0 2r2,'2r2,'BB<(BB<("b""b""b"2r(#B8$2r(#b!b!b!b!  VVV , VVVVV < VVV ,    rN)rrr:rrrr s-J  ? ?D rrc ddlm}m}m}| |j}| |j }ddlm}m}|jdfd}|jD]}|j|D cgc]} | j} } tj| } | jd| j|||| j!||| j!|} t#|t#d| j%|d D cgc]} d j'| c} t#d |j(d | d ycc} wcc} w)z=Finds bigram collocations in the files of the WebText corpus.r)rrrN) stopwordswebtextenglishcHt|dkxs|jvS)Nr)r7lower)rR ignored_wordss rr=zdemo..rsCFQJD!'')}*Drr  z Correlation to z: z0.4f) nltk.metricsrrrlikelihood_ratioraw_freq nltk.corpusrrr5fileidsrrpr-rHrTr`printrejoinri)scorercompare_scorerrrrrr word_filterfilewordr5cfcorrrXrs @rdemorbs1 ~$55,55.OOI.MDK! K*1--*=>$>> $ / / 6 Q [)# boof5 6 boon= >  d  dbhhvr.BCsSXXc]CD ">#:#:";2d4[IJ K>Ds ,EE __main__)rzBigramAssocMeasures.rrq)rprr)NN)rl itertoolsrrrrrrnltk.metrics.spearmanrrnltk.probabilityr nltk.utilr r rprrrrisysevalargvr IndexErrorr__all__r:rrrs2 J%ddN03703fAT8ATHR 9R jKL z0,sxx{:;4sxx{BC     s$&B%?B2%B/.B/2B<;B<