"""
Language Model Counter
----------------------
"""

from collections import defaultdict
from collections.abc import Sequence

from nltk.probability import ConditionalFreqDist, FreqDist


class NgramCounter:
    """Class for counting ngrams.

    Will count any ngram sequence you give it ;)

    First we need to make sure we are feeding the counter sentences of ngrams.

    >>> text = [["a", "b", "c", "d"], ["a", "c", "d", "c"]]
    >>> from nltk.util import ngrams
    >>> text_bigrams = [ngrams(sent, 2) for sent in text]
    >>> text_unigrams = [ngrams(sent, 1) for sent in text]

    The counting itself is very simple.

    >>> from nltk.lm import NgramCounter
    >>> ngram_counts = NgramCounter(text_bigrams + text_unigrams)

    You can conveniently access ngram counts using standard Python dictionary
    notation. String keys will give you unigram counts.

    >>> ngram_counts['a']
    2
    >>> ngram_counts['aliens']
    0

    If you want to access counts for higher order ngrams, use a list or a tuple.
    These are treated as "context" keys, so what you get is a frequency
    distribution over all continuations after the given context.

    >>> sorted(ngram_counts[['a']].items())
    [('b', 1), ('c', 1)]
    >>> sorted(ngram_counts[('a',)].items())
    [('b', 1), ('c', 1)]

    This is equivalent to explicitly specifying the order of the ngram (in this
    case 2 for bigram) and indexing on the context.

    >>> ngram_counts[2][('a',)] is ngram_counts[['a']]
    True

    Note that the keys in `ConditionalFreqDist` cannot be lists, only tuples!
    It is generally advisable to use the less verbose and more flexible square
    bracket notation.

    To get the count of the full ngram "a b", do this:

    >>> ngram_counts[['a']]['b']
    1

    Specifying the ngram order as a number can be useful for accessing all
    ngrams of that order.

    >>> ngram_counts[2]
    <ConditionalFreqDist with 4 conditions>

    The keys of this `ConditionalFreqDist` are the contexts we discussed
    earlier. Unigrams can also be accessed with a human-friendly alias.

    >>> ngram_counts.unigrams is ngram_counts[1]
    True

    Similarly to `collections.Counter`, you can update counts after
    initialization.

    >>> ngram_counts['e']
    0
    >>> ngram_counts.update([ngrams(["d", "e", "f"], 1)])
    >>> ngram_counts['e']
    1

    """

    def __init__(self, ngram_text=None):
        """Creates a new NgramCounter.

        If `ngram_text` is specified, counts ngrams from it, otherwise waits
        for the `update` method to be called explicitly.

        :param ngram_text: Optional text containing sentences of ngrams, as for
            the `update` method.
        :type ngram_text: Iterable(Iterable(tuple(str))) or None

        """
        # Orders >= 2 map context -> distribution over the next word.
        self._counts = defaultdict(ConditionalFreqDist)
        # Order 1 has no context, so it is a plain frequency distribution.
        self._counts[1] = self.unigrams = FreqDist()

        if ngram_text:
            self.update(ngram_text)

    def update(self, ngram_text):
        """Updates ngram counts from `ngram_text`.

        Expects `ngram_text` to be a sequence of sentences (sequences).
        Each sentence consists of ngrams as tuples of strings.

        :param Iterable(Iterable(tuple(str))) ngram_text: Text containing
            sentences of ngrams.
        :raises TypeError: if the ngrams are not tuples.

        """
        for sent in ngram_text:
            for ngram in sent:
                if not isinstance(ngram, tuple):
                    raise TypeError(
                        "Ngram <{}> isn't a tuple, but {}".format(ngram, type(ngram))
                    )

                ngram_order = len(ngram)
                if ngram_order == 1:
                    self.unigrams[ngram[0]] += 1
                    continue

                # Split the ngram into its context and the final word.
                context, word = ngram[:-1], ngram[-1]
                self[ngram_order][context][word] += 1

    def N(self):
        """Returns grand total number of ngrams stored.

        This includes ngrams from all orders, so some duplication is expected.

        :rtype: int

        >>> from nltk.lm import NgramCounter
        >>> counts = NgramCounter([[("a", "b"), ("c",), ("d", "e")]])
        >>> counts.N()
        3

        """
        return sum(val.N() for val in self._counts.values())

    def __getitem__(self, item):
        # NOTE: this body did not survive extraction. It is reconstructed to
        # match the indexing behaviour documented in the class docstring.
        if isinstance(item, int):
            # An integer selects all counts of that ngram order.
            return self._counts[item]
        elif isinstance(item, str):
            # A string is a unigram lookup.
            return self._counts.__getitem__(1)[item]
        elif isinstance(item, Sequence):
            # A list or tuple is a context; its order is len(context) + 1.
            return self._counts.__getitem__(len(item) + 1)[tuple(item)]

    def __str__(self):
        # NOTE: the format string did not survive extraction. Only the three
        # arguments are recoverable, so the wording here is an assumption.
        return "<{} with {} ngram orders and {} ngrams>".format(
            self.__class__.__name__, len(self._counts), self.N()
        )

    def __len__(self):
        return self._counts.__len__()

    def __contains__(self, item):
        return item in self._counts
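The counting scheme described in the `update` docstring (unigrams in a flat distribution, higher orders indexed by order, then context, then word) can be illustrated without NLTK. The following is a minimal stdlib-only sketch, not the library's implementation: `count_ngrams` and `bigrams` are hypothetical helper names, and `collections.Counter` stands in for `FreqDist` and `ConditionalFreqDist`.

```python
from collections import Counter, defaultdict


def count_ngrams(ngram_text):
    """Count ngrams into (unigrams, higher).

    higher[order][context][word] holds the count of `context + (word,)`.
    Hypothetical helper: mirrors the documented scheme, not nltk's API.
    """
    unigrams = Counter()
    higher = defaultdict(lambda: defaultdict(Counter))
    for sent in ngram_text:
        for ngram in sent:
            if not isinstance(ngram, tuple):
                raise TypeError(f"Ngram <{ngram}> isn't a tuple, but {type(ngram)}")
            if len(ngram) == 1:
                unigrams[ngram[0]] += 1
                continue
            # Everything but the last token is the context.
            context, word = ngram[:-1], ngram[-1]
            higher[len(ngram)][context][word] += 1
    return unigrams, higher


def bigrams(sent):
    """Hypothetical stand-in for nltk.util.ngrams(sent, 2)."""
    return list(zip(sent, sent[1:]))


text = [["a", "b", "c", "d"], ["a", "c", "d", "c"]]
ngram_text = [bigrams(s) for s in text] + [[(w,) for w in s] for s in text]
unigrams, higher = count_ngrams(ngram_text)

# Distribution over continuations of the context ("a",).
continuations = higher[2][("a",)]
# Relative frequency of "b" after "a": count("a b") / count("a *").
mle = continuations["b"] / sum(continuations.values())
```

Indexing by context first makes the denominator of a relative-frequency estimate a single sum over one small counter, which is why the counts are nested this way rather than keyed by the full ngram.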