"""Language Model Vocabulary"""

import sys
from collections import Counter
from collections.abc import Iterable
from functools import singledispatch
from itertools import chain


@singledispatch
def _dispatched_lookup(words, vocab):
    raise TypeError(f"Unsupported type for looking up in vocabulary: {type(words)}")


@_dispatched_lookup.register(Iterable)
def _(words, vocab):
    """Look up a sequence of words in the vocabulary.

    Returns a tuple of the looked up words.

    """
    return tuple(_dispatched_lookup(w, vocab) for w in words)


@_dispatched_lookup.register(str)
def _string_lookup(word, vocab):
    """Looks up one word in the vocabulary."""
    return word if word in vocab else vocab.unk_label


class Vocabulary:
    """Stores language model vocabulary.

    Satisfies two common language modeling requirements for a vocabulary:

    - When checking membership and calculating its size, filters items
      by comparing their counts to a cutoff value.
    - Adds a special "unknown" token which unseen words are mapped to.

    >>> words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
    >>> from nltk.lm import Vocabulary
    >>> vocab = Vocabulary(words, unk_cutoff=2)

    Tokens with counts greater than or equal to the cutoff value will
    be considered part of the vocabulary.

    >>> vocab['c']
    3
    >>> 'c' in vocab
    True
    >>> vocab['d']
    2
    >>> 'd' in vocab
    True

    Tokens with frequency counts less than the cutoff value will be considered not
    part of the vocabulary even though their entries in the count dictionary are
    preserved.

    >>> vocab['b']
    1
    >>> 'b' in vocab
    False
    >>> vocab['aliens']
    0
    >>> 'aliens' in vocab
    False

    Keeping the count entries for seen words allows us to change the cutoff value
    without having to recalculate the counts.

    >>> vocab2 = Vocabulary(vocab.counts, unk_cutoff=1)
    >>> "b" in vocab2
    True

    The cutoff value influences not only membership checking but also the result of
    getting the size of the vocabulary using the built-in `len`.
    Note that while the number of keys in the vocabulary's counter stays the same,
    the items in the vocabulary differ depending on the cutoff.
    We use `sorted` to demonstrate because it keeps the order consistent.

    >>> sorted(vocab2.counts)
    ['-', 'a', 'b', 'c', 'd', 'r']
    >>> sorted(vocab2)
    ['-', '<UNK>', 'a', 'b', 'c', 'd', 'r']
    >>> sorted(vocab.counts)
    ['-', 'a', 'b', 'c', 'd', 'r']
    >>> sorted(vocab)
    ['<UNK>', 'a', 'c', 'd']

    In addition to items it gets populated with, the vocabulary stores a special
    token that stands in for so-called "unknown" items. By default it's "<UNK>".

    >>> "<UNK>" in vocab
    True

    We can look up words in a vocabulary using its `lookup` method.
    "Unseen" words (with counts less than cutoff) are looked up as the unknown label.
    If given one word (a string) as an input, this method will return a string.

    >>> vocab.lookup("a")
    'a'
    >>> vocab.lookup("aliens")
    '<UNK>'

    If given a sequence, it will return a tuple of the looked up words.

    >>> vocab.lookup(["p", 'a', 'r', 'd', 'b', 'c'])
    ('<UNK>', 'a', '<UNK>', 'd', '<UNK>', 'c')

    It's possible to update the counts after the vocabulary has been created.
    In general, the interface is the same as that of `collections.Counter`.

    >>> vocab['b']
    1
    >>> vocab.update(["b", "b", "c"])
    >>> vocab['b']
    3
    """

    def __init__(self, counts=None, unk_cutoff=1, unk_label="<UNK>"):
        """Create a new Vocabulary.

        :param counts: Optional iterable or `collections.Counter` instance to
                       pre-seed the Vocabulary. In case it is iterable, counts
                       are calculated.
        :param int unk_cutoff: Words that occur less frequently than this value
                               are not considered part of the vocabulary.
        :param unk_label: Label for marking words not part of vocabulary.

        """
        self.unk_label = unk_label
        if unk_cutoff < 1:
            raise ValueError(f"Cutoff value cannot be less than 1. Got: {unk_cutoff}")
        self._cutoff = unk_cutoff

        self.counts = Counter()
        self.update(counts if counts is not None else "")

    @property
    def cutoff(self):
        """Cutoff value.

        Items with count below this value are not considered part of vocabulary.

        """
        return self._cutoff

    def update(self, *counter_args, **counter_kwargs):
        """Update vocabulary counts.

        Wraps `collections.Counter.update` method.

        """
        self.counts.update(*counter_args, **counter_kwargs)
        # Cache the size, since it depends on both the counts and the cutoff.
        self._len = sum(1 for _ in self)

    def lookup(self, words):
        """Look up one or more words in the vocabulary.

        If passed one word as a string will return that word or `self.unk_label`.
        Otherwise will assume it was passed a sequence of words, will try to look
        each of them up and return a tuple of the looked up words.

        :param words: Word(s) to look up.
        :type words: Iterable(str) or str
        :rtype: tuple(str) or str
        :raises: TypeError for types other than strings or iterables

        >>> from nltk.lm import Vocabulary
        >>> vocab = Vocabulary(["a", "b", "c", "a", "b"], unk_cutoff=2)
        >>> vocab.lookup("a")
        'a'
        >>> vocab.lookup("aliens")
        '<UNK>'
        >>> vocab.lookup(["a", "b", "c", ["x", "b"]])
        ('a', 'b', '<UNK>', ('<UNK>', 'b'))

        """
        return _dispatched_lookup(words, self)

    def __getitem__(self, item):
        # The unknown label always reports the cutoff as its count,
        # so it always passes the membership check.
        return self._cutoff if item == self.unk_label else self.counts[item]

    def __contains__(self, item):
        """Only consider items with counts GE to cutoff as being in the
        vocabulary."""
        return self[item] >= self.cutoff

    def __iter__(self):
        """Building on membership check define how to iterate over
        vocabulary."""
        return chain(
            (item for item in self.counts if item in self),
            [self.unk_label] if self.counts else [],
        )

    def __len__(self):
        """Computing size of vocabulary reflects the cutoff."""
        return self._len

    def __eq__(self, other):
        return (
            self.unk_label == other.unk_label
            and self.cutoff == other.cutoff
            and self.counts == other.counts
        )

    def __str__(self):
        return "<{} with cutoff={} unk_label='{}' and {} items>".format(
            self.__class__.__name__, self.cutoff, self.unk_label, len(self)
        )