"""Various noun phrase extractors."""

import nltk

from textblob.base import BaseNPExtractor
from textblob.decorators import requires_nltk_corpus
from textblob.taggers import PatternTagger
from textblob.utils import filter_insignificant, tree2str


class ChunkParser(nltk.ChunkParserI):
    def __init__(self):
        self._trained = False

    @requires_nltk_corpus
    def train(self):
        """Train the Chunker on the ConLL-2000 corpus."""
        train_data = [
            [(t, c) for _, t, c in nltk.chunk.tree2conlltags(sent)]
            for sent in nltk.corpus.conll2000.chunked_sents(
                "train.txt", chunk_types=["NP"]
            )
        ]
        unigram_tagger = nltk.UnigramTagger(train_data)
        self.tagger = nltk.BigramTagger(train_data, backoff=unigram_tagger)
        self._trained = True

    def parse(self, sentence):
        """Return the parse tree for a POS-tagged sentence."""
        if not self._trained:
            self.train()
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [
            (word, pos, chunktag)
            for ((word, pos), chunktag) in zip(sentence, chunktags)
        ]
        return nltk.chunk.util.conlltags2tree(conlltags)


class ConllExtractor(BaseNPExtractor):
    """A noun phrase extractor that uses chunk parsing trained with the
    ConLL-2000 training corpus.
    """

    POS_TAGGER = PatternTagger()

    # The context-free grammar with which to filter the noun phrases
    CFG = {
        ("NNP", "NNP"): "NNP",
        ("NN", "NN"): "NNI",
        ("NNI", "NN"): "NNI",
        ("JJ", "JJ"): "JJ",
        ("JJ", "NN"): "NNI",
    }

    # POS suffixes that will be ignored
    INSIGNIFICANT_SUFFIXES = ["DT", "CC", "PRP$", "PRP"]

    def __init__(self, parser=None):
        self.parser = ChunkParser() if not parser else parser

    def extract(self, text):
        """Return a list of noun phrases (strings) for a body of text."""
        sentences = nltk.tokenize.sent_tokenize(text)
        noun_phrases = []
        for sentence in sentences:
            parsed = self._parse_sentence(sentence)
            # Keep each subtree that is an NP, is non-trivial once
            # insignificant tags are dropped, and matches the CFG.
            phrases = [
                _normalize_tags(
                    filter_insignificant(each, self.INSIGNIFICANT_SUFFIXES)
                )
                for each in parsed
                if isinstance(each, nltk.tree.Tree)
                and each.label() == "NP"
                and len(filter_insignificant(each)) >= 1
                and _is_match(each, cfg=self.CFG)
            ]
            nps = [tree2str(phrase) for phrase in phrases]
            noun_phrases.extend(nps)
        return noun_phrases

    def _parse_sentence(self, sentence):
        """Tag and parse a sentence (a plain, untagged string)."""
        tagged = self.POS_TAGGER.tag(sentence)
        return self.parser.parse(tagged)


class FastNPExtractor(BaseNPExtractor):
    """A fast and simple noun phrase extractor.

    Credit to Shlomi Babluk. Link to original blog post:

        http://thetokenizer.com/2013/05/09/efficient-way-to-extract-the-main-topics-of-a-sentence/
    """

    CFG = {
        ("NNP", "NNP"): "NNP",
        ("NN", "NN"): "NNI",
        ("NNI", "NN"): "NNI",
        ("JJ", "JJ"): "JJ",
        ("JJ", "NN"): "NNI",
    }

    def __init__(self):
        self._trained = False

    @requires_nltk_corpus
    def train(self):
        """Train the tagger on the Brown corpus (news category)."""
        train_data = nltk.corpus.brown.tagged_sents(categories="news")
        regexp_tagger = nltk.RegexpTagger(
            [
                (r"^-?[0-9]+(.[0-9]+)?$", "CD"),
                (r"(-|:|;)$", ":"),
                (r"\'*$", "MD"),
                (r"(The|the|A|a|An|an)$", "AT"),
                (r".*able$", "JJ"),
                (r"^[A-Z].*$", "NNP"),
                (r".*ness$", "NN"),
                (r".*ly$", "RB"),
                (r".*s$", "NNS"),
                (r".*ing$", "VBG"),
                (r".*ed$", "VBD"),
                (r".*", "NN"),
            ]
        )
        unigram_tagger = nltk.UnigramTagger(train_data, backoff=regexp_tagger)
        self.tagger = nltk.BigramTagger(train_data, backoff=unigram_tagger)
        self._trained = True

    def _tokenize_sentence(self, sentence):
        """Split the sentence into single words/tokens."""
        tokens = nltk.word_tokenize(sentence)
        return tokens

    def extract(self, sentence):
        """Return a list of noun phrases (strings) for a body of text."""
        if not self._trained:
            self.train()
        tokens = self._tokenize_sentence(sentence)
        tagged = self.tagger.tag(tokens)
        tags = _normalize_tags(tagged)
        merge = True
        while merge:
            merge = False
            # Repeatedly merge adjacent (tag, tag) pairs that appear in the CFG.
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = t1[1], t2[1]
                value = self.CFG.get(key, "")
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = "{} {}".format(t1[0], t2[0])
                    tags.insert(x, (match, value))
                    break
        matches = [t[0] for t in tags if t[1] in ["NNP", "NNI"]]
        return matches


### Utility methods ###


def _normalize_tags(chunk):
    """Normalize the corpus tags.

    ("NN", "NN-PL", "NNS") -> "NN"
    """
    ret = []
    for word, tag in chunk:
        if tag == "NP-TL" or tag == "NP":
            ret.append((word, "NNP"))
            continue
        if tag.endswith("-TL"):
            ret.append((word, tag[:-3]))
            continue
        if tag.endswith("S"):
            ret.append((word, tag[:-1]))
            continue
        ret.append((word, tag))
    return ret


def _is_match(tagged_phrase, cfg):
    """Return whether or not a tagged phrase matches a context-free grammar."""
    copy = list(tagged_phrase)  # Work on a copy so the caller's list is untouched
    merge = True
    while merge:
        merge = False
        for i in range(len(copy) - 1):
            first, second = copy[i], copy[i + 1]
            key = first[1], second[1]  # Tuple of tags, e.g. ('NN', 'NN')
            value = cfg.get(key, None)
            if value:
                merge = True
                copy.pop(i)
                copy.pop(i)
                match = "{} {}".format(first[0], second[0])
                copy.insert(i, (match, value))
                break
    return any(t[1] in ("NNP", "NNI") for t in copy)
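

# Example usage -- a minimal sketch rather than part of the library API. It
# shows that both extractors expose the same ``extract(text)`` method and
# return a list of noun-phrase strings. It assumes the NLTK data the
# extractors train on has been downloaded ("brown" and the punkt tokenizer
# models for FastNPExtractor; "conll2000" as well for ConllExtractor).
if __name__ == "__main__":
    sample = "Python is a widely used high-level programming language."

    fast = FastNPExtractor()
    print(fast.extract(sample))  # noun phrases from the Brown/regex-trained tagger

    conll = ConllExtractor()
    print(conll.extract(sample))  # noun phrases from the ConLL-2000 chunk parser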