"""
Utility methods for Sentiment Analysis.
"""

import codecs
import csv
import json
import pickle
import random
import re
import sys
import time
from copy import deepcopy

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader
from nltk.data import load
from nltk.tokenize import PunktTokenizer
from nltk.tokenize.casual import EMOTICON_RE

NEGATION = r"""
    (?:
        ^(?:never|no|nothing|nowhere|noone|none|not|
            havent|hasnt|hadnt|cant|couldnt|shouldnt|
            wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint
        )$
    )
    |
    n't"""

NEGATION_RE = re.compile(NEGATION, re.VERBOSE)

CLAUSE_PUNCT = r"^[.:;!?]$"
CLAUSE_PUNCT_RE = re.compile(CLAUSE_PUNCT)

# Happy and sad emoticons
HAPPY = {
    ":-)", ":)", ";)", ":o)", ":]", ":3", ":c)", ":>", "=]", "8)", "=)", ":}",
    ":^)", ":-D", ":D", "8-D", "8D", "x-D", "xD", "X-D", "XD", "=-D", "=D",
    "=-3", "=3", ":-))", ":'-)", ":')", ":*", ":^*", ">:P", ":-P", ":P", "X-P",
    "x-p", "xp", "XP", ":-p", ":p", "=p", ":-b", ":b", ">:)", ">;)", ">:-)",
    "<3",
}

SAD = {
    ":L", ":-/", ">:/", ":S", ">:[", ":@", ":-(", ":[", ":-||", "=L", ":<",
    ":-[", ":-<", "=\\", "=/", ">:(", ":(", ">.<", ":'-(", ":'(", ":\\", ":-c",
    ":c", ":{", ">:\\", ";(",
}


def timer(method):
    """
    A timer decorator to measure execution performance of methods.
    """

    def timed(*args, **kw):
        start = time.time()
        result = method(*args, **kw)
        end = time.time()
        tot_time = end - start
        hours = tot_time // 3600
        mins = tot_time // 60 % 60
        # in Python 2.x round() will return a float, so we convert it to int
        secs = int(round(tot_time % 60))
        if hours == 0 and mins == 0 and secs < 10:
            print(f"[TIMER] {method.__name__}(): {tot_time:.3f} seconds")
        else:
            print(f"[TIMER] {method.__name__}(): {hours}h {mins}m {secs}s")
        return result

    return timed


def extract_unigram_feats(document, unigrams, handle_negation=False):
    """
    Populate a dictionary of unigram features, reflecting the presence/absence in
    the document of each of the tokens in `unigrams`.

    :param document: a list of words/tokens.
    :param unigrams: a list of words/tokens whose presence/absence has to be
        checked in `document`.
    :param handle_negation: if `handle_negation == True` apply `mark_negation`
        method to `document` before checking for unigram presence/absence.
    :return: a dictionary of unigram features {unigram : boolean}.

    >>> words = ['ice', 'police', 'riot']
    >>> document = 'ice is melting due to global warming'.split()
    >>> sorted(extract_unigram_feats(document, words).items())
    [('contains(ice)', True), ('contains(police)', False), ('contains(riot)', False)]
    """
    features = {}
    if handle_negation:
        document = mark_negation(document)
    for word in unigrams:
        features[f"contains({word})"] = word in set(document)
    return features


def extract_bigram_feats(document, bigrams):
    """
    Populate a dictionary of bigram features, reflecting the presence/absence in
    the document of each of the tokens in `bigrams`. This extractor function only
    considers contiguous bigrams obtained by `nltk.bigrams`.

    :param document: a list of words/tokens.
    :param bigrams: a list of bigrams whose presence/absence has to be checked
        in `document`.
    :return: a dictionary of bigram features {bigram : boolean}.

    >>> bigrams = [('global', 'warming'), ('police', 'prevented'), ('love', 'you')]
    >>> document = 'ice is melting due to global warming'.split()
    >>> sorted(extract_bigram_feats(document, bigrams).items()) # doctest: +NORMALIZE_WHITESPACE
    [('contains(global - warming)', True), ('contains(love - you)', False),
    ('contains(police - prevented)', False)]
    """
    features = {}
    for bigr in bigrams:
        features[f"contains({bigr[0]} - {bigr[1]})"] = bigr in nltk.bigrams(document)
    return features


def mark_negation(document, double_neg_flip=False, shallow=False):
    """
    Append _NEG suffix to words that appear in the scope between a negation
    and a punctuation mark.

    :param document: a list of words/tokens, or a tuple (words, label).
    :param shallow: if True, the method will modify the original document in place.
    :param double_neg_flip: if True, double negation is considered affirmation
        (we activate/deactivate negation scope every time we find a negation).
    :return: if `shallow == True` the method will modify the original document
        and return it. If `shallow == False` the method will return a modified
        document, leaving the original unmodified.

    >>> sent = "I didn't like this movie . It was bad .".split()
    >>> mark_negation(sent)
    ['I', "didn't", 'like_NEG', 'this_NEG', 'movie_NEG', '.', 'It', 'was', 'bad', '.']
    """
    if not shallow:
        document = deepcopy(document)
    # check if the document is labeled, i.e. (words, label)
    labeled = document and isinstance(document[0], (tuple, list))
    if labeled:
        doc = document[0]
    else:
        doc = document
    neg_scope = False
    for i, word in enumerate(doc):
        if NEGATION_RE.search(word):
            if not neg_scope or (neg_scope and double_neg_flip):
                neg_scope = not neg_scope
                continue
            else:
                doc[i] += "_NEG"
        elif neg_scope and CLAUSE_PUNCT_RE.search(word):
            neg_scope = not neg_scope
        elif neg_scope and not CLAUSE_PUNCT_RE.search(word):
            doc[i] += "_NEG"

    return document


def output_markdown(filename, **kwargs):
    """
    Write the output of an analysis to a file.
    """
    with codecs.open(filename, "at") as outfile:
        text = "\n*** \n\n"
        text += "{} \n\n".format(time.strftime("%d/%m/%Y, %H:%M"))
        for k in sorted(kwargs):
            if isinstance(kwargs[k], dict):
                dictionary = kwargs[k]
                text += f"  - **{k}:**\n"
                for entry in sorted(dictionary):
                    text += f"    - {entry}: {dictionary[entry]} \n"
            elif isinstance(kwargs[k], list):
                text += f"  - **{k}:**\n"
                for entry in kwargs[k]:
                    text += f"    - {entry}\n"
            else:
                text += f"  - **{k}:** {kwargs[k]} \n"
        outfile.write(text)


def split_train_test(all_instances, n=None):
    """
    Randomly split `n` instances of the dataset into train and test sets.

    :param all_instances: a list of instances (e.g. documents) that will be split.
    :param n: the number of instances to consider (in case we want to use only a
        subset).
    :return: two lists of instances. Train set is 8/10 of the total and test set
        is 2/10 of the total.
    """
    random.seed(12345)
    random.shuffle(all_instances)
    if not n or n > len(all_instances):
        n = len(all_instances)
    train_set = all_instances[: int(0.8 * n)]
    test_set = all_instances[int(0.8 * n) : n]

    return train_set, test_set


def save_file(content, filename):
    """
    Store `content` in `filename`. Can be used to store a SentimentAnalyzer.
    """
    print("Saving", filename, file=sys.stderr)
    with codecs.open(filename, "wb") as storage_file:
        # The protocol=2 parameter is for python2 compatibility
        pickle.dump(content, storage_file, protocol=2)


def _show_plot(x_values, y_values, x_labels=None, y_labels=None):
    try:
        import matplotlib.pyplot as plt
    except ImportError as e:
        raise ImportError(
            "The plot function requires matplotlib to be installed."
            "See https://matplotlib.org/"
        ) from e

    plt.locator_params(axis="y", nbins=3)
    axes = plt.axes()
    axes.yaxis.grid()
    plt.plot(x_values, y_values, "ro", color="red")
    plt.ylim(ymin=-1.2, ymax=1.2)
    plt.tight_layout(pad=5)
    if x_labels:
        plt.xticks(x_values, x_labels, rotation="vertical")
    if y_labels:
        plt.yticks([-1, 0, 1], y_labels, rotation="horizontal")
    # Pad margins so that markers are not clipped by the axes
    plt.margins(0.2)
    plt.show()


#####################################################################
# PARSING AND CONVERSION FUNCTIONS
#####################################################################


def json2csv_preprocess(
    json_file,
    outfile,
    fields,
    encoding="utf8",
    errors="replace",
    gzip_compress=False,
    skip_retweets=True,
    skip_tongue_tweets=True,
    skip_ambiguous_tweets=True,
    strip_off_emoticons=True,
    remove_duplicates=True,
    limit=None,
):
    """
    Convert json file to csv file, preprocessing each row to obtain a suitable
    dataset for tweets Semantic Analysis.

    :param json_file: the original json file containing tweets.
    :param outfile: the output csv filename.
    :param fields: a list of fields that will be extracted from the json file and
        kept in the output csv file.
    :param encoding: the encoding of the files.
    :param errors: the error handling strategy for the output writer.
    :param gzip_compress: if True, create a compressed GZIP file.

    :param skip_retweets: if True, remove retweets.
    :param skip_tongue_tweets: if True, remove tweets containing ":P" and ":-P"
        emoticons.
    :param skip_ambiguous_tweets: if True, remove tweets containing both happy
        and sad emoticons.
    :param strip_off_emoticons: if True, strip off emoticons from all tweets.
    :param remove_duplicates: if True, remove tweets appearing more than once.
    :param limit: an integer to set the number of tweets to convert. After the
        limit is reached the conversion will stop. It can be useful to create
        subsets of the original tweets json data.
    """
    with codecs.open(json_file, encoding=encoding) as fp:
        (writer, outf) = _outf_writer(outfile, encoding, errors, gzip_compress)
        # write the list of fields as header
        writer.writerow(fields)

        if remove_duplicates == True:
            tweets_cache = []
        i = 0
        for line in fp:
            tweet = json.loads(line)
            row = extract_fields(tweet, fields)
            try:
                text = row[fields.index("text")]
                # Remove retweets
                if skip_retweets == True:
                    if re.search(r"\bRT\b", text):
                        continue
                # Remove tweets containing ":P" and ":-P" emoticons
                if skip_tongue_tweets == True:
                    if re.search(r"\:\-?P\b", text):
                        continue
                # Remove tweets containing both happy and sad emoticons
                if skip_ambiguous_tweets == True:
                    all_emoticons = EMOTICON_RE.findall(text)
                    if all_emoticons:
                        if (set(all_emoticons) & HAPPY) and (set(all_emoticons) & SAD):
                            continue
                # Strip off emoticons from all tweets
                if strip_off_emoticons == True:
                    row[fields.index("text")] = re.sub(
                        r"(?!\n)\s+", " ", EMOTICON_RE.sub("", text)
                    )
                # Remove duplicate tweets
                if remove_duplicates == True:
                    if row[fields.index("text")] in tweets_cache:
                        continue
                    else:
                        tweets_cache.append(row[fields.index("text")])
            except ValueError:
                pass
            writer.writerow(row)
            i += 1
            if limit and i >= limit:
                break
        outf.close()


def parse_tweets_set(
    filename, label, word_tokenizer=None, sent_tokenizer=None, skip_header=True
):
    """
    Parse csv file containing tweets and output data a list of (text, label) tuples.

    :param filename: the input csv filename.
    :param label: the label to be appended to each tweet contained in the csv file.
    :param word_tokenizer: the tokenizer instance that will be used to tokenize
        each sentence of the tweet (e.g. WordPunctTokenizer() or BlanklineTokenizer()).
        If no word_tokenizer is specified, tweets will not be tokenized.
    :param sent_tokenizer: the tokenizer that will be used to split each tweet into
        sentences.
    :param skip_header: if True, skip the first line of the csv file (which usually
        contains headers).

    :return: a list of (text, label) tuples.
    """
    tweets = []
    if not sent_tokenizer:
        sent_tokenizer = PunktTokenizer()

    with codecs.open(filename, "rt") as csvfile:
        reader = csv.reader(csvfile)
        if skip_header == True:
            next(reader, None)  # skip the header
        i = 0
        for tweet_id, text in reader:
            i += 1
            sys.stdout.write(f"Loaded {i} tweets\r")
            # Apply sentence and word tokenizer to text
            if word_tokenizer:
                tweet = [
                    w
                    for sent in sent_tokenizer.tokenize(text)
                    for w in word_tokenizer.tokenize(sent)
                ]
            else:
                tweet = text
            tweets.append((tweet, label))

    print(f"Loaded {i} tweets")
    return tweets


#####################################################################
# DEMOS
#####################################################################


def demo_tweets(trainer, n_instances=None, output=None):
    """
    Train and test Naive Bayes classifier on 10000 tweets, tokenized using
    TweetTokenizer.
    Features are composed of:

    - 1000 most frequent unigrams
    - 100 top bigrams (using BigramAssocMeasures.pmi)

    :param trainer: `train` method of a classifier.
    :param n_instances: the number of total tweets that have to be used for
        training and testing. Tweets will be equally split between positive and
        negative.
    :param output: the output file where results have to be reported.
    """
    from nltk.corpus import stopwords, twitter_samples
    from nltk.sentiment import SentimentAnalyzer
    from nltk.tokenize import TweetTokenizer

    # Different customizations for the TweetTokenizer
    tokenizer = TweetTokenizer(preserve_case=False)

    if n_instances is not None:
        n_instances = int(n_instances / 2)

    fields = ["id", "text"]
    positive_json = twitter_samples.abspath("positive_tweets.json")
    positive_csv = "positive_tweets.csv"
    json2csv_preprocess(positive_json, positive_csv, fields, limit=n_instances)

    negative_json = twitter_samples.abspath("negative_tweets.json")
    negative_csv = "negative_tweets.csv"
    json2csv_preprocess(negative_json, negative_csv, fields, limit=n_instances)

    neg_docs = parse_tweets_set(negative_csv, label="neg", word_tokenizer=tokenizer)
    pos_docs = parse_tweets_set(positive_csv, label="pos", word_tokenizer=tokenizer)

    # We separately split subjective and objective instances to keep a balanced
    # uniform class distribution in both train and test sets.
    train_pos_docs, test_pos_docs = split_train_test(pos_docs)
    train_neg_docs, test_neg_docs = split_train_test(neg_docs)

    training_tweets = train_pos_docs + train_neg_docs
    testing_tweets = test_pos_docs + test_neg_docs

    sentim_analyzer = SentimentAnalyzer()
    all_words = [word for word in sentim_analyzer.all_words(training_tweets)]

    # Add simple unigram word features
    unigram_feats = sentim_analyzer.unigram_word_feats(all_words, top_n=1000)
    sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

    # Add bigram collocation features
    bigram_collocs_feats = sentim_analyzer.bigram_collocation_feats(
        [tweet[0] for tweet in training_tweets], top_n=100, min_freq=12
    )
    sentim_analyzer.add_feat_extractor(
        extract_bigram_feats, bigrams=bigram_collocs_feats
    )

    training_set = sentim_analyzer.apply_features(training_tweets)
    test_set = sentim_analyzer.apply_features(testing_tweets)

    classifier = sentim_analyzer.train(trainer, training_set)
    try:
        classifier.show_most_informative_features()
    except AttributeError:
        print(
            "Your classifier does not provide a show_most_informative_features() method."
        )
    results = sentim_analyzer.evaluate(test_set)

    if output:
        extr = [f.__name__ for f in sentim_analyzer.feat_extractors]
        output_markdown(
            output,
            Dataset="labeled_tweets",
            Classifier=type(classifier).__name__,
            Tokenizer=tokenizer.__class__.__name__,
            Feats=extr,
            Results=results,
            Instances=n_instances,
        )


def demo_movie_reviews(trainer, n_instances=None, output=None):
    """
    Train classifier on all instances of the Movie Reviews dataset.
    The corpus has been preprocessed using the default sentence tokenizer and
    WordPunctTokenizer.
    Features are composed of:

    - most frequent unigrams

    :param trainer: `train` method of a classifier.
    :param n_instances: the number of total reviews that have to be used for
        training and testing. Reviews will be equally split between positive and
        negative.
    :param output: the output file where results have to be reported.
    """
    from nltk.corpus import movie_reviews
    from nltk.sentiment import SentimentAnalyzer

    if n_instances is not None:
        n_instances = int(n_instances / 2)

    pos_docs = [
        (list(movie_reviews.words(pos_id)), "pos")
        for pos_id in movie_reviews.fileids("pos")[:n_instances]
    ]
    neg_docs = [
        (list(movie_reviews.words(neg_id)), "neg")
        for neg_id in movie_reviews.fileids("neg")[:n_instances]
    ]
    # We separately split positive and negative instances to keep a balanced
    # uniform class distribution in both train and test sets.
    train_pos_docs, test_pos_docs = split_train_test(pos_docs)
    train_neg_docs, test_neg_docs = split_train_test(neg_docs)

    training_docs = train_pos_docs + train_neg_docs
    testing_docs = test_pos_docs + test_neg_docs

    sentim_analyzer = SentimentAnalyzer()
    all_words = sentim_analyzer.all_words(training_docs)

    # Add simple unigram word features
    unigram_feats = sentim_analyzer.unigram_word_feats(all_words, min_freq=4)
    sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)
    # Apply features to obtain a feature-value representation of our datasets
    training_set = sentim_analyzer.apply_features(training_docs)
    test_set = sentim_analyzer.apply_features(testing_docs)

    classifier = sentim_analyzer.train(trainer, training_set)
    try:
        classifier.show_most_informative_features()
    except AttributeError:
        print(
            "Your classifier does not provide a show_most_informative_features() method."
        )
    results = sentim_analyzer.evaluate(test_set)

    if output:
        extr = [f.__name__ for f in sentim_analyzer.feat_extractors]
        output_markdown(
            output,
            Dataset="Movie_reviews",
            Classifier=type(classifier).__name__,
            Tokenizer="WordPunctTokenizer",
            Feats=extr,
            Results=results,
            Instances=n_instances,
        )


def demo_subjectivity(trainer, save_analyzer=False, n_instances=None, output=None):
    """
    Train and test a classifier on instances of the Subjective Dataset by Pang and
    Lee. The dataset is made of 5000 subjective and 5000 objective sentences.
    All tokens (words and punctuation marks) are separated by a whitespace, so
    we use the basic WhitespaceTokenizer to parse the data.

    :param trainer: `train` method of a classifier.
    :param save_analyzer: if `True`, store the SentimentAnalyzer in a pickle file.
    :param n_instances: the number of total sentences that have to be used for
        training and testing. Sentences will be equally split between positive
        and negative.
    :param output: the output file where results have to be reported.
    """
    from nltk.corpus import subjectivity
    from nltk.sentiment import SentimentAnalyzer

    if n_instances is not None:
        n_instances = int(n_instances / 2)

    subj_docs = [
        (sent, "subj") for sent in subjectivity.sents(categories="subj")[:n_instances]
    ]
    obj_docs = [
        (sent, "obj") for sent in subjectivity.sents(categories="obj")[:n_instances]
    ]

    # We separately split subjective and objective instances to keep a balanced
    # uniform class distribution in both train and test sets.
    train_subj_docs, test_subj_docs = split_train_test(subj_docs)
    train_obj_docs, test_obj_docs = split_train_test(obj_docs)

    training_docs = train_subj_docs + train_obj_docs
    testing_docs = test_subj_docs + test_obj_docs

    sentim_analyzer = SentimentAnalyzer()
    all_words_neg = sentim_analyzer.all_words(
        [mark_negation(doc) for doc in training_docs]
    )

    # Add simple unigram word features handling negation
    unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
    sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)

    # Apply features to obtain a feature-value representation of our datasets
    training_set = sentim_analyzer.apply_features(training_docs)
    test_set = sentim_analyzer.apply_features(testing_docs)

    classifier = sentim_analyzer.train(trainer, training_set)
    try:
        classifier.show_most_informative_features()
    except AttributeError:
        print(
            "Your classifier does not provide a show_most_informative_features() method."
        )
    results = sentim_analyzer.evaluate(test_set)

    if save_analyzer == True:
        save_file(sentim_analyzer, "sa_subjectivity.pickle")

    if output:
        extr = [f.__name__ for f in sentim_analyzer.feat_extractors]
        output_markdown(
            output,
            Dataset="subjectivity",
            Classifier=type(classifier).__name__,
            Tokenizer="WhitespaceTokenizer",
            Feats=extr,
            Instances=n_instances,
            Results=results,
        )

    return sentim_analyzer


def demo_sent_subjectivity(text):
    """
    Classify a single sentence as subjective or objective using a stored
    SentimentAnalyzer.

    :param text: a sentence whose subjectivity has to be classified.
    """
    from nltk.classify import NaiveBayesClassifier
    from nltk.tokenize import regexp

    word_tokenizer = regexp.WhitespaceTokenizer()
    try:
        sentim_analyzer = load("sa_subjectivity.pickle")
    except LookupError:
        print("Cannot find the sentiment analyzer you want to load.")
        print("Training a new one using NaiveBayesClassifier.")
        sentim_analyzer = demo_subjectivity(NaiveBayesClassifier.train, True)

    # Tokenize and convert to lower case
    tokenized_text = [word.lower() for word in word_tokenizer.tokenize(text)]
    print(sentim_analyzer.classify(tokenized_text))


def demo_liu_hu_lexicon(sentence, plot=False):
    """
    Basic example of sentiment classification using Liu and Hu opinion lexicon.
    This function simply counts the number of positive, negative and neutral words
    in the sentence and classifies it depending on which polarity is more represented.
    Words that do not appear in the lexicon are considered as neutral.

    :param sentence: a sentence whose polarity has to be classified.
    :param plot: if True, plot a visual representation of the sentence polarity.
    """
    from nltk.corpus import opinion_lexicon
    from nltk.tokenize import treebank

    tokenizer = treebank.TreebankWordTokenizer()
    pos_words = 0
    neg_words = 0
    tokenized_sent = [word.lower() for word in tokenizer.tokenize(sentence)]

    x = list(range(len(tokenized_sent)))  # x axis for the plot
    y = []

    for word in tokenized_sent:
        if word in opinion_lexicon.positive():
            pos_words += 1
            y.append(1)  # positive
        elif word in opinion_lexicon.negative():
            neg_words += 1
            y.append(-1)  # negative
        else:
            y.append(0)  # neutral

    if pos_words > neg_words:
        print("Positive")
    elif pos_words < neg_words:
        print("Negative")
    elif pos_words == neg_words:
        print("Neutral")

    if plot == True:
        _show_plot(
            x, y, x_labels=tokenized_sent, y_labels=["Negative", "Neutral", "Positive"]
        )


def demo_vader_instance(text):
    """
    Output polarity scores for a text using Vader approach.

    :param text: a text whose polarity has to be evaluated.
    """
    from nltk.sentiment import SentimentIntensityAnalyzer

    vader_analyzer = SentimentIntensityAnalyzer()
    print(vader_analyzer.polarity_scores(text))


def demo_vader_tweets(n_instances=None, output=None):
    """
    Classify 10000 positive and negative tweets using Vader approach.

    :param n_instances: the number of total tweets that have to be classified.
    :param output: the output file where results have to be reported.
    """
    from collections import defaultdict

    from nltk.corpus import twitter_samples
    from nltk.metrics import accuracy as eval_accuracy
    from nltk.metrics import f_measure as eval_f_measure
    from nltk.metrics import precision as eval_precision
    from nltk.metrics import recall as eval_recall
    from nltk.sentiment import SentimentIntensityAnalyzer

    if n_instances is not None:
        n_instances = int(n_instances / 2)

    fields = ["id", "text"]

    positive_json = twitter_samples.abspath("positive_tweets.json")
    positive_csv = "positive_tweets.csv"
    json2csv_preprocess(
        positive_json,
        positive_csv,
        fields,
        strip_off_emoticons=False,
        limit=n_instances,
    )

    negative_json = twitter_samples.abspath("negative_tweets.json")
    negative_csv = "negative_tweets.csv"
    json2csv_preprocess(
        negative_json,
        negative_csv,
        fields,
        strip_off_emoticons=False,
        limit=n_instances,
    )

    pos_docs = parse_tweets_set(positive_csv, label="pos")
    neg_docs = parse_tweets_set(negative_csv, label="neg")

    # We separately split subjective and objective instances to keep a balanced
    # uniform class distribution in both train and test sets.
    train_pos_docs, test_pos_docs = split_train_test(pos_docs)
    train_neg_docs, test_neg_docs = split_train_test(neg_docs)

    training_tweets = train_pos_docs + train_neg_docs
    testing_tweets = test_pos_docs + test_neg_docs

    vader_analyzer = SentimentIntensityAnalyzer()

    gold_results = defaultdict(set)
    test_results = defaultdict(set)
    acc_gold_results = []
    acc_test_results = []
    labels = set()
    num = 0
    for i, (text, label) in enumerate(testing_tweets):
        labels.add(label)
        gold_results[label].add(i)
        acc_gold_results.append(label)
        score = vader_analyzer.polarity_scores(text)["compound"]
        if score > 0:
            observed = "pos"
        else:
            observed = "neg"
        num += 1
        acc_test_results.append(observed)
        test_results[observed].add(i)
    metrics_results = {}
    for label in labels:
        accuracy_score = eval_accuracy(acc_gold_results, acc_test_results)
        metrics_results["Accuracy"] = accuracy_score
        precision_score = eval_precision(gold_results[label], test_results[label])
        metrics_results[f"Precision [{label}]"] = precision_score
        recall_score = eval_recall(gold_results[label], test_results[label])
        metrics_results[f"Recall [{label}]"] = recall_score
        f_measure_score = eval_f_measure(gold_results[label], test_results[label])
        metrics_results[f"F-measure [{label}]"] = f_measure_score

    for result in sorted(metrics_results):
        print(f"{result}: {metrics_results[result]}")

    if output:
        output_markdown(
            output,
            Approach="Vader",
            Dataset="labeled_tweets",
            Instances=n_instances,
            Results=metrics_results,
        )


if __name__ == "__main__":
    from sklearn.svm import LinearSVC

    from nltk.classify import MaxentClassifier, NaiveBayesClassifier
    from nltk.classify.scikitlearn import SklearnClassifier
    from nltk.twitter.common import _outf_writer, extract_fields

    naive_bayes = NaiveBayesClassifier.train
    svm = SklearnClassifier(LinearSVC()).train
    maxent = MaxentClassifier.train

    demo_tweets(naive_bayes)