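"""
Simple classifier for RTE corpus.

It calculates the overlap in words and named entities between text and
hypothesis, and also whether there are words / named entities in the
hypothesis which fail to occur in the text, since this is an indicator that
the hypothesis is more informative than (i.e. not entailed by) the text.

TO DO: better Named Entity classification
TO DO: add lemmatization
"""

from nltk.classify.maxent import MaxentClassifier
from nltk.classify.util import accuracy
from nltk.tokenize import RegexpTokenizer


class RTEFeatureExtractor:
    """
    This builds a bag of words for both the text and the hypothesis after
    throwing away some stopwords, then calculates overlap and difference.
    """

    def __init__(self, rtepair, stop=True, use_lemmatize=False):
        """
        :param rtepair: a ``RTEPair`` from which features should be extracted
        :param stop: if ``True``, stopwords are thrown away.
        :type stop: bool
        """
        self.stop = stop
        self.stopwords = {
            "a", "the", "it", "they", "of", "in", "to", "is",
            "have", "are", "were", "and", "very", ".", ",",
        }
        self.negwords = {"no", "not", "never", "failed", "rejected", "denied"}

        # Keep tokens containing '.', '@', ':' or '/' together, and match
        # dollar amounts such as "$3.50" as a fallback alternative.
        tokenizer = RegexpTokenizer(r"[\w.@:/]+|\w+|\$[\d.]+")

        # Token lists and word-type sets for the text and the hypothesis.
        self.text_tokens = tokenizer.tokenize(rtepair.text)
        self.hyp_tokens = tokenizer.tokenize(rtepair.hyp)
        self.text_words = set(self.text_tokens)
        self.hyp_words = set(self.hyp_tokens)

        if use_lemmatize:
            self.text_words = {self._lemmatize(token) for token in self.text_tokens}
            self.hyp_words = {self._lemmatize(token) for token in self.hyp_tokens}

        if self.stop:
            self.text_words = self.text_words - self.stopwords
            self.hyp_words = self.hyp_words - self.stopwords

        self._overlap = self.hyp_words & self.text_words
        self._hyp_extra = self.hyp_words - self.text_words
        self._txt_extra = self.text_words - self.hyp_words

    def overlap(self, toktype, debug=False):
        """
        Compute the overlap between text and hypothesis.

        :param toktype: distinguish Named Entities from ordinary words
        :type toktype: 'ne' or 'word'
        """
        # Mirrors hyp_extra, applied to the tokens shared by text and hypothesis.
        ne_overlap = {token for token in self._overlap if self._ne(token)}
        if toktype == "ne":
            if debug:
                print("ne overlap", ne_overlap)
            return ne_overlap
        elif toktype == "word":
            if debug:
                print("word overlap", self._overlap - ne_overlap)
            return self._overlap - ne_overlap
        else:
            raise ValueError("Type not recognized: '%s'" % toktype)

    def hyp_extra(self, toktype, debug=True):
        """
        Compute the extraneous material in the hypothesis.

        :param toktype: distinguish Named Entities from ordinary words
        :type toktype: 'ne' or 'word'
        """
        ne_extra = {token for token in self._hyp_extra if self._ne(token)}
        if toktype == "ne":
            return ne_extra
        elif toktype == "word":
            return self._hyp_extra - ne_extra
        else:
            raise ValueError("Type not recognized: '%s'" % toktype)

    @staticmethod
    def _ne(token):
        """
        This just assumes that words in all caps or titles are
        named entities.

        :type token: str
        """
        if token.istitle() or token.isupper():
            return True
        return False

    @staticmethod
    def _lemmatize(word):
        """
        Use morphy from WordNet to find the base form of verbs.
        """
        from nltk.corpus import wordnet as wn

        lemma = wn.morphy(word, pos=wn.VERB)
        if lemma is not None:
            return lemma
        return word


def rte_features(rtepair):
    # Map one RTE pair to a dict of overlap, extra-material and negation counts.
    extractor = RTEFeatureExtractor(rtepair)
    features = {}
    features["alwayson"] = True
    features["word_overlap"] = len(extractor.overlap("word"))
    features["word_hyp_extra"] = len(extractor.hyp_extra("word"))
    features["ne_overlap"] = len(extractor.overlap("ne"))
    features["ne_hyp_extra"] = len(extractor.hyp_extra("ne"))
    features["neg_txt"] = len(extractor.negwords & extractor.text_words)
    features["neg_hyp"] = len(extractor.negwords & extractor.hyp_words)
    return features


def rte_featurize(rte_pairs):
    # Pair each feature dict with the gold label stored on the RTE pair.
    return [(rte_features(pair), pair.value) for pair in rte_pairs]


def rte_classifier(algorithm, sample_N=None):
    from nltk.corpus import rte as rte_corpus

    train_set = rte_corpus.pairs(["rte1_dev.xml", "rte2_dev.xml", "rte3_dev.xml"])
    test_set = rte_corpus.pairs(["rte1_test.xml", "rte2_test.xml", "rte3_test.xml"])

    # Optionally truncate both sets to the first sample_N pairs.
    if sample_N is not None:
        train_set = train_set[:sample_N]
        test_set = test_set[:sample_N]

    featurized_train_set = rte_featurize(train_set)
    featurized_test_set = rte_featurize(test_set)

    # Train the classifier
    print("Training classifier...")
    if algorithm in ["megam"]:
        clf = MaxentClassifier.train(featurized_train_set, algorithm)
    elif algorithm in ["GIS", "IIS"]:
        clf = MaxentClassifier.train(featurized_train_set, algorithm)
    else:
        err_msg = str(
            "RTEClassifier only supports these algorithms:\n "
            "'megam', 'GIS', 'IIS'.\n"
        )
        raise Exception(err_msg)

    # Evaluate on the held-out test pairs
    print("Testing classifier...")
    acc = accuracy(clf, featurized_test_set)
    print("Accuracy: %6.4f" % acc)
    return clf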