JL i= dZddlZddlmZddlZddlmZdZdZdZ dZ eed d d d d de df Z e de ge ddZ ejdZejeejej zej"zZejdZejdZddZddZGddeZdZdZ ddZy)a Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks. The basic logic is this: 1. The tuple REGEXPS defines a list of regular expression strings. 2. The REGEXPS strings are put, in order, into a compiled regular expression object called WORD_RE, under the TweetTokenizer class. 3. The tokenization is done by WORD_RE.findall(s), where s is the user-supplied string, inside the tokenize() method of the class TweetTokenizer. 4. When instantiating Tokenizer objects, there are several options: * preserve_case. By default, it is set to True. If it is set to False, then the tokenizer will downcase everything except for emoticons. * reduce_len. By default, it is set to False. It specifies whether to replace repeated character sequences of length 3 or greater with sequences of length 3. * strip_handles. By default, it is set to False. It specifies whether to remove Twitter handles of text used in the `tokenize` method. * match_phone_numbers. By default, it is set to True. It indicates whether the `tokenize` method should look for phone numbers. N)List) TokenizerIac (?: [<>]? [:;=8] # eyes [\-o\*\']? # optional nose [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth | [\)\]\(\[dDpP/\:\}\{@\|\\] # mouth [\-o\*\']? # optional nose [:;=8] # eyes [<>]? | {}\[\]]+ # Run of non-space, non-()<>{}[] | # or \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (...(...)...) | \([^\s]+?\) # balanced parens, non-recursive: (...) )+ (?: # End with: \([^\s()]*?\([^\s()]+\)[^\s()]*?\) # balanced parens, one level deep: (...(...)...) | \([^\s]+?\) # balanced parens, non-recursive: (...) 
| # or [^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars ) | # OR, the following to match naked domains: (?: (?\s]+>z [\-]+>|<[\-]+z (?:@[\w_]+)z(?:\#+[\w_]+[\w\'_\-]*[\w_]+)z#[\w.+-]+@[\w-]+\.(?:[\w-]\.?)+[\w-]uR.(?: [🏻-🏿]?(?:‍.[🏻-🏿]?)+ | [🏻-🏿] )a (?:[^\W\d_](?:[^\W\d_]|['\-_])+[^\W\d_]) # Words with apostrophes or dashes. | (?:[+\-]?\d+[,/.:-]\d+[+\-]?) # Numbers, including fractions, decimals. | (?:[\w_]+) # Words without apostrophes or dashes. | (?:\.(?:\s*\.){1,}) # Ellipsis dots. | (?:\S) # Everything else that isn't whitespace. z([^a-zA-Z0-9])\1{3,}z&(#?(x?))([^&;\s]+);zZ(?>> from nltk.tokenize.casual import _replace_html_entities >>> _replace_html_entities(b'Price: £100') 'Price: \xa3100' >>> print(_replace_html_entities(b'Price: £100')) Price: £100 >>> c|jd}|jdrU |jdr t|d}n t|d}d|cxkrdkrnnt|fjdSn>|vr|jd St j jj|}| t|Srd S|jd S#t$rd}Y0wxYw#ttf$rY7wxYw) Nr cp1252r) groupintr r ValueErrorhtmlentitiesname2codepointgetchr OverflowError)match entity_bodynumberkeepremove_illegals r_convert_entityz/_replace_html_entities.._convert_entityskk!n ;;q> ;;q> b1F b1F 6)T) &+228<<d"{{1~%]]1155kBF   6{"$r7Q7   .  s$AC: C+ C('C(+C=<C=)ENT_REsubr)r r'r(r r)s `` r_replace_html_entitiesr,s"888 ::otX'F GGrcbeZdZdZdZdZ d dZdedeefdZ e d dZ e d dZ y) TweetTokenizera Tokenizer for tweets. >>> from nltk.tokenize import TweetTokenizer >>> tknzr = TweetTokenizer() >>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--" >>> tknzr.tokenize(s0) # doctest: +NORMALIZE_WHITESPACE ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--'] Examples using `strip_handles` and `reduce_len parameters`: >>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True) >>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!' 
>>> tknzr.tokenize(s1) [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!'] Nc<||_||_||_||_y)ae Create a `TweetTokenizer` instance with settings for use in the `tokenize` method. :param preserve_case: Flag indicating whether to preserve the casing (capitalisation) of text used in the `tokenize` method. Defaults to True. :type preserve_case: bool :param reduce_len: Flag indicating whether to replace repeated character sequences of length 3 or greater with sequences of length 3. Defaults to False. :type reduce_len: bool :param strip_handles: Flag indicating whether to remove Twitter handles of text used in the `tokenize` method. Defaults to False. :type strip_handles: bool :param match_phone_numbers: Flag indicating whether the `tokenize` method should look for phone numbers. Defaults to True. :type match_phone_numbers: bool N preserve_case reduce_len strip_handlesmatch_phone_numbers)selfr1r2r3r4s r__init__zTweetTokenizer.__init__Ls#.+$*#6 rr returncnt|}|jr t|}|jr t |}t j d|}|jr|jj|}n|jj|}|jsttd|}|S)zTokenize the input text. :param text: str :rtype: list(str) :return: a tokenized list of strings; joining this list returns the original string if `preserve_case=False`. \1\1\1cPtj|r|S|jS)N) EMOTICON_REsearchlower)xs rz)TweetTokenizer.tokenize..sK$6$6q$9qqwwyr)r,r3remove_handlesr2reduce_lengtheningHANG_REr+r4 PHONE_WORD_REfindallWORD_REr1listmap)r5r safe_textwordss rtokenizezTweetTokenizer.tokenizehs&d+   !$'D ??%d+DKK 40  # #&&..y9ELL((3E!!H5QE rc,t|jsktjddj t dtj tjztjzt|_t|jS)zCore TweetTokenizer regex(|)) type_WORD_REregexcompilejoinREGEXPSVERBOSEIUNICODEr5s rrEzTweetTokenizer.WORD_REsgDz"""'--CHHW%&a( '%--7#DJ Dz"""rc,t|jsktjddj t dtj tjztjzt|_t|jS)z#Secondary core TweetTokenizer regexrLrMrN) rO_PHONE_WORD_RErQrRrS REGEXPS_PHONErUrVrWrXs rrCzTweetTokenizer.PHONE_WORD_REsgDz(((- CHH]+,A. 
'%--7)DJ %Dz(((rTFFT)r7z regex.Pattern) __name__ __module__ __qualname____doc__rPrZr6strrrJpropertyrErCrrr.r.2se(HN  78ST#Y<##))rr.cPtjd}|jd|S)ze Replace repeated character sequences of length 3 or greater with sequences of length 3. z (.)\1{2,}r9)rQrRr+)r patterns rrArAs# mmL)G ;;y$ ''rc.tjd|S)z4 Remove Twitter username handles from text.  ) HANDLES_REr+)r s rr@r@s >>#t $$rc>t||||j|S)z: Convenience function for wrapping the tokenizer. r0)r.rJ)r r1r2r3r4s rcasual_tokenizerjs( ##/   htn r)Nstrict)rcTrr\)r`rtypingrrQnltk.tokenize.apir EMOTICONSURLSFLAGS PHONE_REGEXrTr[rRrBrUrVrWr;r*rhrr,r.rAr@rjrcrrrrs@  ($ $)f   $  (.   /" J[7712;7  %--/ 0emmIu}}uww'>'NO  . /U]]H 8H|h)Zh)`(% r
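The two preprocessing helpers described in the docstrings above (`reduce_lengthening` and `remove_handles`) can be sketched as follows. This is a minimal illustration, not the module's own source. The `(.)\1{2,}` pattern and the `\1\1\1` replacement are taken from the strings visible above. The handle pattern is a simplified stand-in chosen for this example, and the standard-library `re` module is assumed here in place of the third-party `regex` package that the module imports.

```python
import re

# Pattern and replacement as they appear in the module's string constants:
# any character repeated three or more times is cut down to exactly three.
_LENGTHENING_RE = re.compile(r"(.)\1{2,}")

# ASSUMPTION: simplified handle pattern for illustration only; the module's
# real HANDLES_RE is stricter and is not reproduced here.
_SIMPLE_HANDLES_RE = re.compile(r"(?<!\w)@\w{1,15}")


def reduce_lengthening(text: str) -> str:
    """Replace runs of 3 or more identical characters with runs of 3."""
    return _LENGTHENING_RE.sub(r"\1\1\1", text)


def remove_handles(text: str) -> str:
    """Replace Twitter-style handles with a single space."""
    return _SIMPLE_HANDLES_RE.sub(" ", text)


if __name__ == "__main__":
    s1 = "@remy: This is waaaaayyyy too much for you!!!!!!"
    print(reduce_lengthening(remove_handles(s1)))
```

Replacing a handle with a space rather than the empty string keeps the neighbouring tokens separated, so the later word-level regex does not fuse them. The negative lookbehind in the simplified pattern leaves e-mail addresses such as `a@b.com` untouched.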