"""
S-Expression Tokenizer

``SExprTokenizer`` is used to find parenthesized expressions in a
string.  In particular, it divides a string into a sequence of
substrings that are either parenthesized expressions (including any
nested parenthesized expressions), or other whitespace-separated
tokens.

    >>> from nltk.tokenize import SExprTokenizer
    >>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
    ['(a b (c d))', 'e', 'f', '(g)']

By default, `SExprTokenizer` will raise a ``ValueError`` exception if
used to tokenize an expression with non-matching parentheses:

    >>> SExprTokenizer().tokenize('c) d) e (f (g')
    Traceback (most recent call last):
      ...
    ValueError: Un-matched close paren at char 1

The ``strict`` argument can be set to False to allow for
non-matching parentheses.  Any unmatched close parentheses will be
listed as their own s-expression; and the last partial sexpr with
unmatched open parentheses will be listed as its own sexpr:

    >>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
    ['c', ')', 'd', ')', 'e', '(f (g']

The characters used for open and close parentheses may be customized
using the ``parens`` argument to the `SExprTokenizer` constructor:

    >>> SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
    ['{a b {c d}}', 'e', 'f', '{g}']

The s-expression tokenizer is also available as a function:

    >>> from nltk.tokenize import sexpr_tokenize
    >>> sexpr_tokenize('(a b (c d)) e f (g)')
    ['(a b (c d))', 'e', 'f', '(g)']

"""

import re

from nltk.tokenize.api import TokenizerI


class SExprTokenizer(TokenizerI):
    """
    A tokenizer that divides strings into s-expressions.
    An s-expression can be either:

      - a parenthesized expression, including any nested parenthesized
        expressions, or
      - a sequence of non-whitespace non-parenthesis characters.

    For example, the string ``(a (b c)) d e (f)`` consists of four
    s-expressions: ``(a (b c))``, ``d``, ``e``, and ``(f)``.

    By default, the characters ``(`` and ``)`` are treated as open and
    close parentheses, but alternative strings may be specified.

    :param parens: A two-element sequence specifying the open and close parentheses
        that should be used to find sexprs.  This will typically be either a
        two-character string, or a list of two strings.
    :type parens: str or list
    :param strict: If true, then raise an exception when tokenizing an ill-formed sexpr.
    """

    def __init__(self, parens="()", strict=True):
        if len(parens) != 2:
            raise ValueError("parens must contain exactly two strings")
        self._strict = strict
        self._open_paren = parens[0]
        self._close_paren = parens[1]
        # Match either delimiter; escape both so that regex metacharacters
        # such as "(" or "{" are treated literally.
        self._paren_regexp = re.compile(
            f"{re.escape(parens[0])}|{re.escape(parens[1])}"
        )

    def tokenize(self, text):
        """
        Return a list of s-expressions extracted from *text*.
        For example:

            >>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
            ['(a b (c d))', 'e', 'f', '(g)']

        All parentheses are assumed to mark s-expressions.
        (No special processing is done to exclude parentheses that occur
        inside strings, or following backslash characters.)

        If the given expression contains non-matching parentheses,
        then the behavior of the tokenizer depends on the ``strict``
        parameter to the constructor.  If ``strict`` is ``True``, then
        raise a ``ValueError``.  If ``strict`` is ``False``, then any
        unmatched close parentheses will be listed as their own
        s-expression; and the last partial s-expression with unmatched open
        parentheses will be listed as its own s-expression:

            >>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
            ['c', ')', 'd', ')', 'e', '(f (g']

        :param text: the string to be tokenized
        :type text: str or iter(str)
        :rtype: iter(str)
        """
        result = []
        pos = 0  # start of the text not yet emitted as a token
        depth = 0  # current parenthesis nesting depth
        for m in self._paren_regexp.finditer(text):
            paren = m.group()
            if depth == 0:
                # Outside any s-expression: emit whitespace-separated tokens.
                result += text[pos : m.start()].split()
                pos = m.start()
            if paren == self._open_paren:
                depth += 1
            if paren == self._close_paren:
                if self._strict and depth == 0:
                    raise ValueError("Un-matched close paren at char %d" % m.start())
                depth = max(0, depth - 1)
                if depth == 0:
                    # A top-level s-expression (or a stray close paren) ends here.
                    result.append(text[pos : m.end()])
                    pos = m.end()
        if self._strict and depth > 0:
            raise ValueError("Un-matched open paren at char %d" % pos)
        if pos < len(text):
            # Trailing text: an unclosed partial s-expression or leftover tokens.
            result.append(text[pos:])
        return result


sexpr_tokenize = SExprTokenizer().tokenize