# Natural Language Toolkit: TextTiling
# nltk/tokenize/texttiling.py

import math
import re

try:
    import numpy
except ImportError:
    pass

from nltk.tokenize.api import TokenizerI

BLOCK_COMPARISON, VOCABULARY_INTRODUCTION = 0, 1
LC, HC = 0, 1
DEFAULT_SMOOTHING = [0]


class TextTilingTokenizer(TokenizerI):
    """Tokenize a document into topical sections using the TextTiling algorithm.
    This algorithm detects subtopic shifts based on the analysis of lexical
    co-occurrence patterns.

    The process starts by tokenizing the text into pseudosentences of
    a fixed size w. Then, depending on the method used, similarity
    scores are assigned at sentence gaps. The algorithm proceeds by
    detecting the peak differences between these scores and marking
    them as boundaries. The boundaries are normalized to the closest
    paragraph break and the segmented text is returned.

    :param w: Pseudosentence size
    :type w: int
    :param k: Size (in sentences) of the block used in the block comparison method
    :type k: int
    :param similarity_method: The method used for determining similarity scores:
       `BLOCK_COMPARISON` (default) or `VOCABULARY_INTRODUCTION`.
    :type similarity_method: constant
    :param stopwords: A list of stopwords that are filtered out (defaults
       to NLTK's stopwords corpus)
    :type stopwords: list(str)
    :param smoothing_method: The method used for smoothing the score plot:
       `DEFAULT_SMOOTHING` (default)
    :type smoothing_method: constant
    :param smoothing_width: The width of the window used by the smoothing method
    :type smoothing_width: int
    :param smoothing_rounds: The number of smoothing passes
    :type smoothing_rounds: int
    :param cutoff_policy: The policy used to determine the number of boundaries:
       `HC` (default) or `LC`
    :type cutoff_policy: constant

    >>> from nltk.corpus import brown
    >>> tt = TextTilingTokenizer(demo_mode=True)
    >>> text = brown.raw()[:4000]
    >>> s, ss, d, b = tt.tokenize(text)
    >>> b
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
    """

    def __init__(
        self,
        w=20,
        k=10,
        similarity_method=BLOCK_COMPARISON,
        stopwords=None,
        smoothing_method=DEFAULT_SMOOTHING,
        smoothing_width=2,
        smoothing_rounds=1,
        cutoff_policy=HC,
        demo_mode=False,
    ):

        if stopwords is None:
            from nltk.corpus import stopwords

            stopwords = stopwords.words("english")
        self.__dict__.update(locals())
        del self.__dict__["self"]

    def tokenize(self, text):
        """Return a tokenized copy of *text*, where each "token" represents
        a separate topic."""

        lowercase_text = text.lower()
        paragraph_breaks = self._mark_paragraph_breaks(text)
        text_length = len(lowercase_text)

        # Tokenization step starts here

        # Remove punctuation
        nopunct_text = "".join(
            c for c in lowercase_text if re.match(r"[a-z\-' \n\t]", c)
        )
        nopunct_par_breaks = self._mark_paragraph_breaks(nopunct_text)

        tokseqs = self._divide_to_tokensequences(nopunct_text)

        # Filter stopwords
        for ts in tokseqs:
            ts.wrdindex_list = [
                wi for wi in ts.wrdindex_list if wi[0] not in self.stopwords
            ]

        token_table = self._create_token_table(tokseqs, nopunct_par_breaks)
        # End of the Tokenization step

        # Lexical score determination
        if self.similarity_method == BLOCK_COMPARISON:
            gap_scores = self._block_comparison(tokseqs, token_table)
        elif self.similarity_method == VOCABULARY_INTRODUCTION:
            raise NotImplementedError("Vocabulary introduction not implemented")
        else:
            raise ValueError(
                f"Similarity method {self.similarity_method} not recognized"
            )

        if self.smoothing_method == DEFAULT_SMOOTHING:
            smooth_scores = self._smooth_scores(gap_scores)
        else:
            raise ValueError(
                f"Smoothing method {self.smoothing_method} not recognized"
            )
        # End of Lexical score determination

        # Boundary identification
        depth_scores = self._depth_scores(smooth_scores)
        segment_boundaries = self._identify_boundaries(depth_scores)

        normalized_boundaries = self._normalize_boundaries(
            text, segment_boundaries, paragraph_breaks
        )
        # End of Boundary identification

        segmented_text = []
        prevb = 0

        for b in normalized_boundaries:
            if b == 0:
                continue
            segmented_text.append(text[prevb:b])
            prevb = b

        if prevb < text_length:  # append any text that may be remaining
            segmented_text.append(text[prevb:])

        if not segmented_text:
            segmented_text = [text]

        if self.demo_mode:
            return gap_scores, smooth_scores, depth_scores, segment_boundaries
        return segmented_text

    def _block_comparison(self, tokseqs, token_table):
        """Implements the block comparison method"""

        def blk_frq(tok, block):
            ts_occs = filter(lambda o: o[0] in block, token_table[tok].ts_occurences)
            freq = sum(tsocc[1] for tsocc in ts_occs)
            return freq

        gap_scores = []
        numgaps = len(tokseqs) - 1

        for curr_gap in range(numgaps):
            score_dividend, score_divisor_b1, score_divisor_b2 = 0.0, 0.0, 0.0
            score = 0.0
            # adjust window size for boundary conditions
            if curr_gap < self.k - 1:
                window_size = curr_gap + 1
            elif curr_gap > numgaps - self.k:
                window_size = numgaps - curr_gap
            else:
                window_size = self.k

            b1 = [ts.index for ts in tokseqs[curr_gap - window_size + 1 : curr_gap + 1]]
            b2 = [ts.index for ts in tokseqs[curr_gap + 1 : curr_gap + window_size + 1]]

            for t in token_table:
                score_dividend += blk_frq(t, b1) * blk_frq(t, b2)
                score_divisor_b1 += blk_frq(t, b1) ** 2
                score_divisor_b2 += blk_frq(t, b2) ** 2
            try:
                score = score_dividend / math.sqrt(
                    score_divisor_b1 * score_divisor_b2
                )
            except ZeroDivisionError:
                pass  # score remains 0.0

            gap_scores.append(score)

        return gap_scores

    def _smooth_scores(self, gap_scores):
        "Wraps the smooth function from the SciPy Cookbook"
        return list(
            smooth(numpy.array(gap_scores[:]), window_len=self.smoothing_width + 1)
        )

    def _mark_paragraph_breaks(self, text):
        """Identifies indented text or line breaks as the beginning of
        paragraphs"""
        MIN_PARAGRAPH = 100
        pattern = re.compile("[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*")
        matches = pattern.finditer(text)

        last_break = 0
        pbreaks = [0]
        for pb in matches:
            if pb.start() - last_break < MIN_PARAGRAPH:
                continue
            else:
                pbreaks.append(pb.start())
                last_break = pb.start()

        return pbreaks

    def _divide_to_tokensequences(self, text):
        "Divides the text into pseudosentences of fixed size"
        w = self.w
        wrdindex_list = []
        matches = re.finditer(r"\w+", text)
        for match in matches:
            wrdindex_list.append((match.group(), match.start()))
        return [
            TokenSequence(i / w, wrdindex_list[i : i + w])
            for i in range(0, len(wrdindex_list), w)
        ]

    def _create_token_table(self, token_sequences, par_breaks):
        "Creates a table of TokenTableFields"
        token_table = {}
        current_par = 0
        current_tok_seq = 0
        pb_iter = par_breaks.__iter__()
        current_par_break = next(pb_iter)
        if current_par_break == 0:
            try:
                current_par_break = next(pb_iter)  # skip break at 0
            except StopIteration as e:
                raise ValueError(
                    "No paragraph breaks were found(text too short perhaps?)"
                ) from e
        for ts in token_sequences:
            for word, index in ts.wrdindex_list:
                try:
                    while index > current_par_break:
                        current_par_break = next(pb_iter)
                        current_par += 1
                except StopIteration:
                    # hit bottom
                    pass

                if word in token_table:
                    token_table[word].total_count += 1

                    if token_table[word].last_par != current_par:
                        token_table[word].last_par = current_par
                        token_table[word].par_count += 1

                    if token_table[word].last_tok_seq != current_tok_seq:
                        token_table[word].last_tok_seq = current_tok_seq
                        token_table[word].ts_occurences.append([current_tok_seq, 1])
                    else:
                        token_table[word].ts_occurences[-1][1] += 1
                else:  # new word
                    token_table[word] = TokenTableField(
                        first_pos=index,
                        ts_occurences=[[current_tok_seq, 1]],
                        total_count=1,
                        par_count=1,
                        last_par=current_par,
                        last_tok_seq=current_tok_seq,
                    )

            current_tok_seq += 1

        return token_table

    def _identify_boundaries(self, depth_scores):
        """Identifies boundaries at the peaks of similarity score
        differences"""

        boundaries = [0 for x in depth_scores]

        avg = sum(depth_scores) / len(depth_scores)
        stdev = numpy.std(depth_scores)

        if self.cutoff_policy == LC:
            cutoff = avg - stdev
        else:
            cutoff = avg - stdev / 2.0

        depth_tuples = sorted(zip(depth_scores, range(len(depth_scores))))
        depth_tuples.reverse()
        hp = list(filter(lambda x: x[0] > cutoff, depth_tuples))

        for dt in hp:
            boundaries[dt[1]] = 1
            for dt2 in hp:  # undo if there is a boundary close already
                if (
                    dt[1] != dt2[1]
                    and abs(dt2[1] - dt[1]) < 4
                    and boundaries[dt2[1]] == 1
                ):
                    boundaries[dt[1]] = 0
        return boundaries

    def _depth_scores(self, scores):
        """Calculates the depth of each gap, i.e. the average difference
        between the left and right peaks and the gap's score"""

        depth_scores = [0 for x in scores]
        # clip boundaries: this holds on the rule of thumb that a
        # section shouldn't be smaller than at least 2 pseudosentences
        # for small texts and around 5 for larger ones.
        clip = min(max(len(scores) // 10, 2), 5)
        index = clip

        for gapscore in scores[clip:-clip]:
            lpeak = gapscore
            for score in scores[index::-1]:
                if score >= lpeak:
                    lpeak = score
                else:
                    break
            rpeak = gapscore
            for score in scores[index:]:
                if score >= rpeak:
                    rpeak = score
                else:
                    break
            depth_scores[index] = lpeak + rpeak - 2 * gapscore
            index += 1

        return depth_scores

    def _normalize_boundaries(self, text, boundaries, paragraph_breaks):
        """Normalize the boundaries identified to the original text's
        paragraph breaks"""

        norm_boundaries = []
        char_count, word_count, gaps_seen = 0, 0, 0
        seen_word = False

        for char in text:
            char_count += 1
            if char in " \t\n" and seen_word:
                seen_word = False
                word_count += 1
            if char not in " \t\n" and not seen_word:
                seen_word = True
            if gaps_seen < len(boundaries) and word_count > (
                max(gaps_seen * self.w, self.w)
            ):
                if boundaries[gaps_seen] == 1:
                    # find closest paragraph break
                    best_fit = len(text)
                    for br in paragraph_breaks:
                        if best_fit > abs(br - char_count):
                            best_fit = abs(br - char_count)
                            bestbr = br
                        else:
                            break
                    if bestbr not in norm_boundaries:  # avoid duplicates
                        norm_boundaries.append(bestbr)
                gaps_seen += 1

        return norm_boundaries


class TokenTableField:
    """A field in the token table holding parameters for each token,
    used later in the process"""

    def __init__(
        self,
        first_pos,
        ts_occurences,
        total_count=1,
        par_count=1,
        last_par=0,
        last_tok_seq=None,
    ):
        self.__dict__.update(locals())
        del self.__dict__["self"]


class TokenSequence:
    "A token list with its original length and its index"

    def __init__(self, index, wrdindex_list, original_length=None):
        original_length = original_length or len(wrdindex_list)
        self.__dict__.update(locals())
        del self.__dict__["self"]


# Pasted from the SciPy cookbook: https://www.scipy.org/Cookbook/SignalSmooth
def smooth(x, window_len=11, window="flat"):
    """smooth the data using a window with requested size.

    This method is based on the convolution of a scaled window with the signal.
    The signal is prepared by introducing reflected copies of the signal
    (with the window size) in both ends so that transient parts are minimized
    in the beginning and end part of the output signal.

    :param x: the input signal
    :param window_len: the dimension of the smoothing window; should be an odd integer
    :param window: the type of window from 'flat', 'hanning', 'hamming', 'bartlett',
        'blackman'; a flat window will produce a moving average smoothing.

    :return: the smoothed signal

    example::

        t = linspace(-2, 2, 0.1)
        x = sin(t) + randn(len(t)) * 0.1
        y = smooth(x)

    :see also: numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman,
        numpy.convolve, scipy.signal.lfilter

    TODO: the window parameter could be the window itself if an array instead of a string
    """

    if x.ndim != 1:
        raise ValueError("smooth only accepts 1 dimension arrays.")

    if x.size < window_len:
        raise ValueError("Input vector needs to be bigger than window size.")

    if window_len < 3:
        return x

    if window not in ["flat", "hanning", "hamming", "bartlett", "blackman"]:
        raise ValueError(
            "Window is one of 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'"
        )

    s = numpy.r_[2 * x[0] - x[window_len:1:-1], x, 2 * x[-1] - x[-1:-window_len:-1]]

    if window == "flat":  # moving average
        w = numpy.ones(window_len, "d")
    else:
        w = eval("numpy." + window + "(window_len)")

    y = numpy.convolve(w / w.sum(), s, mode="same")

    return y[window_len - 1 : -window_len + 1]


def demo(text=None):
    from matplotlib import pylab

    from nltk.corpus import brown

    tt = TextTilingTokenizer(demo_mode=True)
    if text is None:
        text = brown.raw()[:10000]
    s, ss, d, b = tt.tokenize(text)
    pylab.xlabel("Sentence Gap index")
    pylab.ylabel("Gap Scores")
    pylab.plot(range(len(s)), s, label="Gap Scores")
    pylab.plot(range(len(ss)), ss, label="Smoothed Gap scores")
    pylab.plot(range(len(d)), d, label="Depth scores")
    pylab.stem(range(len(b)), b)
    pylab.legend()
    pylab.show()
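The block-comparison score computed by `_block_comparison` above is a cosine similarity between the term-frequency vectors of the blocks on either side of a gap. A minimal standalone sketch of that formula (the `block_score` helper is hypothetical, not part of NLTK, and operates on plain word lists rather than the module's token table):

```python
import math
from collections import Counter

def block_score(block1, block2):
    """Cosine similarity between two blocks of words (lists of tokens)."""
    f1, f2 = Counter(block1), Counter(block2)
    # dot product of the two term-frequency vectors
    dividend = sum(f1[t] * f2[t] for t in f1)
    # product of the vector magnitudes
    divisor = math.sqrt(
        sum(c * c for c in f1.values()) * sum(c * c for c in f2.values())
    )
    return dividend / divisor if divisor else 0.0

# Identical blocks score 1.0; blocks sharing no vocabulary score 0.0.
print(block_score(["cat", "sat", "mat"], ["cat", "sat", "mat"]))  # 1.0
print(block_score(["cat"], ["dog"]))  # 0.0
```

A low score at a gap therefore signals a vocabulary shift, which is exactly what the depth-scoring step then quantifies.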
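The depth rule in `_depth_scores` can be isolated into a small self-contained function (a sketch for illustration, not the module's API): the depth of a gap is `lpeak + rpeak - 2 * score`, where each peak is found by walking outward from the gap while scores keep rising.

```python
def depth_score(scores, i):
    """Depth of gap i: how far the score dips below its flanking peaks."""
    lpeak = scores[i]
    for s in scores[i::-1]:   # walk left while scores keep rising
        if s >= lpeak:
            lpeak = s
        else:
            break
    rpeak = scores[i]
    for s in scores[i:]:      # walk right while scores keep rising
        if s >= rpeak:
            rpeak = s
        else:
            break
    return lpeak + rpeak - 2 * scores[i]

scores = [0.2, 0.5, 0.1, 0.6, 0.3]
# The valley at index 2 is the deepest gap: 0.5 + 0.6 - 2*0.1 = 0.9
print([round(depth_score(scores, i), 3) for i in range(len(scores))])
```

Gaps sitting on a local maximum (e.g. index 1 above) get depth 0, so only genuine valleys in the smoothed score curve become boundary candidates.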