"""
Text Segmentation Metrics

1. Windowdiff

Pevzner, L., and Hearst, M., A Critique and Improvement of
  an Evaluation Metric for Text Segmentation,
  Computational Linguistics 28, 19-36


2. Generalized Hamming Distance

Bookstein A., Kulyukin V.A., Raita T.
Generalized Hamming Distance
Information Retrieval 5, 2002, pp 353-375

Baseline implementation in C++
http://digital.cs.usu.edu/~vkulyukin/vkweb/software/ghd/ghd.html

Study describing benefits of Generalized Hamming Distance Versus
WindowDiff for evaluating text segmentation tasks
Bestgen, Y.  Quel indice pour mesurer l'efficacite en segmentation de textes ?
TALN 2009


3. Pk text segmentation metric

Beeferman D., Berger A., Lafferty J. (1999)
Statistical Models for Text Segmentation
Machine Learning, 34, 177-210
"""

try:
    import numpy as np
except ImportError:
    # numpy is only needed by ghd(); windowdiff() and pk() work without it.
    pass


def windowdiff(seg1, seg2, k, boundary="1", weighted=False):
    """
    Compute the windowdiff score for a pair of segmentations.  A
    segmentation is any sequence over a vocabulary of two items
    (e.g. "0", "1"), where the specified boundary value is used to
    mark the edge of a segmentation.

        >>> s1 = "000100000010"
        >>> s2 = "000010000100"
        >>> s3 = "100000010000"
        >>> '%.2f' % windowdiff(s1, s1, 3)
        '0.00'
        >>> '%.2f' % windowdiff(s1, s2, 3)
        '0.30'
        >>> '%.2f' % windowdiff(s2, s3, 3)
        '0.80'

    :param seg1: a segmentation
    :type seg1: str or list
    :param seg2: a segmentation
    :type seg2: str or list
    :param k: window width
    :type k: int
    :param boundary: boundary value
    :type boundary: str or int or bool
    :param weighted: use the weighted variant of windowdiff
    :type weighted: boolean
    :rtype: float
    """

    if len(seg1) != len(seg2):
        raise ValueError("Segmentations have unequal length")
    if k > len(seg1):
        raise ValueError(
            "Window width k should be smaller or equal than segmentation lengths"
        )
    wd = 0
    # Slide a window of width k over both segmentations and compare
    # the number of boundaries each one places inside the window.
    for i in range(len(seg1) - k + 1):
        ndiff = abs(seg1[i : i + k].count(boundary) - seg2[i : i + k].count(boundary))
        if weighted:
            wd += ndiff
        else:
            wd += min(1, ndiff)
    # Normalize by the number of windows.
    return wd / (len(seg1) - k + 1.0)


# Generalized Hamming Distance


def _init_mat(nrows, ncols, ins_cost, del_cost):
    # First row: cost of inserting 0..ncols-1 boundaries.
    # First column: cost of deleting 0..nrows-1 boundaries.
    mat = np.empty((nrows, ncols))
    mat[0, :] = ins_cost * np.arange(ncols)
    mat[:, 0] = del_cost * np.arange(nrows)
    return mat


def _ghd_aux(mat, rowv, colv, ins_cost, del_cost, shift_cost_coeff):
    for i, rowi in enumerate(rowv):
        for j, colj in enumerate(colv):
            shift_cost = shift_cost_coeff * abs(rowi - colj) + mat[i, j]
            if rowi == colj:
                # Boundaries are at the same location, no transformation required.
                tcost = mat[i, j]
            elif rowi > colj:
                # Boundary match through a deletion.
                tcost = del_cost + mat[i, j + 1]
            else:
                # Boundary match through an insertion.
                tcost = ins_cost + mat[i + 1, j]
            mat[i + 1, j + 1] = min(tcost, shift_cost)


def ghd(ref, hyp, ins_cost=2.0, del_cost=2.0, shift_cost_coeff=1.0, boundary="1"):
    """
    Compute the Generalized Hamming Distance for a reference and a
    hypothetical segmentation.

        >>> # Same examples as Kulyukin C++ implementation
        >>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5)
        0.5
        >>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5)
        2.0
        >>> ghd('011', '110', 1.0, 1.0, 0.5)
        1.0
        >>> ghd('1', '0', 1.0, 1.0, 0.5)
        1.0
        >>> ghd('111', '000', 1.0, 1.0, 0.5)
        3.0
        >>> ghd('000', '111', 1.0, 2.0, 0.5)
        6.0

    :param ref: the reference segmentation
    :type ref: str or list
    :param hyp: the hypothetical segmentation
    :type hyp: str or list
    :param ins_cost: insertion cost
    :type ins_cost: float
    :param del_cost: deletion cost
    :type del_cost: float
    :param shift_cost_coeff: constant used to compute the cost of a shift.
        ``shift cost = shift_cost_coeff * |i - j|`` where ``i`` and ``j``
        are the positions indicating the shift
    :type shift_cost_coeff: float
    :param boundary: boundary value
    :type boundary: str or int or bool
    :rtype: float
    """

    # Positions of the boundaries in each segmentation.
    ref_idx = [i for (i, val) in enumerate(ref) if val == boundary]
    hyp_idx = [i for (i, val) in enumerate(hyp) if val == boundary]

    nref_bound = len(ref_idx)
    nhyp_bound = len(hyp_idx)

    # Trivial cases: at least one segmentation has no boundary at all.
    if nref_bound == 0 and nhyp_bound == 0:
        return 0.0
    elif nref_bound > 0 and nhyp_bound == 0:
        return nref_bound * ins_cost
    elif nref_bound == 0 and nhyp_bound > 0:
        return nhyp_bound * del_cost

    mat = _init_mat(nhyp_bound + 1, nref_bound + 1, ins_cost, del_cost)
    _ghd_aux(mat, hyp_idx, ref_idx, ins_cost, del_cost, shift_cost_coeff)
    return float(mat[-1, -1])


# Beeferman's Pk text segmentation evaluation metric


def pk(ref, hyp, k=None, boundary="1"):
    """
    Compute the Pk metric for a pair of segmentations.  A segmentation
    is any sequence over a vocabulary of two items (e.g. "0", "1"),
    where the specified boundary value is used to mark the edge of a
    segmentation.

    >>> '%.2f' % pk('0100'*100, '1'*400, 2)
    '0.50'
    >>> '%.2f' % pk('0100'*100, '0'*400, 2)
    '0.50'
    >>> '%.2f' % pk('0100'*100, '0100'*100, 2)
    '0.00'

    :param ref: the reference segmentation
    :type ref: str or list
    :param hyp: the segmentation to evaluate
    :type hyp: str or list
    :param k: window size, if None, set to half of the average reference segment length
    :type k: int
    :param boundary: boundary value
    :type boundary: str or int or bool
    :rtype: float
    """

    if k is None:
        k = int(round(len(ref) / (ref.count(boundary) * 2.0)))

    err = 0
    # Count the windows in which exactly one of the two segmentations
    # contains a boundary.
    for i in range(len(ref) - k + 1):
        r = ref[i : i + k].count(boundary) > 0
        h = hyp[i : i + k].count(boundary) > 0
        if r != h:
            err += 1
    return err / (len(ref) - k + 1.0)
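

# ---------------------------------------------------------------------
# Illustrative sketch, not part of the original module.
#
# The docstrings above describe a segmentation as a sequence over two
# symbols in which the boundary value marks the edge of a segment, and
# pk() derives a default window size from the reference when k is None.
# The two helpers below are hypothetical names introduced only to make
# those two ideas concrete.  They assume that a boundary is written on
# the last unit of every segment except the final one; the module
# itself does not state this convention.
# ---------------------------------------------------------------------


def lengths_to_segmentation(lengths, boundary="1", other="0"):
    """Hypothetical helper: encode positive segment lengths as a boundary string."""
    parts = []
    # Every segment but the last ends with a boundary symbol.
    for n in lengths[:-1]:
        parts.append(other * (n - 1) + boundary)
    # The final segment is left open (no trailing boundary).
    parts.append(other * lengths[-1])
    return "".join(parts)


def default_pk_window(ref, boundary="1"):
    """Hypothetical helper: the window size pk() would pick when k is None.

    Mirrors the formula used in pk(); raises ZeroDivisionError when the
    reference contains no boundary.
    """
    return int(round(len(ref) / (ref.count(boundary) * 2.0)))


# Usage (assumed inputs): segments of 4, 7 and 1 units give the string
# used as s1 in the windowdiff doctest.
#     lengths_to_segmentation([4, 7, 1])   # "000100000010"
#     default_pk_window("0100" * 100)      # 2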