JL iCdZddlZddlZdZdZ ddZddZdZddZd Z d Z d Z d Z d Z dZdZdZddZdZedk(reyy)z Distance Metrics. Compute the distance between two items (usually strings). As metrics, they must satisfy the following three requirements: 1. d(a, a) = 0 2. d(a, b) >= 0 3. d(a, c) <= d(a, b) + d(b, c) Ncg}t|D]}|jdg|zt|D] }|||d< t|D] }||d|< |SNr)rangeappend)len1len2levijs [/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/nltk/metrics/distance.py_edit_dist_initr sq C 4[ A3: 4[Aq  4[Aq  Jc.|Dcic]}|dc}Scc}wr)sigmacs r _last_left_t_initr%s QAqD  s c ||dz } ||dz } ||dz |dz} |||dz dz} ||dz |dz | | k7r|ndz} | dz}|r'|dkDr"|dkDr||dz |dz |z|z |z|z dz }t| | | ||||<y)Nr)min)r r r s1s2 last_left last_rightsubstitution_costtranspositionsc1c2abrds r _edit_dist_stepr")s AEB AEB AE 1 A Aq1u A AE 1q5"(.BA AA)a-JN  A zA~ . 2Y > BZ ORS SAq!QCF1Irc t|}t|}t|dz|dz}t}|j||j|t |}t d|dzD]W} d} t d|dzD]6} ||| dz } | } || dz || dz k(r| } t || | ||| | || 8| ||| dz <Y|||S)a Calculate the Levenshtein edit-distance between two strings. The edit distance is the number of characters that need to be substituted, inserted, or deleted, to transform s1 into s2. For example, transforming "rain" to "shine" requires three steps, consisting of two substitutions and one insertion: "rain" -> "sain" -> "shin" -> "shine". These operations could have been done in other orders, but at least three steps are needed. Allows specifying the cost of substitution edits (e.g., "a" -> "b"), because sometimes it makes sense to assign greater penalties to substitutions. This also optionally allows transposition edits (e.g., "ab" -> "ba"), though this is disabled by default. :param s1, s2: The strings to be analysed :param transpositions: Whether to allow transposition edits :type s1: str :type s2: str :type substitution_cost: int :type transpositions: bool :rtype: int rrrr)lenr setupdaterrr")rrrrrrr r last_left_tr last_right_bufr rrs r edit_distancer*?s 4 r7D r7D $(D1H -C EE LL LL$E*K 1dQh #q$(# A#Bq1uI.I'J!a%yBq1uI%!" 
"3-    "# Bq1uI%#& t9T?rcHtdz tddz }}||fg}||fdk7rc|dz |dz f|dz |f||dz fg}fd|D}t|tjd\}\}}|j ||f||fdk7rct t |S)Nrr)rrc3fK|](\}}|dk\r |dk\r||n td||ff*yw)rinfN)float).0r r r s r z'_edit_dist_backtrace..s@ 16a1fSVAY5>> from nltk.metrics import binary_distance >>> binary_distance(1,1) 0.0 >>> binary_distance(1,3) 1.0 ?rlabel1label2s r binary_distancerCsF"3++rct|j|t|j|z t|j|z S)z)Distance metric comparing set-similarity.)r%union intersectionr@s r jaccard_distancerGsF  V$ %F,?,?,G(H HC VM rct|j|}t|j|}t|}t|}||k(r||k(rd}n|t||k(rd}n |dkDrd}nd}d||z |zz S)aEDistance metric that takes into account partial agreement when multiple labels are assigned. >>> from nltk.metrics import masi_distance >>> masi_distance(set([1, 2]), set([1, 2, 3, 4])) 0.665 Passonneau 2006, Measuring Agreement on Set-Valued Items (MASI) for Semantic and Pragmatic Annotation. rgq= ףp?rgQ?)r%rFrEr)rArBlen_intersection len_union len_label1 len_label2ms r masi_distancerNs6..v67FLL()IVJVJZJ2B$B  SZ8 8  A    )+a/ //rcF t||z dS#tdYyxYw)zKrippendorff's interval distance metric >>> from nltk.metrics import interval_distance >>> interval_distance(1,10) 81 Krippendorff 1980, Content Analysis: An Introduction to its Methodology z7non-numeric labels not supported with interval distanceN)powprintr@s r interval_distancerSs*I6F?A&&I GHs cfdS)z7Higher-order function to test presence of a given labelcd|v|vk(zS)Nr?rxylabels r zpresence..s  ;<rrrYs`r presencer\ s  =)absr%rVs r rZz%fractional_presence..sS#A,3Q<89UaZ=VEUVJW< %q.3U!^ 4< sSV|  =uA~ >< #a&LU!^: ;rrr[s`r fractional_presencer`s  <rc it|5}|D]V}|jjd\}}}t|g}t|g}t |t||g<X dddfdS#1swYxYw)N c$t||gS)N) frozenset)rWrXdatas r rZz!custom_distance..$sY1v./r)openstripsplitrdr.)fileinfilellabelAlabelBdistres @r custom_distanceros D d2 A!u1~!9"41   #  #  NNIy) 1 a5BqE> a N !| & F"#^q00G;<  rcd||zcxkrdks ntjtdt||}d}t ||D]\}}||k(r|dz }nn ||k(sn|||zd|z zzS)a The Jaro Winkler distance is an extension of the Jaro similarity in: William E. Winkler. 1990. 
String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. Proceedings of the Section on Survey Research Methods. American Statistical Association: 354-359. such that: jaro_winkler_sim = jaro_sim + ( l * p * (1 - jaro_sim) ) where, - jaro_sim is the output from the Jaro Similarity, see jaro_similarity() - l is the length of common prefix at the start of the string - this implementation provides an upperbound for the l value to keep the prefixes.A common value of this upperbound is 4. - p is the constant scaling factor to overweigh common prefixes. The Jaro-Winkler similarity will fall within the [0, 1] bound, given that max(p)<=0.25 , default is p=0.1 in Winkler (1990) Test using outputs from https://www.census.gov/srd/papers/pdf/rr93-8.pdf from "Table 5 Comparison of String Comparators Rescaled between 0 and 1" >>> winkler_examples = [("billy", "billy"), ("billy", "bill"), ("billy", "blily"), ... ("massie", "massey"), ("yvette", "yevett"), ("billy", "bolly"), ("dwayne", "duane"), ... ("dixon", "dickson"), ("billy", "susan")] >>> winkler_scores = [1.000, 0.967, 0.947, 0.944, 0.911, 0.893, 0.858, 0.853, 0.000] >>> jaro_scores = [1.000, 0.933, 0.933, 0.889, 0.889, 0.867, 0.822, 0.790, 0.000] One way to match the values on the Winkler's paper is to provide a different p scaling factor for different pairs of strings, e.g. >>> p_factors = [0.1, 0.125, 0.20, 0.125, 0.20, 0.20, 0.20, 0.15, 0.1] >>> for (s1, s2), jscore, wscore, p in zip(winkler_examples, jaro_scores, winkler_scores, p_factors): ... assert round(jaro_similarity(s1, s2), 3) == jscore ... assert round(jaro_winkler_similarity(s1, s2, p=p), 3) == wscore Test using outputs from https://www.census.gov/srd/papers/pdf/rr94-5.pdf from "Table 2.1. Comparison of String Comparators Using Last Names, First Names, and Street Names" >>> winkler_examples = [('SHACKLEFORD', 'SHACKELFORD'), ('DUNNINGHAM', 'CUNNIGHAM'), ... 
('NICHLESON', 'NICHULSON'), ('JONES', 'JOHNSON'), ('MASSEY', 'MASSIE'), ... ('ABROMS', 'ABRAMS'), ('HARDIN', 'MARTINEZ'), ('ITMAN', 'SMITH'), ... ('JERALDINE', 'GERALDINE'), ('MARHTA', 'MARTHA'), ('MICHELLE', 'MICHAEL'), ... ('JULIES', 'JULIUS'), ('TANYA', 'TONYA'), ('DWAYNE', 'DUANE'), ('SEAN', 'SUSAN'), ... ('JON', 'JOHN'), ('JON', 'JAN'), ('BROOKHAVEN', 'BRROKHAVEN'), ... ('BROOK HALLOW', 'BROOK HLLW'), ('DECATUR', 'DECATIR'), ('FITZRUREITER', 'FITZENREITER'), ... ('HIGBEE', 'HIGHEE'), ('HIGBEE', 'HIGVEE'), ('LACURA', 'LOCURA'), ('IOWA', 'IONA'), ('1ST', 'IST')] >>> jaro_scores = [0.970, 0.896, 0.926, 0.790, 0.889, 0.889, 0.722, 0.467, 0.926, ... 0.944, 0.869, 0.889, 0.867, 0.822, 0.783, 0.917, 0.000, 0.933, 0.944, 0.905, ... 0.856, 0.889, 0.889, 0.889, 0.833, 0.000] >>> winkler_scores = [0.982, 0.896, 0.956, 0.832, 0.944, 0.922, 0.722, 0.467, 0.926, ... 0.961, 0.921, 0.933, 0.880, 0.858, 0.805, 0.933, 0.000, 0.947, 0.967, 0.943, ... 0.913, 0.922, 0.922, 0.900, 0.867, 0.000] One way to match the values on the Winkler's paper is to provide a different p scaling factor for different pairs of strings, e.g. >>> p_factors = [0.1, 0.1, 0.1, 0.1, 0.125, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.20, ... 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1] >>> for (s1, s2), jscore, wscore, p in zip(winkler_examples, jaro_scores, winkler_scores, p_factors): ... if (s1, s2) in [('JON', 'JAN'), ('1ST', 'IST')]: ... continue # Skip bad examples from the paper. ... assert round(jaro_similarity(s1, s2), 3) == jscore ... assert round(jaro_winkler_similarity(s1, s2, p=p), 3) == wscore This test-case proves that the output of Jaro-Winkler similarity depends on the product l * p and not on the product max_l * p. 
Here the product max_l * p > 1 however the product l * p <= 1 >>> round(jaro_winkler_similarity('TANYA', 'TONYA', p=0.1, max_l=100), 3) 0.88 rrzkThe product `max_l * p` might not fall between [0,1].Jaro-Winkler similarity might not be between 0 and 1.)warningswarnstrr|rs)rrpmax_ljaro_simrks1_is2_is r jaro_winkler_similarityrdst  Q  H  r2&H A"bk d 4< FA  :   q1uH - ..rc gd}|D]\}}td|d|dt||td|d|dt||dtd|d|dt||td |d|dt||td |d|dd t||z hd }hd }td|td|tdt ||tdt ||tdt ||y)N))rainshine)abcdefacbdef)languagelnaguaeg)rlnaugage)rlngauagezEdit distance btwn 'z' and 'z':z$Edit dist with transpositions btwn 'T)rzJaro similarity btwn 'zJaro-Winkler similarity btwn 'zJaro-Winkler distance btwn 'r>rrP>rrzs1:zs2:zBinary distance:zJaccard distance:zMASI distance:)rRr*r|rrCrGrN)string_distance_examplesrrs r demors0 + B $RDt26 b"8MN 22$gbT D "b 6  &rd'"R8/"b:QR ,RDt2 > #B +  *2$gbT < 'B/ /   B B % % ob"56 /B78 M"b12r__main__)rF)r)g?r)__doc__r2r~r rr"r*r:r<rCrGrNrSr\r`ror|rr__name__rrr rs ! SX ,<~%*3l , 08I"= 0: zt/n3@ zFr
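The edit-distance docstring above describes counting the substitutions, insertions, and deletions needed to turn one string into another, with a configurable substitution cost. A minimal sketch of that idea follows. It is not the module's own implementation: the function name `levenshtein_sketch` is hypothetical, it keeps only two rows of the table, and it omits the optional transposition edits.

```python
def levenshtein_sketch(s1, s2, substitution_cost=1):
    """Illustrative Levenshtein distance; name and layout are assumptions."""
    # previous[j] is the cost of transforming the processed prefix of s1 into s2[:j].
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        # Transforming s1[:i] into the empty string takes i deletions.
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            # Matching characters cost nothing; mismatches cost substitution_cost.
            substitution = previous[j - 1] + (0 if c1 == c2 else substitution_cost)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # The docstring's example: "rain" -> "sain" -> "shin" -> "shine".
    print(levenshtein_sketch("rain", "shine"))
```

Raising `substitution_cost` to 2 makes a substitution as expensive as a deletion plus an insertion, which is one reason the docstring mentions assigning greater penalties to substitutions.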