`L iK dZddlZddlZddlmZddlmZddlm Z ddl m Z m Z m Z ddlmZddlmZdd lmZdd lmZmZdd lmZmZdd lmZdd lmZddlmZm Z ddl!m"Z"m#Z#edd\Z$Z%ee$e%d\Z$Z%ejMe$Z$gdZ'dhejPDchc] \}}|d c}}zZ)dQdZ*ejVjYdedZ-dZ.ejVjYdge#e"dZ/dZ0ejVjYde'ejVjYd ed!Z1d"Z2ejVjYd#d$d%Z3d&Z4d'Z5d(Z6d)Z7ejVjYd*d+d,gd-Z8ejVjYd.e#d/Z9ejVjYd0e'd1Z:d2Z;d3ZZCd?ZDejVjYd@dAdBgejVjYdCddDgdEZEdFZFejVjYdGdHdIgdJZGejVjYdKdLdMgdNZHejVjYdOd+d,gdPZIycc}}w)RzF Tests for HDBSCAN clustering algorithm Based on the DBSCAN test code N)stats)distance)HDBSCAN)CONDENSED_dtype_condense_tree _do_labelling)_OUTLIER_ENCODING) make_blobs)fowlkes_mallows_score)_VALID_METRICSeuclidean_distances)BallTreeKDTree)StandardScaler)shuffle)assert_allcloseassert_array_equal)CSC_CONTAINERSCSR_CONTAINERS ) n_samples random_state)r)kd_tree ball_treebruteautolabelcrtt|tz }|dk(sJt|t|kDsJy)N)lenset OUTLIER_SETr y)labels threshold n_clusterss h/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/sklearn/cluster/tests/test_hdbscan.pycheck_label_qualityr+)s6S[;./J ??  +i 77 7 outlier_typectjtjd|}ddd|}t|d}t|d}tj }|dg|d<||g|d<t j|}|j|k(j\}t|ddg||j|j\}t|ddgttddttd d z} t j|| } t| j|j| y ) O Tests if np.inf and np.nan data are each treated as special outliers. )infinitemissingc ||k(SNxr&s r*z#test_outlier_data..9s ar,c,tj|Sr3)npisnanr5s r*r7z#test_outlier_data..:s r,r probrrN)r9infnanr Xcopyrfitlabels_nonzerorprobabilities_listrange) r-outlier prob_checkr r; X_outliermodelmissing_labels_idxmissing_probs_idx clean_indices clean_models r*test_outlier_datarQ/s8 FF66G (+J l +G 4E \ *6 2DIQ)Aq62&u';';TBKKM(1a&1q!%U1c](;;M)-- - 89K{**EMM-,HIr,ctt}|j}tddj |}t ||t |d}tjt|5tddj tdddd}d|d <d |d <tjt|5td j |dddy#1swYUxYw#1swYyxYw) zy Tests that HDBSCAN works with precomputed distance matrices, and throws the appropriate errors when needed. precomputedT)metricrBz*The precomputed distance matrix.*has shapematchNz'The precomputed distance matrix.*valuesr)rr<r<)r<rrT) r rArBr fit_predictrr+pytestraises ValueError)D D_originalr'msgs r*test_hdbscan_distance_matrixr_Os AAJ M 5 A A! DFAz" 7C z -@}40<E?3E<?Fcztjt}|jd}t |dy)z Tests that HDBSCAN can generate a sufficiently accurate dbscan clustering. This test is more of a sanity check than a rigorous evaluation. 333333?gq= ףp?)r(N)rrCrAdbscan_clusteringr+) clustererr's r*test_dbscan_clusteringrs0  a I  ( ( -F$/r, cut_distance)皙??r<ctdd}tdd}tj}tjdg|d<dtj g|d<tjtj g|d<t j|}|j|}tj||k(}t|ddgtj||k(}t|dgtttd t||zz }t j||} | j|} t| ||y ) r/r1r r0r<rrsr=)rrN)r rArBr9r?r@rrCr flatnonzerorrGr$rH) r missing_labelinfinite_labelrKrLr'rMinfinite_labels_idx clean_idxrP clean_labelss r*#test_dbscan_clustering_outlier_datars& &i09M&z27;NIFFA;IaLrvv;IaLFFBFF#IaL IMM) $E  $ $, $ ?F-(?@)Aq62..>)AB*QC0Ss_s+=@S+S'TTUI)-- ) 45K00l0KL|VI%67r,ctddtjtjdij t}t |y)z4 Tests that HDBSCAN using `BallTree` works. rvrqr<)rTryN)rr9r|rAr{rXr+rls r*!test_hdbscan_best_balltree_metricrs?C1D+Ek!n r,ctttdz jt}t |j t sJy)z Tests that HDBSCAN correctly does not generate a valid cluster when the `min_cluster_size` is too large for the data. r<min_cluster_sizeN)rr#rArXr$issubsetr%rls r*test_hdbscan_no_clustersrs9 c!fqj 1 = =a @F v;   ,, ,r,c,tdttdD]r}t|j t}|Dcgc] }|dk7s | }}t|dk7sFt j t j||k\rrJycc}w)zb Test that the smallest non-noise cluster has at least `min_cluster_size` many points rsr<rrrN)rHr#rArrXr9minbincount)rr'r true_labelss r*test_hdbscan_min_cluster_sizers "!SVQ/H*:;GGJ*0@ERKu@ @ { q 66"++k237GG GG H@s  B Bcxtj}t|jt}t |y)zA Tests that HDBSCAN works when passed a callable metric. rWN)r euclideanrrXrAr+)rTr's r*test_hdbscan_callable_metricrs,  F F # / / 2Fr,treerrctd|}d}tjt|5|j t dddy#1swYyxYw)z Tests that HDBSCAN correctly raises an error when passing precomputed data while requesting a tree-based algorithm. rSrTrpz%precomputed is not a valid metric forrUN)rrYrZr[rCrA)rrr^s r*"test_hdbscan_precomputed_non_brutersC $ 7C 1C z -  s A  A csr_containerc0tjtj}t ||t}|j }tj|j}t ||tjdftjdffD]\}}tj }||d<tj|j}t ||dt|dk(sJ|j }||d<tj|j}t ||d}tjt|5tdd j|d d d y #1swYy xYw) z Tests that HDBSCAN works correctly when passing sparse feature data. Evaluates correctness by comparing against the same data passed as a dense array. r0r1rrrr z4Sparse data matrices only support algorithm `brute`.rUrrrN)rrCrArDr+rBrr9r?r@r rYrZr[) r dense_labels _X_sparseX_sparse sparse_labels outlier_valr-X_denser^s r*test_hdbscan_sparser sM9==#++L %a I~~HIMM(+33M|]3(*vvz&:RVVY>#$ h/77 <7 8 AC z -I{k:>>xHIIIs &F  Frpcddg}tdd|d\}}tdj|}t||j|j D]$\}}}t ||d d t ||d d &t|dtjd jt}|jjddk(sJ|j jddk(sJy )zj Tests that HDBSCAN centers are calculated and stored properly, and are accurate to the data. )rcrc)@rirr)rrcenters cluster_stdboth) store_centersr<g?)rtolatol)rprrN) r rrCzip centroids_medoids_rrAr{)rprH_rcentercentroidmedoids r*test_hdbscan_centersr-s :&G 1gSV WDAq  ' + +A .C$'$N; &qt<QT:; 6AGGAJ  c!f >>   "a '' ' <<  a A %% %r,ctjjd}|jdd}t ddddj |}tj |d \}}t|dk(sJ||d k(d kDsJt dd ddd j |}tj |d \}}t|dk(sJ||d k(dk(sJy)zS Tests that HDBSCAN single-cluster selection with epsilon works correctly. rrsr=rceomT)rcluster_selection_epsiloncluster_selection_methodallow_single_cluster) return_countsrg ףp= ?r)rrrrrpN)r9random RandomStaterandrrXuniquer#)rng no_structurer' unique_labelscountss r*.test_hdbscan_allow_single_cluster_with_epsilonrCs ))   "C88C#L "%!&!  k,  IIfDAM6 }  "" " -2% & ++ +"&!&!  k,  IIfDAM6 }  "" " -2% &! ++ +r,cddgddgddgddgg}td|gdd\}}tj|j}t t |t d |vz }|d k(sJt||d kDy ) z Validate that HDBSCAN can properly cluster this difficult synthetic dataset. Note that DBSCAN fails on this (see HDBSCAN plotting example) g333333g333333?r"i)皙?gffffff?皙?rr)rrrrrGz?N)r rrCrDr#r$intr )rrAr&r'r)s r*test_hdbscan_better_than_dbscanrds u~t}q!fq"g>G +  DAq Y]]1  % %FS[!Cf $55J ??&!$t+r,z kwargs, XrSr<rsr"rc<tdddi|j|y)zo Tests that HDBSCAN works correctly for array-likes and precomputed inputs with non-finite points. min_samplesr<Nr4)rrC)rAkwargss r*test_hdbscan_usable_inputsrxs $$V$((+r,c|tjd}d}tjt|5t dj |dddy#1swYyxYw)zd Tests that HDBSCAN raises the correct error when there are too few non-zero distances. )rrz#There exists points with fewer thanrUrSrWN)r9zerosrYrZr[rrCrrAr^s r*-test_hdbscan_sparse_distances_too_few_nonzerorsR bhhx()A /C z --}%))!,---s AA'c"tjd}d|ddddf<d|ddddf<||jz}||}d}tjt |5t d j|dddy#1swYyxYw) zu Tests that HDBSCAN raises the correct error when the distance matrix has multiple connected components. )rr<Nr=z3HDBSCAN cannot be performed on a disconnected graphrUrSrW)r9rTrYrZr[rrCrs r*0test_hdbscan_sparse_distances_disconnected_graphrs AAbqb"1"fIAab"#gJ ACCAaA ?C z --}%))!,---s BBcd}d}tjt|5td|j t dddtjt|5td|j t dddt ttjttjz }t|dkDrHtjt|5td|dj t dddyy#1swYxYw#1swYxYw#1swYyxYw) zR Tests that HDBSCAN correctly raises an error for invalid metric choices. c|Sr3r4)r6s r*r7z2test_hdbscan_tree_invalid_metric..sr,zV.* is not a valid metric for a .*-based algorithm\. Please select a different metric\.rUr)rprTNrr) rYrZr[rrCrArGr$rr~rr#)metric_callabler^metrics_not_kds r* test_hdbscan_tree_invalid_metricrs "O  z -D)O<@@CD z -F+o>BB1EF #h445FQ ]]:S 1 J iq0A B F Fq I J JDDFF J Js#!D!%!D-3$D9!D*-D69Ectttdz}d}tjt |5|j tdddy#1swYyxYw)zx Tests that HDBSCAN correctly raises an error when setting `min_samples` larger than the number of samples. r<)rz min_samples (.*) must be at mostrUN)rr#rArYrZr[rC)rr^s r*!test_hdbscan_too_many_min_samplesrsI c!fqj )C -C z -  s AA"ctj}tj|d<d}t d}t j t|5|j|dddy#1swYyxYw)zu Tests that HDBSCAN correctly raises an error when providing precomputed distances with `np.nan` values. rz(np.nan values found in precomputed-denserSrWrUN) rArBr9r@rrYrZr[rC)X_nanr^rs r*"test_hdbscan_precomputed_dense_nanrsY FFHE&&E$K 4C  'C z - s A,,A5rTFepsilonrcPd}t||ddgddgddgg\}}tj|}t|j|j }|dz|dz|dzh}|dzd|dzd |dzdi} t ||| || } tt|D cic]!} | tj|| k(dd#} } tt|D cic] } | | | |  } } tj| j|}t| |y cc} wcc} w) zR Tests that the `_do_labelling` helper function correctly assigns labels. 0rr)rrrrsr"rr<condensed_treeclusterscluster_label_maprrN)r rrCr_single_linkage_tree_rrrGr$r9where vectorizer}r)global_random_seedrrrrAr&estrrrr'_yfirst_with_label y_to_labelsaligned_targets r*test_labelling_distinctrsJ I 'F G G DAq )-- C# !!C4H4HNA y1}i!mB3q6lK2v.r233KKK2R\\+//215Nv~.LKs &DD#cLd}d}tjdd|dfddd|dfddgt }t||h|d|dzdid d }|d dk}t |t |d k(k(sJt||h|d|dzdid d }|d |k}t |t |d k(k(sJy)z Tests that the `_do_labelling` helper function correctly thresholds the incoming lambda values given various `cluster_selection_epsilon` values. r=g?rsr<)r=r<rr<r)r=r"rr<)r=rrr<)dtypeTrvaluerN)r9arrayrrsum)r MAX_LAMBDArr' num_noises r*test_labelling_thresholdingr s IJXX :q !  :q !     N%$aQ:!"# Fw'!+I y>S2. .. . %$aQ:!"# Fw'*4I y>S2. .. .r,rrrctjjd}|jd}t|}d}t j t |5td|j|dddy#1swYyxYw)zCheck that we raise an error if the centers are requested together with a precomputed input matrix. Non-regression test for: https://github.com/scikit-learn/scikit-learn/issues/27893 r)drsz>Cannot store centers when using a precomputed distance matrix.rUrS)rTrN) r9rrr rYrZr[rrC)rrrAX_disterr_msgs r*0test_hdbscan_error_precomputed_and_store_centersr%sq ))   "C 8A  #FNG z 1O}MBFFvNOOOs A??B valid_algorrcDtd|jty)zTest that HDBSCAN works with the "cosine" metric when the algorithm is set to "brute" or "auto". Non-regression test for issue #28631 cosinerN)rrXrA)rs r**test_hdbscan_cosine_metric_valid_algorithmr5s 8z2>>qAr, invalid_algoctd|}tjtd5|j t dddy#1swYyxYw)zTest that HDBSCAN raises an informative error is raised when an unsupported algorithm is used with the "cosine" metric. rrzcosine is not a valid metricrUN)rrYrZr[rXrA)rhdbscans r*,test_hdbscan_cosine_metric_invalid_algorithmr?sB X>G z)G HAs AA)r)J__doc__numpyr9rYscipyr scipy.spatialrsklearn.clusterrsklearn.cluster._hdbscan._treerrr sklearn.cluster._hdbscan.hdbscanr sklearn.datasetsr sklearn.metricsr sklearn.metrics.pairwiser r sklearn.neighborsrrsklearn.preprocessingr sklearn.utilsrsklearn.utils._testingrrsklearn.utils.fixesrrrAr& fit_transform ALGORITHMSitemsr%r+mark parametrizerQr_rjrmrrrrrrrrrrrrrr?rrrrrrrr rrr)routs00r*r-s  "# ?'1H.0!F>Cb11q!!$1""1% d1H1B1H1H1JKvq#c'lKK 8 ):;J<J>50-/Q/Q./QR S "  ,>2$3-$N 078884 - H )[!9:;.9I:IDj1&2&*,B,( M "HBHHq"&&kBFFA;-G$HI M "aVaV$45 q!fq!f ,,.9 -: -.9 -: - J0 /$?QH-!/.@!/H&/R:x*@A OB O'89B:B)[)ABCuLs0M