L ixdZddlZddlZddlmZddlmZmZm Z m Z m Z m Z ddl mZmZmZddlmZddlmZdd lmZd Zgd ZGd d eZe d"dZe ddddd#dZd#dZd$dZe dddedd%dddZ dZ!dZ"dZ#e"e!e#dZ$dZ%dZ&e%e&d Z'e ddded d&ddd!Z(y)'a K-means clustering and vector quantization (:mod:`scipy.cluster.vq`) ==================================================================== Provides routines for k-means clustering, generating code books from k-means models and quantizing vectors by comparing them with centroids in a code book. .. autosummary:: :toctree: generated/ whiten -- Normalize a group of observations so each feature has unit variance vq -- Calculate code book membership of a set of observation vectors kmeans -- Perform k-means on a set of observation vectors forming k clusters kmeans2 -- A different implementation of k-means with more methods -- for initializing centroids Background information ---------------------- The k-means algorithm takes as input the number of clusters to generate, k, and a set of observation vectors to cluster. It returns a set of centroids, one for each of the k clusters. An observation vector is classified with the cluster number or centroid index of the centroid closest to it. A vector v belongs to cluster i if it is closer to centroid i than any other centroid. If v belongs to i, we say centroid i is the dominating centroid of v. The k-means algorithm tries to minimize distortion, which is defined as the sum of the squared distances between each observation vector and its dominating centroid. The minimization is achieved by iteratively reclassifying the observations into clusters and recalculating the centroids until a configuration is reached in which the centroids are stable. One can also define a maximum number of iterations. Since vector quantization is a natural application for k-means, information theory terminology is often used. The centroid index or cluster index is also referred to as a "code" and the table mapping codes to centroids and, vice versa, is often referred to as a "code book". The result of k-means, a set of centroids, can be used to quantize vectors. Quantization aims to find an encoding of vectors that reduces the expected distortion. All routines expect obs to be an M by N array, where the rows are the observation vectors. The codebook is a k by N array, where the ith row is the centroid of code word i. The observation vectors and centroids have the same feature dimension. As an example, suppose we wish to compress a 24-bit color image (each pixel is represented by one byte for red, one for blue, and one for green) before sending it over the web. By using a smaller 8-bit encoding, we can reduce the amount of data by two thirds. Ideally, the colors for each of the 256 possible 8-bit encoding values should be chosen to minimize distortion of the color. Running k-means with k=256 generates a code book of 256 codes, which fills up all possible 8-bit sequences. Instead of sending a 3-byte value for each pixel, the 8-bit centroid index (or code word) of the dominating centroid is transmitted. The code book is also sent over the wire so each 8-bit code can be translated back to a 24-bit pixel value representation. If the image of interest was of an ocean, we would expect many 24-bit blues to be represented by 8-bit codes. If it was an image of a human face, more flesh-tone colors would be represented in the code book. N)deque)_asarrayarray_namespace is_lazy_arrayxp_capabilitiesxp_copyxp_size)check_random_state rng_integers_transition_to_rng)array_api_extra)cdist)_vqrestructuredtext)whitenvqkmeanskmeans2c eZdZy) ClusterErrorN)__name__ __module__ __qualname__V/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/scipy/cluster/vq.pyrrTsrrc2t|}| t| }t|||}|j|d}|dk(}t j ||j d}|r-|j|rtjdtd||z S)al Normalize a group of observations on a per feature basis. Before running k-means, it is beneficial to rescale each feature dimension of the observation set by its standard deviation (i.e. "whiten" it - as in "white noise" where each frequency has equal power). Each feature is divided by its standard deviation across all observations to give it unit variance. Parameters ---------- obs : ndarray Each row of the array is an observation. The columns are the features seen during each observation:: # f0 f1 f2 obs = [[ 1., 1., 1.], #o0 [ 2., 2., 2.], #o1 [ 3., 3., 3.], #o2 [ 4., 4., 4.]] #o3 check_finite : bool, optional Whether to check that the input matrices contain only finite numbers. Disabling may give a performance gain, but may result in problems (crashes, non-termination) if the inputs do contain infinities or NaNs. Default: True for eager backends and False for lazy ones. Returns ------- result : ndarray Contains the values in `obs` scaled by the standard deviation of each column. Examples -------- >>> import numpy as np >>> from scipy.cluster.vq import whiten >>> features = np.array([[1.9, 2.3, 1.7], ... [1.5, 2.5, 2.2], ... [0.8, 0.6, 1.7,]]) >>> whiten(features) array([[ 4.17944278, 2.69811351, 7.21248917], [ 3.29956009, 2.93273208, 9.33380951], [ 1.75976538, 0.7038557 , 7.21248917]]) ) check_finitexpraxis?zWSome columns have standard deviation zero. The values of these columns will not change. stacklevel) rrrstdxpxatsetanywarningswarnRuntimeWarning)obsrr std_dev zero_std_masks rrrXs`  B(-- 3\b 9CffSqf!GqLMffWm,005G}- E$ 4 =rTzuses spatial.distance.cdistF)cpu_onlyreasonjax_jitallow_dask_computect||}t|||}t|||}|j||}|j|dr|j ||d}|j ||d}t j |}t j |}tj||}|j |d|j |dfSt||dS) aH Assign codes from a code book to observations. Assigns a code from a code book to each observation. Each observation vector in the 'M' by 'N' `obs` array is compared with the centroids in the code book and assigned the code of the closest centroid. The features in `obs` should have unit variance, which can be achieved by passing them through the whiten function. The code book can be created with the k-means algorithm or a different encoding algorithm. Parameters ---------- obs : ndarray Each row of the 'M' x 'N' array is an observation. The columns are the "features" seen during each observation. The features must be whitened first using the whiten function or something equivalent. code_book : ndarray The code book is usually generated using the k-means algorithm. Each row of the array holds a different code, and the columns are the features of the code:: # f0 f1 f2 f3 code_book = [[ 1., 2., 3., 4.], #c0 [ 1., 2., 3., 4.], #c1 [ 1., 2., 3., 4.]] #c2 check_finite : bool, optional Whether to check that the input matrices contain only finite numbers. Disabling may give a performance gain, but may result in problems (crashes, non-termination) if the inputs do contain infinities or NaNs. Default: True Returns ------- code : ndarray A length M array holding the code book index for each observation. dist : ndarray The distortion (distance) between the observation and its nearest code. Examples -------- >>> import numpy as np >>> from scipy.cluster.vq import vq >>> code_book = np.array([[1., 1., 1.], ... [2., 2., 2.]]) >>> features = np.array([[1.9, 2.3, 1.7], ... [1.5, 2.5, 2.2], ... [0.8, 0.6, 1.7]]) >>> vq(features, code_book) (array([1, 1, 0], dtype=int32), array([0.43588989, 0.73484692, 0.83066239])) r rz real floating)kindF)copyrrr) rr result_typeisdtypeastypenpasarrayrrpy_vq)r/ code_bookrr ctc_obs c_code_bookresults rrrsv i (B 32L 9Cr EI Y 'B zz"?z+ #r .ii 2Ei:  5!jj- {+zz&)$bjj&;;; ie 44rct||}t|||}t|||}|j|jk7r td|jdk(r&|dd|jf}|dd|jf}|j t ||}|j|d}|j|d}||fS)a Python version of vq algorithm. The algorithm computes the Euclidean distance between each observation and every frame in the code_book. Parameters ---------- obs : ndarray Expects a rank 2 array. Each row is one observation. code_book : ndarray Code book to use. Same format than obs. Should have same number of features (e.g., columns) than obs. check_finite : bool, optional Whether to check that the input matrices contain only finite numbers. Disabling may give a performance gain, but may result in problems (crashes, non-termination) if the inputs do contain infinities or NaNs. Default: True Returns ------- code : ndarray code[i] gives the label of the ith obversation; its code is code_book[code[i]]. mind_dist : ndarray min_dist[i] gives the distance between the ith observation and its corresponding code. Notes ----- This function is slower than the C version but works for all input types. If the inputs have the wrong types for the C versions of the function, this one is called as a last resort. It is about 20 times slower than the C version. r7z3Observation and code_book should have the same rankrNr!) rrndim ValueErrornewaxisr?rargminmin)r/rArr distcodemin_dists rr@r@sJ i (B 32L 9Cr EI xx9>>!NOO xx1}!RZZ- am,  ::eC+ ,D 99T9 "Dvvdv#H >rc|tn|}|}|j}t|gd}tj|}||kDrt ||d\}} |j |j | dtj|}tj|||jd\}} || }|j|}|j|d|dz }||kDr||dfS) aZ "raw" version of k-means. Returns ------- code_book The lowest distortion codebook found. avg_dist The average distance a observation is from a code in the book. Lower means the code_book matches the data better. See Also -------- kmeans : wrapper around k-means Examples -------- Note: not whitened in this example. >>> import numpy as np >>> from scipy.cluster.vq import _kmeans >>> features = np.array([[ 1.9,2.3], ... [ 1.5,2.5], ... [ 0.8,0.6], ... [ 0.4,1.8], ... [ 1.0,1.0]]) >>> book = np.array((features[0],features[2])) >>> _kmeans(features,book) (array([[ 1.7 , 2.4 ], [ 0.73333333, 1.13333333]]), 0.40563916697728591) r$)maxlenFr:r!rr) r>infrr?rappendmeanrupdate_cluster_meansshapeabs) r/guessthreshr rAdiffprev_avg_distsnp_obsobs_codedistort has_memberss r_kmeansr`s@zrBI 66DD6!,N ZZ_F -sIEB'bgggBg78::h'!$!9!9&(:C//!:L"N ;k* JJy) vvnQ'.*;;< - nQ' ''r)r2r4r5seed)rngc*t|tr t|}n t||}t|||}t|||}|dkrt d|t |dk7r+t |dkrt d|t ||||St|}||k7r t d|dkrt d|dt|}|j} t|D],} t||||}t ||||\} } | | ks)| } | } . | fS) a Performs k-means on a set of observation vectors forming k clusters. The k-means algorithm adjusts the classification of the observations into clusters and updates the cluster centroids until the position of the centroids is stable over successive iterations. In this implementation of the algorithm, the stability of the centroids is determined by comparing the absolute value of the change in the average Euclidean distance between the observations and their corresponding centroids against a threshold. This yields a code book mapping centroids to codes and vice versa. Parameters ---------- obs : ndarray Each row of the M by N array is an observation vector. The columns are the features seen during each observation. The features must be whitened first with the `whiten` function. k_or_guess : int or ndarray The number of centroids to generate. A code is assigned to each centroid, which is also the row index of the centroid in the code_book matrix generated. The initial k centroids are chosen by randomly selecting observations from the observation matrix. Alternatively, passing a k by N array specifies the initial k centroids. iter : int, optional The number of times to run k-means, returning the codebook with the lowest distortion. This argument is ignored if initial centroids are specified with an array for the ``k_or_guess`` parameter. This parameter does not represent the number of iterations of the k-means algorithm. thresh : float, optional Terminates the k-means algorithm if the change in distortion since the last k-means iteration is less than or equal to threshold. check_finite : bool, optional Whether to check that the input matrices contain only finite numbers. Disabling may give a performance gain, but may result in problems (crashes, non-termination) if the inputs do contain infinities or NaNs. Default: True rng : `numpy.random.Generator`, optional Pseudorandom number generator state. When `rng` is None, a new `numpy.random.Generator` is created using entropy from the operating system. Types other than `numpy.random.Generator` are passed to `numpy.random.default_rng` to instantiate a ``Generator``. Returns ------- codebook : ndarray A k by N array of k centroids. The ith centroid codebook[i] is represented with the code i. The centroids and codes generated represent the lowest distortion seen, not necessarily the globally minimal distortion. Note that the number of centroids is not necessarily the same as the ``k_or_guess`` parameter, because centroids assigned to no observations are removed during iterations. distortion : float The mean (non-squared) Euclidean distance between the observations passed and the centroids generated. Note the difference to the standard definition of distortion in the context of the k-means algorithm, which is the sum of the squared distances. See Also -------- kmeans2 : a different implementation of k-means clustering with more methods for generating initial centroids but without using a distortion change threshold as a stopping criterion. whiten : must be called prior to passing an observation matrix to kmeans. Notes ----- For more functionalities or optimal performance, you can use `sklearn.cluster.KMeans `_. `This `_ is a benchmark result of several implementations. Examples -------- >>> import numpy as np >>> from scipy.cluster.vq import vq, kmeans, whiten >>> import matplotlib.pyplot as plt >>> features = np.array([[ 1.9,2.3], ... [ 1.5,2.5], ... [ 0.8,0.6], ... [ 0.4,1.8], ... [ 0.1,0.1], ... [ 0.2,1.8], ... [ 2.0,0.5], ... [ 0.3,1.5], ... [ 1.0,1.0]]) >>> whitened = whiten(features) >>> book = np.array((whitened[0],whitened[2])) >>> kmeans(whitened,book) (array([[ 2.3110306 , 2.86287398], # random [ 0.93218041, 1.24398691]]), 0.85684700941625547) >>> codes = 3 >>> kmeans(whitened,codes) (array([[ 2.3110306 , 2.86287398], # random [ 1.32544402, 0.65607529], [ 0.40782893, 2.02786907]]), 0.5196582527686241) >>> # Create 50 datapoints in two clusters a and b >>> pts = 50 >>> rng = np.random.default_rng() >>> a = rng.multivariate_normal([0, 0], [[4, 1], [1, 4]], size=pts) >>> b = rng.multivariate_normal([30, 10], ... [[10, 2], [2, 1]], ... size=pts) >>> features = np.concatenate((a, b)) >>> # Whiten data >>> whitened = whiten(features) >>> # Find 2 clusters in the data >>> codebook, distortion = kmeans(whitened, 2) >>> # Plot whitened data and cluster centers in red >>> plt.scatter(whitened[:, 0], whitened[:, 1]) >>> plt.scatter(codebook[:, 0], codebook[:, 1], c='r') >>> plt.show() r7rziter must be at least 1, got z'Asked for 0 clusters. Initial book was )rYr z1If k_or_guess is a scalar, it must be an integer.z Asked for z clusters.) isinstanceintrrrHr r`r rRrange_kpoints)r/ k_or_guessiterrYrrbr rXk best_distibookrL best_books rrrLs=H*c" S ! S* - 32L 9C ZB\ BE ax8?@@u~ 5>A FugNO OsE&R88 E AEzLMM1u:aS 344 S !CI 4[ab)S%2> d ) II  i rc|j|jdt|d}|j||jdgj}|j ||dS)aPick k points at random in data (one row = one observation). Parameters ---------- data : ndarray Expect a rank 1 or 2 array. Rank 1 are assumed to describe one dimensional data, rank 2 multidimensional data, in which case one row is one observation. k : int Number of samples to generate. rng : `numpy.random.Generator` or `numpy.random.RandomState` Random number generator. Returns ------- x : ndarray A 'k' by 'N' containing the initial centroids rF)sizereplacer)dtyper!)choicerVrer?rrtake)datarjrbr idxs rrgrgs[( **TZZ]Q* ?C **S A3 5 5* 6C 77417 %%rc|j|d}tj|}|jdk(rPt j ||}|j |}|j|}||j|z}nA|jd|jdkDr|jj||z d\}}} |j |t|f}|j|}|dddf| z|j|jd|jd z z } || z}nt jt j |j|d | }|j |t|f}|j|}||jj|jz}||z }|S) aReturns k samples of a random variable whose parameters depend on data. More precisely, it returns k observations sampled from a Gaussian random variable whose mean and covariances are the ones estimated from the data. Parameters ---------- data : ndarray Expect a rank 1 or 2 array. Rank 1 is assumed to describe 1-D data, rank 2 multidimensional data, in which case one row is one observation. k : int Number of samples to generate. rng : `numpy.random.Generator` or `numpy.random.RandomState` Random number generator. Returns ------- x : ndarray A 'k' by 'N' containing the initial centroids rr!rr )rpF) full_matricesNr#r$)rGr )rTr>r?rGr(covstandard_normalsqrtrVlinalgsvdr atleast_ndTcholesky) rurjrbr mu_covx_svhsVhs r _krandinitrs. A B 1 A yyA~wwt#   Q  ' JJqM RWWT] AA &99==%=@1b   a_  5 JJqM4j2o 1  2(F GG G~~cggdff41D   a%5  6 JJqM  ""4(** *GA Hrct|j}|dk(r |dddf}|jd}|jt||f}t |D]}|dk(rt ||jd}nt |d|ddf|djd} | | jz } | j} |j} tj| } ttj| | }tj||ddfj!||ddf}|dk(r |dddf}|S)a Picks k points in the data based on the kmeans++ method. Parameters ---------- data : ndarray Expect a rank 1 or 2 array. Rank 1 is assumed to describe 1-D data, rank 2 multidimensional data, in which case one row is one observation. k : int Number of samples to generate. rng : `numpy.random.Generator` or `numpy.random.RandomState` Random number generator. Returns ------- init : ndarray A 'k' by 'N' containing the initial centroids. References ---------- .. [1] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding", Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007. rNr sqeuclidean)metricr!)lenrVemptyrerfr rrKsumcumsumuniformr>r? searchsortedr(r)r*) rurjrbr rGdimsinitrldata_idxD2probscumprobsrs r_kpprAs+4 tzz?D qyAtG} ::a=D 88SVTN #D 1X 9 6#CA7HtBQBqDz4 >BBBJBrvvxKE||~H Azz(+H2??8Q78Hvvd|AqD!%%d8Q;&78 9 qyAqDz Kr)randompointsz++c2tjddy)zPrint a warning when called.LOne of the clusters is empty. Re-run kmeans with a different initialization.r%N)r,r-rrr _missing_warnrxs MMC rctd)z!Raise a ClusterError when called.r)rrrr_missing_raisers H IIr)r-raisect|dkrtd|d t|}t |tr t |} n t ||} t || |}t|| } |jdk(rd} n*|jdk(r|jd} n td t|dkst| dkr td |d k(st| dkDr_|j| jk7r td | jd } |jdkDr{| jd| k7ritdt| } | dkrtd| d| d| | k7rtjdd t|}t|}||| || } tj |}tj | } t#|D]P}t%|| |d }t'j(||| \}}|j+s|| |||<|} R| j!| | j!fS#t$r} td|| d} ~ wwxYw#t$r} td|| d} ~ wwxYw)a Classify a set of observations into k clusters using the k-means algorithm. The algorithm attempts to minimize the Euclidean distance between observations and centroids. Several initialization methods are included. Parameters ---------- data : ndarray A 'M' by 'N' array of 'M' observations in 'N' dimensions or a length 'M' array of 'M' 1-D observations. k : int or ndarray The number of clusters to form as well as the number of centroids to generate. If `minit` initialization string is 'matrix', or if a ndarray is given instead, it is interpreted as initial cluster to use instead. iter : int, optional Number of iterations of the k-means algorithm to run. Note that this differs in meaning from the iters parameter to the kmeans function. thresh : float, optional (not used yet) minit : str, optional Method for initialization. Available methods are 'random', 'points', '++' and 'matrix': 'random': generate k centroids from a Gaussian with mean and variance estimated from the data. 'points': choose k observations (rows) at random from data for the initial centroids. '++': choose k observations accordingly to the kmeans++ method (careful seeding) 'matrix': interpret the k parameter as a k by M (or length k array for 1-D data) array of initial centroids. missing : str, optional Method to deal with empty clusters. Available methods are 'warn' and 'raise': 'warn': give a warning and continue. 'raise': raise an ClusterError and terminate the algorithm. check_finite : bool, optional Whether to check that the input matrices contain only finite numbers. Disabling may give a performance gain, but may result in problems (crashes, non-termination) if the inputs do contain infinities or NaNs. Default: True rng : `numpy.random.Generator`, optional Pseudorandom number generator state. When `rng` is None, a new `numpy.random.Generator` is created using entropy from the operating system. Types other than `numpy.random.Generator` are passed to `numpy.random.default_rng` to instantiate a ``Generator``. Returns ------- centroid : ndarray A 'k' by 'N' array of centroids found at the last iteration of k-means. label : ndarray label[i] is the code or index of the centroid the ith observation is closest to. See Also -------- kmeans References ---------- .. [1] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding", Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007. Examples -------- >>> from scipy.cluster.vq import kmeans2 >>> import matplotlib.pyplot as plt >>> import numpy as np Create z, an array with shape (100, 2) containing a mixture of samples from three multivariate normal distributions. >>> rng = np.random.default_rng() >>> a = rng.multivariate_normal([0, 6], [[2, 1], [1, 1.5]], size=45) >>> b = rng.multivariate_normal([2, 0], [[1, -1], [-1, 3]], size=30) >>> c = rng.multivariate_normal([6, 4], [[5, 0], [0, 1.2]], size=25) >>> z = np.concatenate((a, b, c)) >>> rng.shuffle(z) Compute three clusters. >>> centroid, label = kmeans2(z, 3, minit='points') >>> centroid array([[ 2.22274463, -0.61666946], # may vary [ 0.54069047, 5.86541444], [ 6.73846769, 4.01991898]]) How many points are in each cluster? >>> counts = np.bincount(label) >>> counts array([29, 51, 20]) # may vary Plot the clusters. >>> w0 = z[label == 0] >>> w1 = z[label == 1] >>> w2 = z[label == 2] >>> plt.plot(w0[:, 0], w0[:, 1], 'o', alpha=0.5, label='cluster 0') >>> plt.plot(w1[:, 0], w1[:, 1], 'd', alpha=0.5, label='cluster 1') >>> plt.plot(w2[:, 0], w2[:, 1], 's', alpha=0.5, label='cluster 2') >>> plt.plot(centroid[:, 0], centroid[:, 1], 'k*', label='centroids') >>> plt.axis('equal') >>> plt.legend(shadow=True) >>> plt.show() rzInvalid iter (z), must be a positive integer.zUnknown missing method Nr7rxr$z#Input of rank > 2 is not supported.zEmpty input is not supported.matrixzk array doesn't match data rankrz$k array doesn't match data dimensionzCannot ask kmeans2 for z clusters (k was )z$k was not an integer, was converted.r%zUnknown init method r:)rerH_valid_miss_methKeyErrorrdrrrrGrVr r,r-_valid_init_methr r>r?rfrrrUall)rurjrirYminitmissingrrb miss_mether rAdnc init_methrlabel new_code_bookr_s rrrsv 4y1}>$/MNOOG$W- !S T " T1 % DRl ;Db!I yyA~  a JJqM>??t}qGI.2899 GI.2 99  &>? ? __Q  99q=Y__Q/14CD D ^ 6)"->ykK 9_ MM@Q O <(/I%S)C!$ 3;I ::d D 9%I 4[ "4>qA%(%=%=dE2%N" { K*3[L*AM;, '!  " ::i "**U"3 33s G27+>?QFGL F3E9=>A E Fs/ I? I( I%I  I%( J1JJ)N)T)h㈵>N)rT) rrr-T))__doc__r,numpyr> collectionsrscipy._lib._array_apirrrrrr scipy._lib._utilr r r scipy._libr r(scipy.spatial.distancerr __docformat____all__ Exceptionrrrr@r`rrgrrrrrrrrrrrsAADFF22-(" / 9 ::z$'D49E59E5P4n2(j$$GFc c Hc L&40 f1h)HDI I *NC$$GF19)-v46:v4Hv4r