import warnings
from numbers import Integral

import numpy as np

from ..base import BaseEstimator, TransformerMixin, _fit_context
from ..utils import resample
from ..utils._param_validation import Interval, Options, StrOptions
from ..utils.stats import _averaged_weighted_percentile, _weighted_percentile
from ..utils.validation import (
    _check_feature_names_in,
    _check_sample_weight,
    check_array,
    check_is_fitted,
    validate_data,
)
from ._encoders import OneHotEncoder


class KBinsDiscretizer(TransformerMixin, BaseEstimator):
    """Bin continuous data into intervals.

    Read more in the :ref:`User Guide <preprocessing_discretization>`.

    .. versionadded:: 0.20

    Parameters
    ----------
    n_bins : int or array-like of shape (n_features,), default=5
        The number of bins to produce. Raises ValueError if ``n_bins < 2``.

    encode : {'onehot', 'onehot-dense', 'ordinal'}, default='onehot'
        Method used to encode the transformed result.

        - 'onehot': Encode the transformed result with one-hot encoding
          and return a sparse matrix. Ignored features are always
          stacked to the right.
        - 'onehot-dense': Encode the transformed result with one-hot
          encoding and return a dense array. Ignored features are always
          stacked to the right.
        - 'ordinal': Return the bin identifier encoded as an integer value.

    strategy : {'uniform', 'quantile', 'kmeans'}, default='quantile'
        Strategy used to define the widths of the bins.

        - 'uniform': All bins in each feature have identical widths.
        - 'quantile': All bins in each feature have the same number of points.
        - 'kmeans': Values in each bin have the same nearest center of a 1D
          k-means cluster.

        For an example of the different strategies see:
        :ref:`sphx_glr_auto_examples_preprocessing_plot_discretization_strategies.py`.

    quantile_method : {"inverted_cdf", "averaged_inverted_cdf",
            "closest_observation", "interpolated_inverted_cdf", "hazen",
            "weibull", "linear", "median_unbiased", "normal_unbiased"},
            default="linear"
        Method to pass on to np.percentile calculation when using
        strategy="quantile". Only `averaged_inverted_cdf` and `inverted_cdf`
        support the use of `sample_weight != None` when subsampling is not
        active.

        .. versionadded:: 1.7

    dtype : {np.float32, np.float64}, default=None
        The desired data-type for the output. If None, output dtype is
        consistent with input dtype. Only np.float32 and np.float64 are
        supported.

        .. versionadded:: 0.24

    subsample : int or None, default=200_000
        Maximum number of samples, used to fit the model, for computational
        efficiency. `subsample=None` means that all the training samples are
        used when computing the quantiles that determine the binning
        thresholds. Since quantile computation relies on sorting each column
        of `X` and that sorting has an `n log(n)` time complexity, it is
        recommended to use subsampling on datasets with a very large number
        of samples.

        .. versionchanged:: 1.3
            The default value of `subsample` changed from `None` to `200_000`
            when `strategy="quantile"`.

        .. versionchanged:: 1.5
            The default value of `subsample` changed from `None` to `200_000`
            when `strategy="uniform"` or `strategy="kmeans"`.

    random_state : int, RandomState instance or None, default=None
        Determines random number generation for subsampling.
        Pass an int for reproducible results across multiple function calls.
        See the `subsample` parameter for more details.
        See :term:`Glossary <random_state>`.

        .. versionadded:: 1.1

    Attributes
    ----------
    bin_edges_ : ndarray of ndarray of shape (n_features,)
        The edges of each bin. Contain arrays of varying shapes ``(n_bins_, )``
        Ignored features will have empty arrays.

    n_bins_ : ndarray of shape (n_features,), dtype=np.int64
        Number of bins per feature. Bins whose width are too small
        (i.e., <= 1e-8) are removed with a warning.

    n_features_in_ : int
        Number of features seen during :term:`fit`.

        .. versionadded:: 0.24

    feature_names_in_ : ndarray of shape (`n_features_in_`,)
        Names of features seen during :term:`fit`. Defined only when `X`
        has feature names that are all strings.

        .. versionadded:: 1.0

    See Also
    --------
    Binarizer : Class used to bin values as ``0`` or ``1`` based on a
        parameter ``threshold``.

    Notes
    -----
    For a visualization of discretization on different datasets refer to
    :ref:`sphx_glr_auto_examples_preprocessing_plot_discretization_classification.py`.
    On the effect of discretization on linear models see:
    :ref:`sphx_glr_auto_examples_preprocessing_plot_discretization.py`.

    In bin edges for feature ``i``, the first and last values are used only
    for ``inverse_transform``. During transform, bin edges are extended to::

      np.concatenate([-np.inf, bin_edges_[i][1:-1], np.inf])

    You can combine ``KBinsDiscretizer`` with
    :class:`~sklearn.compose.ColumnTransformer` if you only want to
    preprocess part of the features.

    ``KBinsDiscretizer`` might produce constant features (e.g., when
    ``encode = 'onehot'`` and certain bins do not contain any data). These
    features can be removed with feature selection algorithms
    (e.g., :class:`~sklearn.feature_selection.VarianceThreshold`).

    Examples
    --------
    >>> from sklearn.preprocessing import KBinsDiscretizer
    >>> X = [[-2, 1, -4,   -1],
    ...      [-1, 2, -3, -0.5],
    ...      [ 0, 3, -2,  0.5],
    ...      [ 1, 4, -1,    2]]
    >>> est = KBinsDiscretizer(
    ...     n_bins=3, encode='ordinal', strategy='uniform'
    ... )
    >>> est.fit(X)
    KBinsDiscretizer(...)
    >>> Xt = est.transform(X)
    >>> Xt  # doctest: +SKIP
    array([[ 0., 0., 0., 0.],
           [ 1., 1., 1., 0.],
           [ 2., 2., 2., 1.],
           [ 2., 2., 2., 2.]])

    Sometimes it may be useful to convert the data back into the original
    feature space. The ``inverse_transform`` function converts the binned
    data into the original feature space. Each value will be equal to the
    mean of the two bin edges.

    >>> est.bin_edges_[0]
    array([-2., -1.,  0.,  1.])
    >>> est.inverse_transform(Xt)
    array([[-1.5,  1.5, -3.5, -0.5],
           [-0.5,  2.5, -2.5, -0.5],
           [ 0.5,  3.5, -1.5,  0.5],
           [ 0.5,  3.5, -1.5,  1.5]])
    """

    _parameter_constraints: dict = {
        "n_bins": [Interval(Integral, 2, None, closed="left"), "array-like"],
        "encode": [StrOptions({"onehot", "onehot-dense", "ordinal"})],
        "strategy": [StrOptions({"uniform", "quantile", "kmeans"})],
        "quantile_method": [
            StrOptions(
                {
                    "warn",
                    "inverted_cdf",
                    "averaged_inverted_cdf",
                    "closest_observation",
                    "interpolated_inverted_cdf",
                    "hazen",
                    "weibull",
                    "linear",
                    "median_unbiased",
                    "normal_unbiased",
                }
            )
        ],
        "dtype": [Options(type, {np.float64, np.float32}), None],
        "subsample": [Interval(Integral, 1, None, closed="left"), None],
        "random_state": ["random_state"],
    }

    def __init__(
        self,
        n_bins=5,
        *,
        encode="onehot",
        strategy="quantile",
        quantile_method="warn",
        dtype=None,
        subsample=200_000,
        random_state=None,
    ):
        self.n_bins = n_bins
        self.encode = encode
        self.strategy = strategy
        self.quantile_method = quantile_method
        self.dtype = dtype
        self.subsample = subsample
        self.random_state = random_state

    @_fit_context(prefer_skip_nested_validation=True)
    def fit(self, X, y=None, sample_weight=None):
        """Fit the estimator.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Data to be discretized.

        y : None
            Ignored. This parameter exists only for compatibility with
            :class:`~sklearn.pipeline.Pipeline`.

        sample_weight : ndarray of shape (n_samples,)
            Contains weight values to be associated with each sample.

            .. versionadded:: 1.3

            .. versionchanged:: 1.7
                Added support for strategy="uniform".

        Returns
        -------
        self : object
            Returns the instance itself.
        """
        # The compiled body of this method is not recoverable from the
        # artifact. In outline, `fit`:
        #   * validates `X` (and `sample_weight` via `_check_sample_weight`);
        #   * subsamples `X` with `resample` when n_samples exceeds
        #     `self.subsample`;
        #   * resolves `quantile_method`: "warn" behaves like "linear" and
        #     emits a FutureWarning that the default will become
        #     "averaged_inverted_cdf" in version 1.9; with sample weights and
        #     no subsampling, only "averaged_inverted_cdf" and "inverted_cdf"
        #     are accepted (otherwise a ValueError is raised);
        #   * computes per-feature edges: `np.linspace` for "uniform",
        #     (weighted) percentiles for "quantile", and midpoints between
        #     sorted 1D `KMeans` cluster centers for "kmeans"; a constant
        #     feature is replaced with 0 and triggers a warning;
        #   * removes bins whose width is <= 1e-8 with a warning, stores
        #     `bin_edges_` and `n_bins_`, and fits a `OneHotEncoder` when
        #     "onehot" is in `self.encode`.
        ...

    def _validate_n_bins(self, n_features):
        """Returns n_bins_, the number of bins per feature."""
        orig_bins = self.n_bins
        if isinstance(orig_bins, Integral):
            return np.full(n_features, orig_bins, dtype=int)

        n_bins = check_array(orig_bins, dtype=int, copy=True, ensure_2d=False)

        if n_bins.ndim > 1 or n_bins.shape[0] != n_features:
            raise ValueError(
                "n_bins must be a scalar or array of shape (n_features,)."
            )

        bad_nbins_value = (n_bins < 2) | (n_bins != orig_bins)

        violating_indices = np.where(bad_nbins_value)[0]
        if violating_indices.shape[0] > 0:
            indices = ", ".join(str(i) for i in violating_indices)
            raise ValueError(
                "{} received an invalid number of bins at indices {}. Number "
                "of bins must be at least 2, and must be an int.".format(
                    KBinsDiscretizer.__name__, indices
                )
            )
        return n_bins

    def transform(self, X):
        """Discretize the data.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Data to be discretized.

        Returns
        -------
        Xt : {ndarray, sparse matrix}, dtype={np.float32, np.float64}
            Data in the binned space. Will be a sparse matrix if
            `self.encode='onehot'` and ndarray otherwise.
        """
        check_is_fitted(self)

        # check input and attribute dtypes
        dtype = (np.float64, np.float32) if self.dtype is None else self.dtype
        Xt = validate_data(self, X, copy=True, dtype=dtype, reset=False)

        bin_edges = self.bin_edges_
        for jj in range(Xt.shape[1]):
            Xt[:, jj] = np.searchsorted(
                bin_edges[jj][1:-1], Xt[:, jj], side="right"
            )

        if self.encode == "ordinal":
            return Xt

        dtype_init = None
        if "onehot" in self.encode:
            dtype_init = self._encoder.dtype
            self._encoder.dtype = Xt.dtype
        try:
            Xt_enc = self._encoder.transform(Xt)
        finally:
            # revert the initial dtype to avoid modifying self.
            self._encoder.dtype = dtype_init
        return Xt_enc

    def inverse_transform(self, X):
        """Transform discretized data back to original feature space.

        Note that this function does not regenerate the original data
        due to discretization rounding.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Transformed data in the binned space.

        Returns
        -------
        X_original : ndarray, dtype={np.float32, np.float64}
            Data in the original feature space.
        """
        check_is_fitted(self)

        if "onehot" in self.encode:
            X = self._encoder.inverse_transform(X)

        Xinv = check_array(X, copy=True, dtype=(np.float64, np.float32))
        n_features = self.n_bins_.shape[0]
        if Xinv.shape[1] != n_features:
            raise ValueError(
                "Incorrect number of features. Expecting {}, received {}.".format(
                    n_features, Xinv.shape[1]
                )
            )

        for jj in range(n_features):
            bin_edges = self.bin_edges_[jj]
            bin_centers = (bin_edges[1:] + bin_edges[:-1]) * 0.5
            Xinv[:, jj] = bin_centers[(Xinv[:, jj]).astype(np.int64)]

        return Xinv

    def get_feature_names_out(self, input_features=None):
        """Get output feature names.

        Parameters
        ----------
        input_features : array-like of str or None, default=None
            Input features.

            - If `input_features` is `None`, then `feature_names_in_` is
              used as feature names in. If `feature_names_in_` is not
              defined, then the following input feature names are generated:
              `["x0", "x1", ..., "x(n_features_in_ - 1)"]`.
            - If `input_features` is an array-like, then `input_features`
              must match `feature_names_in_` if `feature_names_in_` is
              defined.

        Returns
        -------
        feature_names_out : ndarray of str objects
            Transformed feature names.
        """
        check_is_fitted(self, "n_features_in_")
        input_features = _check_feature_names_in(self, input_features)
        if hasattr(self, "_encoder"):
            return self._encoder.get_feature_names_out(input_features)

        # ordinal encoding
        return input_features
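The Notes section states that during transform only the interior bin edges are used (the outermost edges act as -inf/+inf). As a standalone cross-check of that rule under `strategy="uniform"` with `encode="ordinal"`, here is a minimal NumPy-only sketch reproducing the first column of the docstring example; the helper names are illustrative, not part of scikit-learn's API:

```python
import numpy as np


def uniform_bin_edges(col, n_bins):
    # strategy="uniform": n_bins + 1 equal-width edges spanning [min, max].
    return np.linspace(col.min(), col.max(), n_bins + 1)


def ordinal_codes(col, edges):
    # transform uses only the interior edges; values at or beyond the
    # outermost edges fall into the first/last bin.
    return np.searchsorted(edges[1:-1], col, side="right")


col = np.array([-2.0, -1.0, 0.0, 1.0])    # first column of the docstring X
edges = uniform_bin_edges(col, n_bins=3)  # array([-2., -1., 0., 1.])
codes = ordinal_codes(col, edges)         # array([0, 1, 2, 2])
```

The edges match `est.bin_edges_[0]` from the docstring example, and the codes match the first column of `Xt` there.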