Examples -------- >>> from scipy import stats >>> data = [6, 9, 12, 7, 8, 8, 13] >>> mean, var, std = stats.mvsdist(data) We now have frozen distribution objects "mean", "var" and "std" that we can examine: >>> mean.mean() 9.0 >>> mean.interval(0.95) (6.6120585482655692, 11.387941451734431) >>> mean.std() 1.1952286093343936 zNeed at least 2 data-points.i)locscale@r!)rf) rlenrVrWvarr.normmathrtgengammainvgamma) rYxnxbarCmdistsdistvdistnm1facvals rar2r2s.n d A AA1u788 668D A4x""t499QU3CD""tyy|499Q"q&\;RS""q #'0BQ0FG!e!ebjBhTYYq3w5GH&&sBdiinE&&s#6 % rbc|SNrps rar~arbc|fSr{r|rp_s rar~r~qdrb)result_to_tuple n_outputs default_axisrdaxisct|}|j|}|dkDs|dkr tdt|}||j |d}d}t |||}dgt d|dzDcgc]}|j||z|c}z}|dk(r |dd z|z S|d k(r||d z|dd zz ||d z zz S|d k(r8d |dd zzd |z|dz|d zz ||z|d zz||d z z|d z zz S|dk(rtd |ddzzd|z|dd zz|d zzd |z|d z z|d d zzz d|z|dzz|dz|d zz ||z|dzz|dzz||d z z|d z z|dz zz Stdcc}w)a7 Return the `n` th k-statistic ( ``1<=n<=4`` so far). The `n` th k-statistic ``k_n`` is the unique symmetric unbiased estimator of the `n` th cumulant :math:`\kappa_n` [1]_ [2]_. Parameters ---------- data : array_like Input array. n : int, {1, 2, 3, 4}, optional Default is equal to 2. axis : int or None, default: None If an int, the axis of the input along which to compute the statistic. The statistic of each axis-slice (e.g. row) of the input will appear in a corresponding element of the output. If ``None``, the input will be raveled before computing the statistic. Returns ------- kstat : float The `n` th k-statistic. See Also -------- kstatvar : Returns an unbiased estimator of the variance of the k-statistic moment : Returns the n-th central moment about the mean for a sample. Notes ----- For a sample size :math:`n`, the first few k-statistics are given by .. math:: k_1 &= \frac{S_1}{n}, \\ k_2 &= \frac{nS_2 - S_1^2}{n(n-1)}, \\ k_3 &= \frac{2S_1^3 - 3nS_1S_2 + n^2S_3}{n(n-1)(n-2)}, \\ k_4 &= \frac{-6S_1^4 + 12nS_1^2S_2 - 3n(n-1)S_2^2 - 4n(n+1)S_1S_3 + n^2(n+1)S_4}{n (n-1)(n-2)(n-3)}, where .. math:: S_r \equiv \sum_{i=1}^n X_i^r, and :math:`X_i` is the :math:`i` th data point. References ---------- .. [1] http://mathworld.wolfram.com/k-Statistic.html .. [2] http://mathworld.wolfram.com/Cumulant.html Examples -------- >>> from scipy import stats >>> from numpy.random import default_rng >>> rng = default_rng() As sample size increases, `n`-th moment and `n`-th k-statistic converge to the same number (although they aren't identical). In the case of the normal distribution, they converge to zero. >>> for i in range(2,8): ... x = rng.normal(size=10**i) ... m, k = stats.moment(x, 3), stats.kstat(x, 3) ... print(f"{i=}: {m=:.3g}, {k=:.3g}, {(m-k)=:.3g}") i=2: m=-0.631, k=-0.651, (m-k)=0.0194 # random i=3: m=0.0282, k=0.0283, (m-k)=-8.49e-05 i=4: m=-0.0454, k=-0.0454, (m-k)=1.36e-05 i=6: m=7.53e-05, k=7.53e-05, (m-k)=-2.26e-09 i=7: m=0.00166, k=0.00166, (m-k)=-4.99e-09 i=8: m=-2.88e-06 k=-2.88e-06, (m-k)=8.63e-13 r!z'k-statistics only supported for 1<=n<=4Nrxpr?rdrgi @zShould not be here.)rr rVintreshaper,rangesum)rYrqrrNkSs rar4r4s^  B ::d D1uABCC AA |zz$&$,A eAq1uoF"&&qt&,FFAAvtcz!| a!A$1s"q!c'{33 a!A$' AaC!HQqTM)AaC!H4AGa#g9NOO aAaD!Gbd1Q47lQqT11AaC3K!a4GG1ac1Q4!$%'(sAaCy1~6AcEAcE"AcE*, -.//Gs4Fc|Sr{r|r}s rar~r~Jrrbc|fSr{r|rs rar~r~JrrbcLt|}|j|}||j|d}d}t|||}|dk(rt |d|ddz|z S|dk(r;t |d|d}t |d |d}d|z|dzz|dz |zz||dzzz St d ) azReturn an unbiased estimator of the variance of the k-statistic. See `kstat` and [1]_ for more details about the k-statistic. Parameters ---------- data : array_like Input array. 
n : int, {1, 2}, optional Default is equal to 2. axis : int or None, default: None If an int, the axis of the input along which to compute the statistic. The statistic of each axis-slice (e.g. row) of the input will appear in a corresponding element of the output. If ``None``, the input will be raveled before computing the statistic. Returns ------- kstatvar : float The `n` th k-statistic variance. See Also -------- kstat : Returns the n-th k-statistic. moment : Returns the n-th central moment about the mean for a sample. Notes ----- Unbiased estimators of the variances of the first two k-statistics are given by .. math:: \mathrm{var}(k_1) &= \frac{k_2}{n}, \\ \mathrm{var}(k_2) &= \frac{2k_2^2n + (n-1)k_4}{n(n - 1)}. References ---------- .. [1] http://mathworld.wolfram.com/k-Statistic.html rrrr!rdT)rqr_no_decorrzOnly n=1 or n=2 supported.)rr rr,r4rV)rYrqrrrk2k4s rar5r5IsX  B ::d D |zz$&$,AAvTQTD9C?AA a 414$ 7 414$ 7!BE QqS"H$AaC11566rbctj|tj}dd|z z|d<d|dz |d<tjd|}|dz |d zz |dd|S) aIApproximations of uniform order statistic medians. Parameters ---------- n : int Sample size. Returns ------- v : 1d float array Approximations of the order statistic medians. References ---------- .. [1] James J. Filliben, "The Probability Plot Correlation Coefficient Test for Normality", Technometrics, Vol. 17, pp. 111-117, 1975. Examples -------- Order statistics of the uniform distribution on the unit interval are marginally distributed according to beta distributions. The expectations of these order statistic are evenly spaced across the interval, but the distributions are skewed in a way that pushes the medians slightly towards the endpoints of the unit interval: >>> import numpy as np >>> n = 4 >>> k = np.arange(1, n+1) >>> from scipy.stats import beta >>> a = k >>> b = n-k+1 >>> beta.mean(a, b) array([0.2, 0.4, 0.6, 0.8]) >>> beta.median(a, b) array([0.15910358, 0.38572757, 0.61427243, 0.84089642]) The Filliben approximation uses the exact medians of the smallest and greatest order statistics, and the remaining medians are approximated by points spread evenly across a sub-interval of the unit interval: >>> from scipy.stats._morestats import _calc_uniform_order_statistic_medians >>> _calc_uniform_order_statistic_medians(n) array([0.15910358, 0.38545246, 0.61454754, 0.84089642]) This plot shows the skewed distributions of the order statistics of a sample of size four from a uniform distribution on the unit interval: >>> import matplotlib.pyplot as plt >>> x = np.linspace(0.0, 1.0, num=50, endpoint=True) >>> pdfs = [beta.pdf(x, a[i], b[i]) for i in range(n)] >>> plt.figure() >>> plt.plot(x, pdfs[0], x, pdfs[1], x, pdfs[2], x, pdfs[3]) dtype?rrr!rrdgRQ?g\(\?)npemptyfloat64r )rqr\is ra%_calc_uniform_order_statistic_mediansrsgn "**%A #'NAbE qu9AaD !QA6za%i(AaG HrbTct|tr |St|tr tt|}|S|r d}t ||S#t $r}t |d|d}~wwxYw)aParse `dist` keyword. Parameters ---------- dist : str or stats.distributions instance. Several functions take `dist` as a keyword, hence this utility function. enforce_subclass : bool, optional If True (default), `dist` needs to be a `_distn_infrastructure.rv_generic` instance. It can sometimes be useful to set this keyword to False, if a function wants to accept objects that just look somewhat like such an instance (for example, they have a ``ppf`` method). z! is not a valid distribution nameNza`dist` should be a stats.distributions instance or a string with the name of such a distribution.) 
isinstancer/strgetattrr.AttributeErrorrV)distenforce_subclassemsgs ra_parse_dist_kwrs $ #  K D#  P=$/D K 7o K Pv%FGHa O PsA A%A  A%c  t|dr4|j||j||j|y|j ||j ||j |y#t$rYywxYw)z>Helper function to add axes labels and a title to stats plots. set_titleN)hasattrr set_xlabel set_ylabeltitlexlabelylabel Exception)plotrrrs ra_add_axis_labels_titlerso 4 % NN5 ! OOF # OOF # JJu  KK  KK     s?A63A66 BBFctj|}|jdk(r+|r%||ftjtjdffS||fSt t |}t |d}|d}t|r|f}t|ts t|}|j|g|}t|}|rtj||\} } } } } ||j||d|r|j| |z zdt|dd d |r_|r]t!|}t#|}t!|}t#|}|d ||z zz}|d ||z zz}|j%||d dzdd|r ||f   ffS||fS)a Calculate quantiles for a probability plot, and optionally show the plot. Generates a probability plot of sample data against the quantiles of a specified theoretical distribution (the normal distribution by default). `probplot` optionally calculates a best-fit line for the data and plots the results using Matplotlib or a given plot function. Parameters ---------- x : array_like Sample/response data from which `probplot` creates the plot. sparams : tuple, optional Distribution-specific shape parameters (shape parameters plus location and scale). dist : str or stats.distributions instance, optional Distribution or distribution function name. The default is 'norm' for a normal probability plot. Objects that look enough like a stats.distributions instance (i.e. they have a ``ppf`` method) are also accepted. fit : bool, optional Fit a least-squares regression (best-fit) line to the sample data if True (default). plot : object, optional If given, plots the quantiles. If given and `fit` is True, also plots the least squares fit. `plot` is an object that has to have methods "plot" and "text". The `matplotlib.pyplot` module or a Matplotlib Axes object can be used, or a custom object with the same methods. Default is None, which means that no plot is created. rvalue : bool, optional If `plot` is provided and `fit` is True, setting `rvalue` to True includes the coefficient of determination on the plot. Default is False. Returns ------- (osm, osr) : tuple of ndarrays Tuple of theoretical quantiles (osm, or order statistic medians) and ordered responses (osr). `osr` is simply sorted input `x`. For details on how `osm` is calculated see the Notes section. (slope, intercept, r) : tuple of floats, optional Tuple containing the result of the least-squares fit, if that is performed by `probplot`. `r` is the square root of the coefficient of determination. If ``fit=False`` and ``plot=None``, this tuple is not returned. Notes ----- Even if `plot` is given, the figure is not shown or saved by `probplot`; ``plt.show()`` or ``plt.savefig('figname.png')`` should be used after calling `probplot`. `probplot` generates a probability plot, which should not be confused with a Q-Q or a P-P plot. Statsmodels has more extensive functionality of this type, see ``statsmodels.api.ProbPlot``. The formula used for the theoretical quantiles (horizontal axis of the probability plot) is Filliben's estimate:: quantiles = dist.ppf(val), for 0.5**(1/n), for i = n val = (i - 0.3175) / (n + 0.365), for i = 2, ..., n-1 1 - 0.5**(1/n), for i = 1 where ``i`` indicates the i-th ordered value and ``n`` is the total number of values. 
Examples -------- >>> import numpy as np >>> from scipy import stats >>> import matplotlib.pyplot as plt >>> nsample = 100 >>> rng = np.random.default_rng() A t distribution with small degrees of freedom: >>> ax1 = plt.subplot(221) >>> x = stats.t.rvs(3, size=nsample, random_state=rng) >>> res = stats.probplot(x, plot=plt) A t distribution with larger degrees of freedom: >>> ax2 = plt.subplot(222) >>> x = stats.t.rvs(25, size=nsample, random_state=rng) >>> res = stats.probplot(x, plot=plt) A mixture of two normal distributions with broadcasting: >>> ax3 = plt.subplot(223) >>> x = stats.norm.rvs(loc=[0,5], scale=[1,1.5], ... size=(nsample//2,2), random_state=rng).ravel() >>> res = stats.probplot(x, plot=plt) A standard normal distribution: >>> ax4 = plt.subplot(224) >>> x = stats.norm.rvs(loc=0, scale=1, size=nsample, random_state=rng) >>> res = stats.probplot(x, plot=plt) Produce a new figure with a loggamma distribution, using the ``dist`` and ``sparams`` keywords: >>> fig = plt.figure() >>> ax = fig.add_subplot(111) >>> x = stats.loggamma.rvs(c=2.5, size=500, random_state=rng) >>> res = stats.probplot(x, dist=stats.loggamma, sparams=(2.5,), plot=ax) >>> ax.set_title("Probplot for loggamma dist with shape parameter 2.5") Show the results with Matplotlib: >>> plt.show() rF)rr|bozr-zTheoretical quantileszOrdered ValueszProbability Plotrrrgffffff?{Gz?z$R^2=rdz1.4f$)rr sizenanrrirrrtupleppfr r$ linregressrrr rtext)rpsparamsrfitrrvalue osm_uniformosmosrslope interceptrprobrxminxmaxyminymaxposxposys rar6r6sj 1 Avv{ q6BFFBFFC00 0a4K7A?K $ 7D* gu %. $((; ) )C q'C '0';';C'E$y!T1  #sD!  IIc59y0$ 7t,C&6%7 9 69D9D7D7D$$+..D$$+..D IIdDE!q&a"8 9 SzE9a000Cxrbct|}tt|}t|}d}t j |||||j fS)a Calculate the shape parameter that maximizes the PPCC. The probability plot correlation coefficient (PPCC) plot can be used to determine the optimal shape parameter for a one-parameter family of distributions. ``ppcc_max`` returns the shape parameter that would maximize the probability plot correlation coefficient for the given data to a one-parameter family of distributions. Parameters ---------- x : array_like Input array. brack : tuple, optional Triple (a,b,c) where (a>> import numpy as np >>> from scipy import stats >>> import matplotlib.pyplot as plt >>> rng = np.random.default_rng() >>> c = 2.5 >>> x = stats.weibull_min.rvs(c, scale=4, size=2000, random_state=rng) Generate the PPCC plot for this data with the Weibull distribution. >>> fig, ax = plt.subplots(figsize=(8, 6)) >>> res = stats.ppcc_plot(x, c/2, 2*c, dist='weibull_min', plot=ax) We calculate the value where the shape should reach its maximum and a red line is drawn there. The line should coincide with the highest point in the PPCC graph. >>> cmax = stats.ppcc_max(x, brack=(c/2, 2*c), dist='weibull_min') >>> ax.axvline(cmax, color='r') >>> plt.show() cP|||}tj||\}}d|z SNr!)r$pearsonr)shapemiyvalsfuncxvalsrrs ratempfunczppcc_max..tempfuncs,R$$UE241u rbbrackargs)rrrir rbrentr)rprrrrrs rar7r7sPN $ D7A?K q'C  >>(% +S$((; ==rbc.||kr tdtj|||}tj|}t |D]\}} t || |d\} } | d||<!|&|j ||dt|ddd |d  ||fS) aT Calculate and optionally plot probability plot correlation coefficient. The probability plot correlation coefficient (PPCC) plot can be used to determine the optimal shape parameter for a one-parameter family of distributions. It cannot be used for distributions without shape parameters (like the normal distribution) or with multiple shape parameters. By default a Tukey-Lambda distribution (`stats.tukeylambda`) is used. 
A Tukey-Lambda PPCC plot interpolates from long-tailed to short-tailed distributions via an approximately normal one, and is therefore particularly useful in practice. Parameters ---------- x : array_like Input array. a, b : scalar Lower and upper bounds of the shape parameter to use. dist : str or stats.distributions instance, optional Distribution or distribution function name. Objects that look enough like a stats.distributions instance (i.e. they have a ``ppf`` method) are also accepted. The default is ``'tukeylambda'``. plot : object, optional If given, plots PPCC against the shape parameter. `plot` is an object that has to have methods "plot" and "text". The `matplotlib.pyplot` module or a Matplotlib Axes object can be used, or a custom object with the same methods. Default is None, which means that no plot is created. N : int, optional Number of points on the horizontal axis (equally distributed from `a` to `b`). Returns ------- svals : ndarray The shape values for which `ppcc` was calculated. ppcc : ndarray The calculated probability plot correlation coefficient values. See Also -------- ppcc_max, probplot, boxcox_normplot, tukeylambda References ---------- J.J. Filliben, "The Probability Plot Correlation Coefficient Test for Normality", Technometrics, Vol. 17, pp. 111-117, 1975. Examples -------- First we generate some random data from a Weibull distribution with shape parameter 2.5, and plot the histogram of the data: >>> import numpy as np >>> from scipy import stats >>> import matplotlib.pyplot as plt >>> rng = np.random.default_rng() >>> c = 2.5 >>> x = stats.weibull_min.rvs(c, scale=4, size=2000, random_state=rng) Take a look at the histogram of the data. >>> fig1, ax = plt.subplots(figsize=(9, 4)) >>> ax.hist(x, bins=50) >>> ax.set_title('Histogram of x') >>> plt.show() Now we explore this data with a PPCC plot as well as the related probability plot and Box-Cox normplot. A red line is drawn where we expect the PPCC value to be maximal (at the shape parameter ``c`` used above): >>> fig2 = plt.figure(figsize=(12, 4)) >>> ax1 = fig2.add_subplot(1, 3, 1) >>> ax2 = fig2.add_subplot(1, 3, 2) >>> ax3 = fig2.add_subplot(1, 3, 3) >>> res = stats.probplot(x, plot=ax1) >>> res = stats.boxcox_normplot(x, -4, 4, plot=ax2) >>> res = stats.ppcc_plot(x, c/2, 2*c, dist='weibull_min', plot=ax3) >>> ax3.axvline(c, color='r') >>> plt.show() z`b` has to be larger than `a`.numTrrrrpz Shape ValuesProb Plot Corr. Coef.(z ) PPCC Plotr)rVrlinspace empty_like enumerater6rr) rpabrrrsvalsppccrsvalrr2s rar8r8sj Av9:: KK1! $E == DU#4Dt62R&Q  %s#tN&='(k%: < $;rbcxtj||dtj|j|z S)NT)rkeepdims)r logsumexprlrr)logxrs ra _log_meanrYs5 $TD9 ((4::d# $ %rbcr|jt|||j}|j|}t j |j ||fdd|j || fdd\}}t j d|z|tj|j|z S)NrrT)rr return_signrd) broadcast_torr ones_likerrstackrlr)rrrlogmeanoneslogxmurs ra_log_varrasooi48$**EG << D!!"((D'?("C!$&HHdTE]H$CQUWIFA   QZd 3dhhtzz$?O6P PPrb propagate)rr nan_policycJi}|dur|||<|dk7r|||<t|f||d|S)aThe boxcox log-likelihood function. Parameters ---------- lmb : scalar Parameter for Box-Cox transformation. See `boxcox` for details. data : array_like Data to calculate Box-Cox log-likelihood for. If `data` is multi-dimensional, the log-likelihood is calculated along the first axis. axis : int, default: 0 If an int, the axis of the input along which to compute the statistic. The statistic of each axis-slice (e.g. row) of the input will appear in a corresponding element of the output. If ``None``, the input will be raveled before computing the statistic. 
nan_policy : {'propagate', 'omit', 'raise' Defines how to handle input NaNs. - ``propagate``: if a NaN is present in the axis slice (e.g. row) along which the statistic is computed, the corresponding entry of the output will be NaN. - ``omit``: NaNs will be omitted when performing the calculation. If insufficient data remains in the axis slice along which the statistic is computed, the corresponding entry of the output will be NaN. - ``raise``: if a NaN is present, a ``ValueError`` will be raised. keepdims : bool, default: False If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array. Returns ------- llf : float or ndarray Box-Cox log-likelihood of `data` given `lmb`. A float for 1-D `data`, an array otherwise. See Also -------- boxcox, probplot, boxcox_normplot, boxcox_normmax Notes ----- The Box-Cox log-likelihood function :math:`l` is defined here as .. math:: l = (\lambda - 1) \sum_i^N \log(x_i) - \frac{N}{2} \log\left(\sum_i^N (y_i - \bar{y})^2 / N\right), where :math:`N` is the number of data points ``data`` and :math:`y` is the Box-Cox transformed input data. This corresponds to the *profile log-likelihood* of the original data :math:`x` with some constant terms dropped. Examples -------- >>> import numpy as np >>> from scipy import stats >>> import matplotlib.pyplot as plt >>> from mpl_toolkits.axes_grid1.inset_locator import inset_axes Generate some random variates and calculate Box-Cox log-likelihood values for them for a range of ``lmbda`` values: >>> rng = np.random.default_rng() >>> x = stats.loggamma.rvs(5, loc=10, size=1000, random_state=rng) >>> lmbdas = np.linspace(-2, 10) >>> llf = np.zeros(lmbdas.shape, dtype=float) >>> for ii, lmbda in enumerate(lmbdas): ... llf[ii] = stats.boxcox_llf(lmbda, x) Also find the optimal lmbda value with `boxcox`: >>> x_most_normal, lmbda_optimal = stats.boxcox(x) Plot the log-likelihood as function of lmbda. Add the optimal lmbda as a horizontal line to check that that's really the optimum: >>> fig = plt.figure() >>> ax = fig.add_subplot(111) >>> ax.plot(lmbdas, llf, 'b.-') >>> ax.axhline(stats.boxcox_llf(lmbda_optimal, x), color='r') >>> ax.set_xlabel('lmbda parameter') >>> ax.set_ylabel('Box-Cox log-likelihood') Now add some probability plots to show that where the log-likelihood is maximized the data transformed with `boxcox` looks closest to normal: >>> locs = [3, 10, 4] # 'lower left', 'center', 'lower right' >>> for lmbda, loc in zip([-1, lmbda_optimal, 9], locs): ... xt = stats.boxcox(x, lmbda=lmbda) ... (osm, osr), (slope, intercept, r_sq) = stats.probplot(xt) ... ax_inset = inset_axes(ax, width="20%", height="20%", loc=loc) ... ax_inset.plot(osm, osr, 'c.', osm, slope*osm + intercept, 'k-') ... ax_inset.set_xticklabels([]) ... ax_inset.set_yticklabels([]) ... 
ax_inset.set_title(r'$\lambda=%1.2f$' % lmbda) >>> plt.show() Fr)lmbr) _boxcox_llf)rrYrrrkwargss rar9r9jsEPFu#x[ 'z t :4 :6 ::rbc|Sr{r|r}s rar~r~sArbc|fSr{r|rs rar~r~strb)rrrc t|}t||d|\}}|j|}|dk(r t||S|j |}|dk(r#|j |j ||}n4||z}t |||dtjt|zz }|dz |j||z|dz |zz }|j||jd }|jdk(r|d }|S|}|S) NTforce_floatingrrrrrdr!Fcopyr|) rr rrrrjrrlabsrastyperndim) rYrrrrlogdatalogvarrress rar r s   B3TbAIC 4AAv$$ffTlG axwT23 W}$D)AS0B,BB 7bffW4f0 01Q3< ?C ))C%) 0CXX]#b'C J),C Jrbcdtjjd|z dz}t|||z }d}|dz}d}||||dkDr"|dkr|dz }|dz }||||dkDr|dkr|dk(r t dt j |||||f }|dz }d}||||dkDr"|dkr|dz}|dz }||||dkDr|dkr|dk(r t dt j |||||f } | |fS) Nrr!c t|||z Sr{r9)lmbdarYtargets rarootfuncz'_boxcox_conf_interval..rootfuncs%&//rbrri皙?zCould not find endpoint.r)r.chi2rr9 RuntimeErrorrbrentq) rplmaxrZrxrrnewlmrlmpluslmminuss ra_boxcox_conf_intervalr'sC  ""&&q5y!4 4C a 3 &F0 3JE A E1f % +!c'   Q E1f % +!c' Cx566 __XtU!V EF 3JE A E1f % +!c'   Q E1f % +!c' Cx566ooht1f+FG F?rbctj|}|tj||S|jdk7r t d|j dk(r|Stj||dk(r t dtj|dkr t dt|d|}t||}|||fSt|||}|||fS)aReturn a dataset transformed by a Box-Cox power transformation. Parameters ---------- x : ndarray Input array to be transformed. If `lmbda` is not None, this is an alias of `scipy.special.boxcox`. Returns nan if ``x < 0``; returns -inf if ``x == 0 and lmbda < 0``. If `lmbda` is None, array must be positive, 1-dimensional, and non-constant. lmbda : scalar, optional If `lmbda` is None (default), find the value of `lmbda` that maximizes the log-likelihood function and return it as the second output argument. If `lmbda` is not None, do the transformation for that value. alpha : float, optional If `lmbda` is None and `alpha` is not None (default), return the ``100 * (1-alpha)%`` confidence interval for `lmbda` as the third output argument. Must be between 0.0 and 1.0. If `lmbda` is not None, `alpha` is ignored. optimizer : callable, optional If `lmbda` is None, `optimizer` is the scalar optimizer used to find the value of `lmbda` that minimizes the negative log-likelihood function. `optimizer` is a callable that accepts one argument: fun : callable The objective function, which evaluates the negative log-likelihood function at a provided value of `lmbda` and returns an object, such as an instance of `scipy.optimize.OptimizeResult`, which holds the optimal value of `lmbda` in an attribute `x`. See the example in `boxcox_normmax` or the documentation of `scipy.optimize.minimize_scalar` for more information. If `lmbda` is not None, `optimizer` is ignored. Returns ------- boxcox : ndarray Box-Cox power transformed array. maxlog : float, optional If the `lmbda` parameter is None, the second returned argument is the `lmbda` that maximizes the log-likelihood function. (min_ci, max_ci) : tuple of float, optional If `lmbda` parameter is None and `alpha` is not None, this returned tuple of floats represents the minimum and maximum confidence limits given `alpha`. See Also -------- probplot, boxcox_normplot, boxcox_normmax, boxcox_llf Notes ----- The Box-Cox transform is given by: .. math:: y = \begin{cases} \frac{x^\lambda - 1}{\lambda}, &\text{for } \lambda \neq 0 \log(x), &\text{for } \lambda = 0 \end{cases} `boxcox` requires the input data to be positive. Sometimes a Box-Cox transformation provides a shift parameter to achieve this; `boxcox` does not. Such a shift parameter is equivalent to adding a positive constant to `x` before calling `boxcox`. The confidence limits returned when `alpha` is provided give the interval where: .. 
math:: l(\hat{\lambda}) - l(\lambda) < \frac{1}{2}\chi^2(1 - \alpha, 1), with :math:`l` the log-likelihood function and :math:`\chi^2` the chi-squared function. References ---------- G.E.P. Box and D.R. Cox, "An Analysis of Transformations", Journal of the Royal Statistical Society B, 26, 211-252 (1964). Examples -------- >>> from scipy import stats >>> import matplotlib.pyplot as plt We generate some random variates from a non-normal distribution and make a probability plot for it, to show it is non-normal in the tails: >>> fig = plt.figure() >>> ax1 = fig.add_subplot(211) >>> x = stats.loggamma.rvs(5, size=500) + 5 >>> prob = stats.probplot(x, dist=stats.norm, plot=ax1) >>> ax1.set_xlabel('') >>> ax1.set_title('Probplot against normal distribution') We now use `boxcox` to transform the data so it's closest to normal: >>> ax2 = fig.add_subplot(212) >>> xt, _ = stats.boxcox(x) >>> prob = stats.probplot(xt, dist=stats.norm, plot=ax2) >>> ax2.set_title('Probplot after Box-Cox transformation') >>> plt.show() r!zData must be 1-dimensional.rzData must not be constant.Data must be positive.mle)method optimizer) rr rr:rrVrallanyr;r')rprrZr,r#yrXs rar:r:sn 1 A ~~a''vv{677vv{ vva1Q4i566 vva1f~122 !EY ?Dq$A }$w)D%8$  rbctj|d|z z tj|z|z d}tj| tj|z d|z z S)Nr)rr!)rlambertwrrreal)rpr/rs ra_boxcox_inv_lmbdar3sX   Q26]+bffQi7!;r BC 77C4"&&)#a!e+ ,,rbceZdZdZy) _BigFloatcy)N BIG_FLOATr|selfs ra__repr__z_BigFloat.__repr__srbN)__name__ __module__ __qualname__r:r|rbrar5r5srbr5)rcltj|}tjtj||dk\zs d}t |d}|t urstj |jtjr |jntj}tj|jdz }d|d}n|dkr t d dfd n(ts t d  t d fd fd fdfd}|d} || jvrt d|d| |} | |} | d}t |tj|s4tj|tj|} } | dk\r| }nc| dkr| }n[t!j"| | t%t!j"| | kD}t'| tj(r|d}|r| n| }t%t!j"|| |kD}tj*|rhd| d|z}t-j.|dt1||tj2|dz z}t'| tj(r|| |<| S|} | S)aCompute optimal Box-Cox transform parameter for input data. Parameters ---------- x : array_like Input array. All entries must be positive, finite, real numbers. brack : 2-tuple, optional, default (-2.0, 2.0) The starting interval for a downhill bracket search for the default `optimize.brent` solver. Note that this is in most cases not critical; the final result is allowed to be outside this bracket. If `optimizer` is passed, `brack` must be None. method : str, optional The method to determine the optimal transform parameter (`boxcox` ``lmbda`` parameter). Options are: 'pearsonr' (default) Maximizes the Pearson correlation coefficient between ``y = boxcox(x)`` and the expected values for ``y`` if `x` would be normally-distributed. 'mle' Maximizes the log-likelihood `boxcox_llf`. This is the method used in `boxcox`. 'all' Use all optimization methods available, and return all results. Useful to compare different methods. optimizer : callable, optional `optimizer` is a callable that accepts one argument: fun : callable The objective function to be minimized. `fun` accepts one argument, the Box-Cox transform parameter `lmbda`, and returns the value of the function (e.g., the negative log-likelihood) at the provided argument. The job of `optimizer` is to find the value of `lmbda` that *minimizes* `fun`. and returns an object, such as an instance of `scipy.optimize.OptimizeResult`, which holds the optimal value of `lmbda` in an attribute `x`. See the example below or the documentation of `scipy.optimize.minimize_scalar` for more information. ymax : float, optional The unconstrained optimal transform parameter may cause Box-Cox transformed data to have extreme magnitude or even overflow. 
This parameter constrains MLE optimization such that the magnitude of the transformed `x` does not exceed `ymax`. The default is the maximum value of the input dtype. If set to infinity, `boxcox_normmax` returns the unconstrained optimal lambda. Ignored when ``method='pearsonr'``. Returns ------- maxlog : float or ndarray The optimal transform parameter found. An array instead of a scalar for ``method='all'``. See Also -------- boxcox, boxcox_llf, boxcox_normplot, scipy.optimize.minimize_scalar Examples -------- >>> import numpy as np >>> from scipy import stats >>> import matplotlib.pyplot as plt We can generate some data and determine the optimal ``lmbda`` in various ways: >>> rng = np.random.default_rng() >>> x = stats.loggamma.rvs(5, size=30, random_state=rng) + 5 >>> y, lmax_mle = stats.boxcox(x) >>> lmax_pearsonr = stats.boxcox_normmax(x) >>> lmax_mle 2.217563431465757 >>> lmax_pearsonr 2.238318660200961 >>> stats.boxcox_normmax(x, method='all') array([2.23831866, 2.21756343]) >>> fig = plt.figure() >>> ax = fig.add_subplot(111) >>> prob = stats.boxcox_normplot(x, -10, 10, plot=ax) >>> ax.axvline(lmax_mle, color='r') >>> ax.axvline(lmax_pearsonr, color='g', ls='--') >>> plt.show() Alternatively, we can define our own `optimizer` function. Suppose we are only interested in values of `lmbda` on the interval [6, 7], we want to use `scipy.optimize.minimize_scalar` with ``method='bounded'``, and we want to use tighter tolerances when optimizing the log-likelihood function. To do this, we define a function that accepts positional argument `fun` and uses `scipy.optimize.minimize_scalar` to minimize `fun` subject to the provided bounds and tolerances: >>> from scipy import optimize >>> options = {'xatol': 1e-12} # absolute tolerance on `x` >>> def optimizer(fun): ... return optimize.minimize_scalar(fun, bounds=(6, 7), ... method="bounded", options=options) >>> stats.boxcox_normmax(x, optimizer=optimizer) 6.000000000 rzVThe `x` argument of `boxcox_normmax` must contain only positive, finite, real numbers.zexceed specified `ymax`.i'z overflow in .z `ymax` must be strictly positive)grgc4tj||S)N)rr)rr)rrrs ra _optimizerz"boxcox_normmax.._optimizer@s>>$T? ?rbz`optimizer` must be a callablez,`brack` must be None if `optimizer` is givenc:fd}t|ddS)Nc|gSr{r|)rprrs ra func_wrappedz8boxcox_normmax.._optimizer..func_wrappedNsA~~%rbrp)r)rrrDr,s`` rarAz"boxcox_normmax.._optimizerMs &9\2C> >rbctt|}tjj |}d}|||fS)Nct||}tj|}tj||\}}d|z Sr)r:rr r$r)rrsampsr/rrrs ra_eval_pearsonrz9boxcox_normmax.._pearsonr.._eval_pearsonrVs< ue$AGGAJE((6GAtq5Lrbr)rrir.rkr)rprrrHrAs ra _pearsonrz!boxcox_normmax.._pearsonrRs?;CFC ""&&{3 .qz::rbc d}||fS)Nct|| Sr{r)rrYs ra _eval_mlez/boxcox_normmax.._mle.._eval_mlecssD)) )rbrr|)rprLrAs ra_mlezboxcox_normmax.._mlebs *)1$//rbcjtjdt}||d<||d<|S)Nrdrrr!)rrfloat)rpmaxlogrMrIs ra_allzboxcox_normmax.._allis2!5)aLq Gq  rb)rr*r-zMethod z not recognized.zsThe `optimizer` argument of `boxcox_normmax` must return an object containing the optimal `lmbda` in attribute `x`.r!zThe optimal lambda is z, but the returned lambda is the constrained optimum to ensure that the maximum or the minimum of the transformed data does not rd stacklevel)rr r-isfiniterV_BigFloat_singleton issubdtyperfloatingrfinfomaxcallablekeysisinfminrr:rrndarrayr.warningswarnr3sign)rprr+r,rmessageend_msgrrQmethods optimfuncrrrx_treme indicatormaskconstrained_resrMrArIs ` ` @@@rar;r;s\ 1 A 66"++a.AF+ ,:!!(G ""=="++>BJJxx""U* q) ;<< =E @  "=> >  KL L ? ; 0 %GW\\^#76(*:;<<I A,C {P!! 
XXd^VVAYq d 19G QYGtS1CtS8Q4RRI#rzz*%aL 'dTG7>>'3/047 66$<(.457>?  MM'a 00RS @T9TUO#rzz*+D  J& Jrbc|dk(r d}t}nd}t}tj|}|jdk(r|S||kr t d|dk(r#tj |dkr t dtj|||}|dz} t|D])\} } ||| } t| d d \} \} } }|| | <+|"|j|| d t|dd||| fS)zCompute parameters for a Box-Cox or Yeo-Johnson normality plot, optionally show it. See `boxcox_normplot` or `yeojohnson_normplot` for details. r:zBox-Cox Normality PlotzYeo-Johnson Normality Plotrz `lb` has to be larger than `la`.r)rr)rrkTrrpz $\lambda$rr) r:rKrr rrVr.rrr6rr)r+rplalbrrrtransform_funclmbdasrrryzrrs ra _normplotrps (,# 1 Avv{ Rx;<< bffQ!Vn122 [[RQ 'F C>> from scipy import stats >>> import matplotlib.pyplot as plt Generate some non-normally distributed data, and create a Box-Cox plot: >>> x = stats.loggamma.rvs(5, size=500) + 5 >>> fig = plt.figure() >>> ax = fig.add_subplot(111) >>> prob = stats.boxcox_normplot(x, -20, 20, plot=ax) Determine and plot the optimal ``lmbda`` to transform ``x`` and plot it in the same plot: >>> _, maxlog = stats.boxcox(x) >>> ax.axvline(maxlog, color='r') >>> plt.show() r:rprprkrlrrs rar<r<sB Xq"b$ 22rbctj|}|jdk(r|Stj|jtj r t dtj|jtjr!|jtjd}| t||St|}t||}||fS)a* Return a dataset transformed by a Yeo-Johnson power transformation. Parameters ---------- x : ndarray Input array. Should be 1-dimensional. lmbda : float, optional If ``lmbda`` is ``None``, find the lambda that maximizes the log-likelihood function and return it as the second output argument. Otherwise the transformation is done for the given value. Returns ------- yeojohnson: ndarray Yeo-Johnson power transformed array. maxlog : float, optional If the `lmbda` parameter is None, the second returned argument is the lambda that maximizes the log-likelihood function. See Also -------- probplot, yeojohnson_normplot, yeojohnson_normmax, yeojohnson_llf, boxcox Notes ----- The Yeo-Johnson transform is given by: .. math:: y = \begin{cases} \frac{(x + 1)^\lambda - 1}{\lambda}, &\text{for } x \geq 0, \lambda \neq 0 \\ \log(x + 1), &\text{for } x \geq 0, \lambda = 0 \\ -\frac{(-x + 1)^{2 - \lambda} - 1}{2 - \lambda}, &\text{for } x < 0, \lambda \neq 2 \\ -\log(-x + 1), &\text{for } x < 0, \lambda = 2 \end{cases} Unlike `boxcox`, `yeojohnson` does not require the input data to be positive. .. versionadded:: 1.2.0 References ---------- I. Yeo and R.A. Johnson, "A New Family of Power Transformations to Improve Normality or Symmetry", Biometrika 87.4 (2000): Examples -------- >>> from scipy import stats >>> import matplotlib.pyplot as plt We generate some random variates from a non-normal distribution and make a probability plot for it, to show it is non-normal in the tails: >>> fig = plt.figure() >>> ax1 = fig.add_subplot(211) >>> x = stats.loggamma.rvs(5, size=500) + 5 >>> prob = stats.probplot(x, dist=stats.norm, plot=ax1) >>> ax1.set_xlabel('') >>> ax1.set_title('Probplot against normal distribution') We now use `yeojohnson` to transform the data so it's closest to normal: >>> ax2 = fig.add_subplot(212) >>> xt, lmbda = stats.yeojohnson(x) >>> prob = stats.probplot(xt, dist=stats.norm, plot=ax2) >>> ax2.set_title('Probplot after Yeo-Johnson transformation') >>> plt.show() rz>Yeo-Johnson transformation is not defined for complex numbers.Fr) rr rrVrcomplexfloatingrVintegerrr_yeojohnson_transformrL)rprr#r/s rarKrK sd 1 Avv{ }}QWWb001,- - }}QWWbjj) HHRZZeH , $Q.. a Da&A d7Nrbctj|jtjr |jntj}tj ||}|dk\}t |tjdkrtj||||<n4tj|tj||z|z ||<t |dz tjdkDr@tjd|z tj|| z d|z z ||<|Stj||  ||<|S)zaReturns `x` transformed by the Yeo-Johnson power transform with given parameter `lmbda`. 
rrrrd) rrVrrWr zeros_likerspacinglog1pexpm1)rprroutposs rarwrwss}}QWWbkk:AGG E -- 'C q&C 5zBJJrN"88AcF#C88EBHHQsV$445=C 519~ 2&XXq5yBHHagX,>>??1u9MSD  JXXq#wh''SD Jrbc dtj|}|jd}|dk(rtjSt ||}|j d}tj |}|tj|jjk}tj||<| dz tj||z||<||xx|dz tj|tjtj|zjdzz cc<|S)ab The yeojohnson log-likelihood function. Parameters ---------- lmb : scalar Parameter for Yeo-Johnson transformation. See `yeojohnson` for details. data : array_like Data to calculate Yeo-Johnson log-likelihood for. If `data` is multi-dimensional, the log-likelihood is calculated along the first axis. Returns ------- llf : float Yeo-Johnson log-likelihood of `data` given `lmb`. See Also -------- yeojohnson, probplot, yeojohnson_normplot, yeojohnson_normmax Notes ----- The Yeo-Johnson log-likelihood function :math:`l` is defined here as .. math:: l = -\frac{N}{2} \log(\hat{\sigma}^2) + (\lambda - 1) \sum_i^N \text{sign}(x_i) \log(|x_i| + 1) where :math:`N` is the number of data points :math:`x`=``data`` and :math:`\hat{\sigma}^2` is the estimated variance of the Yeo-Johnson transformed input data :math:`x`. This corresponds to the *profile log-likelihood* of the original data :math:`x` with some constant terms dropped. .. versionadded:: 1.2.0 Examples -------- >>> import numpy as np >>> from scipy import stats >>> import matplotlib.pyplot as plt >>> from mpl_toolkits.axes_grid1.inset_locator import inset_axes Generate some random variates and calculate Yeo-Johnson log-likelihood values for them for a range of ``lmbda`` values: >>> x = stats.loggamma.rvs(5, loc=10, size=1000) >>> lmbdas = np.linspace(-2, 10) >>> llf = np.zeros(lmbdas.shape, dtype=float) >>> for ii, lmbda in enumerate(lmbdas): ... llf[ii] = stats.yeojohnson_llf(lmbda, x) Also find the optimal lmbda value with `yeojohnson`: >>> x_most_normal, lmbda_optimal = stats.yeojohnson(x) Plot the log-likelihood as function of lmbda. Add the optimal lmbda as a horizontal line to check that that's really the optimum: >>> fig = plt.figure() >>> ax = fig.add_subplot(111) >>> ax.plot(lmbdas, llf, 'b.-') >>> ax.axhline(stats.yeojohnson_llf(lmbda_optimal, x), color='r') >>> ax.set_xlabel('lmbda parameter') >>> ax.set_ylabel('Yeo-Johnson log-likelihood') Now add some probability plots to show that where the log-likelihood is maximized the data transformed with `yeojohnson` looks closest to normal: >>> locs = [3, 10, 4] # 'lower left', 'center', 'lower right' >>> for lmbda, loc in zip([-1, lmbda_optimal, 9], locs): ... xt = stats.yeojohnson(x, lmbda=lmbda) ... (osm, osr), (slope, intercept, r_sq) = stats.probplot(xt) ... ax_inset = inset_axes(ax, width="20%", height="20%", loc=loc) ... ax_inset.plot(osm, osr, 'c.', osm, slope*osm + intercept, 'k-') ... ax_inset.set_xticklabels([]) ... ax_inset.set_yticklabels([]) ... ax_inset.set_title(r'$\lambda=%1.2f$' % lmbda) >>> plt.show() rrrdr!)rr rrrwrjrrXrtinyinfrrar{rr)rrY n_samplestrans trans_varloglike tiny_variances rarJrJsj ::d D 1 IA~vv !$ ,E q !ImmI&G 9 > >>MVVGM  Q =. 9:: ]N ]N qRWWT]RXXbffTl%;;@@a@HHJ Nrbc d}tjd5tjtj|s t dtj|dk(r dddy|"t j |||fcdddStj|}tj|jtjr |jntj}tjd tjtj|z}tjtj |j"}tjtj |j$|z d z }tjtj |j|zd z }||z }||z } tj|dkr d | z d |z } }n6tj&|dkrtd | z |t)d |z | } }d } t j*||| |f| cdddS#1swYyxYw) aCompute optimal Yeo-Johnson transform parameter. Compute optimal Yeo-Johnson transform parameter for input data, using maximum likelihood estimation. Parameters ---------- x : array_like Input array. brack : 2-tuple, optional The starting interval for a downhill bracket search with `optimize.brent`. 
Note that this is in most cases not critical; the final result is allowed to be outside this bracket. If None, `optimize.fminbound` is used with bounds that avoid overflow. Returns ------- maxlog : float The optimal transform parameter found. See Also -------- yeojohnson, yeojohnson_llf, yeojohnson_normplot Notes ----- .. versionadded:: 1.2.0 Examples -------- >>> import numpy as np >>> from scipy import stats >>> import matplotlib.pyplot as plt Generate some data and determine optimal ``lmbda`` >>> rng = np.random.default_rng() >>> x = stats.loggamma.rvs(5, size=30, random_state=rng) + 5 >>> lmax = stats.yeojohnson_normmax(x) >>> fig = plt.figure() >>> ax = fig.add_subplot(111) >>> prob = stats.yeojohnson_normplot(x, -10, 10, plot=ax) >>> ax.axvline(lmax, color='r') >>> plt.show() cnt||}tj |tj|<| Sr{)rJrrr\)rrYllfs ra_neg_llfz$yeojohnson_normmax.._neg_llf&s/UD)!ffWBHHSMt rbignore)invalidz!Yeo-Johnson input must be finite.rNrrrdg`sbO>rxtol)rerrstater-rTrVrrr rVrrWrr{rYrrrXepsrr.r] fminbound) rprrr log1p_max_xlog_epslog_tiny_float log_max_floatrlub tol_brents rarLrLsb X &Ovvbkk!n%@A A 66!q&> OO  >>(%qdC OO JJqM=="++>BJJhhrBFF266!9$556 &&%,,-&&%!5!56@AE 3 34w>!C k ) [ ( 66!a%=VQVB VVAE]R_c!b&"oB !!(B!IN=OOOsA I0IF>II%c"td|||||S)aCompute parameters for a Yeo-Johnson normality plot, optionally show it. A Yeo-Johnson normality plot shows graphically what the best transformation parameter is to use in `yeojohnson` to obtain a distribution that is close to normal. Parameters ---------- x : array_like Input array. la, lb : scalar The lower and upper bounds for the ``lmbda`` values to pass to `yeojohnson` for Yeo-Johnson transformations. These are also the limits of the horizontal axis of the plot if that is generated. plot : object, optional If given, plots the quantiles and least squares fit. `plot` is an object that has to have methods "plot" and "text". The `matplotlib.pyplot` module or a Matplotlib Axes object can be used, or a custom object with the same methods. Default is None, which means that no plot is created. N : int, optional Number of points on the horizontal axis (equally distributed from `la` to `lb`). Returns ------- lmbdas : ndarray The ``lmbda`` values for which a Yeo-Johnson transform was done. ppcc : ndarray Probability Plot Correlation Coefficient, as obtained from `probplot` when fitting the Box-Cox transformed input `x` against a normal distribution. See Also -------- probplot, yeojohnson, yeojohnson_normmax, yeojohnson_llf, ppcc_max Notes ----- Even if `plot` is given, the figure is not shown or saved by `boxcox_normplot`; ``plt.show()`` or ``plt.savefig('figname.png')`` should be used after calling `probplot`. .. versionadded:: 1.2.0 Examples -------- >>> from scipy import stats >>> import matplotlib.pyplot as plt Generate some non-normally distributed data, and create a Yeo-Johnson plot: >>> x = stats.loggamma.rvs(5, size=500) + 5 >>> fig = plt.figure() >>> ax = fig.add_subplot(111) >>> prob = stats.yeojohnson_normplot(x, -20, 20, plot=ax) Determine and plot the optimal ``lmbda`` to transform ``x`` and plot it in the same plot: >>> _, maxlog = stats.yeojohnson(x) >>> ax.axvline(maxlog, color='r') >>> plt.show() rKrrrss rarMrMNsF \1b"dA 66rb ShapiroResult)rQpvalue)r too_smallrctj|jtj}t |}|dkr t dt |dztj}d}t|}|||dzz}t|||\}}}|dvrtjdd|d kDrtjd |d dttj|tj|S) a Perform the Shapiro-Wilk test for normality. The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution. Parameters ---------- x : array_like Array of sample data. 
Must contain at least three observations. Returns ------- statistic : float The test statistic. p-value : float The p-value for the hypothesis test. See Also -------- anderson : The Anderson-Darling test for normality kstest : The Kolmogorov-Smirnov test for goodness of fit. :ref:`hypothesis_shapiro` : Extended example Notes ----- The algorithm used is described in [4]_ but censoring parameters as described are not implemented. For N > 5000 the W test statistic is accurate, but the p-value may not be. References ---------- .. [1] https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm :doi:`10.18434/M32189` .. [2] Shapiro, S. S. & Wilk, M.B, "An analysis of variance test for normality (complete samples)", Biometrika, 1965, Vol. 52, pp. 591-611, :doi:`10.2307/2333709` .. [3] Razali, N. M. & Wah, Y. B., "Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests", Journal of Statistical Modeling and Analytics, 2011, Vol. 2, pp. 21-33. .. [4] Royston P., "Remark AS R94: A Remark on Algorithm AS 181: The W-test for Normality", 1995, Applied Statistics, Vol. 44, :doi:`10.2307/2986146` Examples -------- >>> import numpy as np >>> from scipy import stats >>> rng = np.random.default_rng() >>> x = stats.norm.rvs(loc=5, scale=3, size=100, random_state=rng) >>> shapiro_test = stats.shapiro(x) >>> shapiro_test ShapiroResult(statistic=0.9813305735588074, pvalue=0.16855233907699585) >>> shapiro_test.statistic 0.9813305735588074 >>> shapiro_test.pvalue 0.16855233907699585 For a more detailed example, see :ref:`hypothesis_shapiro`. rzData must be at least length 3.rdrr)rrdzPscipy.stats.shapiro: Input data has range zero. The results may not be accurate.rRizVscipy.stats.shapiro: For N > 5000, computed p-value may not be accurate. Current N is r?) rrrrrirVr r r#r_r`r)rprrinitr/wpwifaults rar=r=s|  2::&A AA1u:;; ad"**%A D QA1a4LA!Q%MAr6 V 6BC E4x ;;<#Q@!" $ A 2 77rb)g;On?gˡE?gv/?gK7A`?gFx?)g/$?gsh|??g~jt?gV-?gZd;O?)gtV?gMb?MbX9?gMb?gS㥛?)g$C?jt?gQ?gS㥛?gˡE?g)\(?)g㥛 ?gHzG?gS?gNbX9?gX9v?獗n?gn?gn?)gzG?gK7?g/$?gw/?gV-?g5^I ?g ףp= ?g&1?)gOn?gn?gX9v?gJ +?gx&1?gK?g1Zd?gI +?)g$C?g&1?gx?gZd;O?g{Gz?gV-?g+?g5^I ?)gQ?g"~?g\(\?g rh?g?gx&1?gRQ?gZd;O?)g-?gl?gZd;?gS?gv/?g{Gz?gw/?g&1?)gjt?g~jt?gK7A?g= ףp=?goʡ?g/$?gK7?g{Gz?)g{Gz?gx&1?gS㥛?g-?g/$?gDl?gM?gx?)g!rh?gy&1?g/$?gA`"?rg|?5^?g^I +?gCl?)gK7A`?gjt?g/$?gGz?gCl?333333?gjt?g?)gS?gh|?5?rg'1Z?rgT㥛 ?g㥛 ?gy&1?r linearr)kind bounds_error fill_valuecH t |\}}} fd fd fd} fd}d}tj|tjs|dkrd|z}t | tj dd 5t j||dd } dddd  jd |z}| js t | | j\}}|||}|||fS#1swYQxYw#ttf$r} d |z}t || d} ~ wwxYw)Nc|z }d|z ||ztj|zj||zjz z tj|jz zSr)rrrr[uxurqrps radnllf_dmz$_weibull_fit_check..dnllf_dmsb qS!r1uRVVBZ',,.A{{}<<&&*.."1$% &rbc|z }|dz |z |dzjz||dz zjz||zjz z S)Nr!rrrs radnllf_duz$_weibull_fit_check..dnllf_du!sQ qS!QwB||~%2!9//*;(;RUKKM(IIIrbcB|z |zz jd|z zSrr)r[rrqrps ra get_scalez%_weibull_fit_check..get_scale&s)1q !AaC((rbc||gSr{r|)paramsrrs radnllfz!_weibull_fit_check..dnllf+s&!8V#455rbzMaximum likelihood estimation is known to be challenging for the three-parameter Weibull distribution. Consider performing a custom goodness-of-fit test using `scipy.stats.monte_carlo_test`.r!zMaximum likelihood estimation has converged to a solution in which the location is equal to the minimum of the data, the shape parameter is less than 2, or both. The table of critical values in [7] does not include this case. raise)overrrz/Solution of MLE first-order conditions failed: z. `anderson` cannot continue. 
zeAn error occurred while fitting the Weibull distribution to the data, so `anderson` cannot continue. ) rirallcloser]rVrrrootrbsuccessFloatingPointErrorrp)rrpr[rr]rr suggestionrbrrrrrqs ` @@@ra_weibull_fit_checkrsD AAGAq!& J ) 6 4J  {{1bffQi AE),6 6 !!)[[gw 7 4--vcr{3C 4Ekk]"@BDNO{{W% % 55DAq!QA a7N 4 4  +)BDNO!q()s0/C<C0 2C<0C95C<<D! DD!AndersonResult)rQcritical_valuessignificance_level fit_resultc  |j}|dvrd}hd}||vrtd|dt|}tj|d}t |}|dk(rtj |d d }||z |z }||f}tjj|} tjj|} tgd } ttd d |z zd|z |z z z d} nJ|dk(ro||z }d|f}tjj|} tjj|} tgd } ttd d|z zz d} n|dk(rd} t|tj |d d g}t!j"| |||fd}||dz |d z }|}tj$j|} tj$j|} tgd} tt&d d|z zz d} n|dk(rtj(j+|\}}||z |z }||f}tj(j|} tj(j|} tgd} tt,d dt/|z zz d} nx|dk(rtj0j+|\}}||z |z }||f}tj0j|} tj0j|} tgd} tt,d dt/|z zz d} n|dk(rd}|dkrt3j4|dtj6j+|\}}}t9|||f|\}}}|||f}t;j6|j|} t;j6|j|} d |z }tgd } t=|} tj>| d!zd"} tAd |d z}| tjBd|zd z |z   d#d#d$zzdz }d%}t!jDd&|'}tj|_#tItKt||d(|)}tM|  |*S)+aAnderson-Darling test for data coming from a particular distribution. The Anderson-Darling test tests the null hypothesis that a sample is drawn from a population that follows a particular distribution. For the Anderson-Darling test, the critical values depend on which distribution is being tested against. This function works for normal, exponential, logistic, weibull_min, or Gumbel (Extreme Value Type I) distributions. Parameters ---------- x : array_like Array of sample data. dist : {'norm', 'expon', 'logistic', 'gumbel', 'gumbel_l', 'gumbel_r', 'extreme1', 'weibull_min'}, optional The type of distribution to test against. The default is 'norm'. The names 'extreme1', 'gumbel_l' and 'gumbel' are synonyms for the same distribution. Returns ------- result : AndersonResult An object with the following attributes: statistic : float The Anderson-Darling test statistic. critical_values : list The critical values for this distribution. significance_level : list The significance levels for the corresponding critical values in percents. The function returns critical values for a differing set of significance levels depending on the distribution that is being tested against. fit_result : `~scipy.stats._result_classes.FitResult` An object containing the results of fitting the distribution to the data. See Also -------- kstest : The Kolmogorov-Smirnov test for goodness-of-fit. Notes ----- Critical values provided are for the following significance levels: normal/exponential 15%, 10%, 5%, 2.5%, 1% logistic 25%, 10%, 5%, 2.5%, 1%, 0.5% gumbel_l / gumbel_r 25%, 10%, 5%, 2.5%, 1% weibull_min 50%, 25%, 15%, 10%, 5%, 2.5%, 1%, 0.5% If the returned statistic is larger than these critical values then for the corresponding significance level, the null hypothesis that the data come from the chosen distribution can be rejected. The returned statistic is referred to as 'A2' in the references. For `weibull_min`, maximum likelihood estimation is known to be challenging. If the test returns successfully, then the first order conditions for a maximum likelihood estimate have been verified and the critical values correspond relatively well to the significance levels, provided that the sample is sufficiently large (>10 observations [7]). However, for some data - especially data with no left tail - `anderson` is likely to result in an error message. In this case, consider performing a custom goodness of fit test using `scipy.stats.monte_carlo_test`. References ---------- .. [1] https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm .. [2] Stephens, M. A. (1974). 
EDF Statistics for Goodness of Fit and Some Comparisons, Journal of the American Statistical Association, Vol. 69, pp. 730-737. .. [3] Stephens, M. A. (1976). Asymptotic Results for Goodness-of-Fit Statistics with Unknown Parameters, Annals of Statistics, Vol. 4, pp. 357-369. .. [4] Stephens, M. A. (1977). Goodness of Fit for the Extreme Value Distribution, Biometrika, Vol. 64, pp. 583-588. .. [5] Stephens, M. A. (1977). Goodness of Fit with Special Reference to Tests for Exponentiality , Technical Report No. 262, Department of Statistics, Stanford University, Stanford, CA. .. [6] Stephens, M. A. (1979). Tests of Fit for the Logistic Distribution Based on the Empirical Distribution Function, Biometrika, Vol. 66, pp. 591-595. .. [7] Richard A. Lockhart and Michael A. Stephens "Estimation and Tests of Fit for the Three-Parameter Weibull Distribution" Journal of the Royal Statistical Society.Series B(Methodological) Vol. 56, No. 3 (1994), pp. 491-500, Table 0. Examples -------- Test the null hypothesis that a random sample was drawn from a normal distribution (with unspecified mean and standard deviation). >>> import numpy as np >>> from scipy.stats import anderson >>> rng = np.random.default_rng() >>> data = rng.random(size=35) >>> res = anderson(data) >>> res.statistic 0.8398018749744764 >>> res.critical_values array([0.527, 0.6 , 0.719, 0.839, 0.998]) >>> res.significance_level array([15. , 10. , 5. , 2.5, 1. ]) The value of the statistic (barely) exceeds the critical value associated with a significance level of 2.5%, so the null hypothesis may be rejected at a significance level of 2.5%, but not at a significance level of 1%. >gumbelextreme1gumbel_l>rkexponrgumbel_rlogistic weibull_minz&Invalid distribution; dist must be in r?rrrkr!)ddofr) @r!r@g9@rrg333333?rc|\}}||z |z }t|}tjdd|zz dd|zz tj|d|z zd|zz d|zg}t|S)Nrr!rrr)rrrr)abxjrrrtmptmp2rys rarzanderson..rootfuncsvDAq6Q,Cs8D66#qv,Q/#a%766#s4x.!D&1:Q>@C: rbgh㈵>r)rrrr!r?r)rrrrr!g?rzCritical values of the test statistic are given for the asymptotic distribution. These may not be accurate for samples with fewer than 10 observations. Consider using `scipy.stats.monte_carlo_test`.rrdrR)rg?r?gffffff?g333333?gGz?gףp= ?gMb@?)decimalsNrz9`anderson` successfully fit the distribution to the data.T)rrbF)discreter)r)'lowerrVr rrWristdr.rklogcdflogsfrr _Avals_normr _Avals_exponrfsolver_Avals_logisticrr _Avals_gumbelrrr_r`rrr_get_As_weibullroundr rOptimizeResultrpr&rr)rprdistsr/rrrr]r fit_paramsrrsigcriticalrsol0solrbr[rerfcrA2rrs rar>r>Ysb ::  ((,,Q/a XN1W ''..q1&&,,Q/'(-3T!W+<=qA  ((,,Q/a XN1W ''..q1&&,,Q/'(-3T!W+<=qA  5 r6 MM'a 0%1155a8 3*AsE?A> 3U] ""J/66q9!!:.44Q7 EDE"1%88Hv-:q!a%A bffacCi1_tt(<=AF FBJG  ! !$ @C HHZ CE7=$7$)s4J "h CCrbcd}|j|d}||jk(rd}n|j|d|z }||dz z} td|D]} tj|| } | j|d} | j t } | | j|dz }| |dz z} |t |z || z| || zz dzz| || z z||zd z z z }||j|| z z }||dz |z z}|S) aCompute A2akN equation 7 of Scholz and Stephens. Parameters ---------- samples : sequence of 1-D array_like Array of sample arrays. Z : array_like Sorted array of all observations. Zstar : array_like Sorted array of unique observations. k : int Number of samples. n : array_like Number of observations in each sample. N : int Total number of observations. Returns ------- A2aKN : float The A2aKN statistics of Scholz and Stephens 1987. 
rleftrrightrgrsiderdr) searchsortedrr rr rrOr)samplesZZstarrrqrA2akNZ_ssorted_leftljBjrr]s_ssorted_rightMijfijinners ra_anderson_ksamp_midrankr& s*0 E^^E62NEJJ  ^^E7 +n < "r' !B Aq\$ GGGAJ ..W.=$$U+uf == sRxU1X 3AaD1 44AF ad2g8MN qt##$ a"f\E Lrbcd}|j|ddd|j|dddz }|j}td|D]r} tj|| } | j|ddd} |t |z || z||| zz dzz|||z zz } || j || z z }t|S) aCompute A2akN equation 6 of Scholz & Stephens. Parameters ---------- samples : sequence of 1-D array_like Array of sample arrays. Z : array_like Sorted array of all observations. Zstar : array_like Sorted array of unique observations. k : int Number of samples. n : array_like Number of observations in each sample. N : int Total number of observations. Returns ------- A2KN : float The A2KN statistics of Scholz and Stephens 1987. rNrrrrrrd)rcumsumr rr rOr) rrrrrqrA2kNrrrr]rrs ra_anderson_ksamp_rightrQ s0 D cr G ,q~~eCRj>D0F FB B Aq\# GGGAJ nnU3BZgn6U1X S2!9!4q 88B!b&MJ  ad"" # KrbAnderson_ksampResult)rQrr)r+ct|dkr tdtttj |}t j t j|jt jjdkr tdt j|Dcgc]}|jc}t jdk(r td|rtnt|}fd}|)tj||fi|j!ddi}d z j#}d t%d z d d z j'}|d d z} |t%dz j#} d | zd z d z zdd | zz |zz} d| zd z dzzd| zzzd| zd| zz d z |zzd| zz d | zzd z } d | zd| zzdz dzzd | zd | zz d zzzd| zd z |zzd | zz} d| zd zdzzd | zzz }| dzz| dzzz| zz|zd z dz zdz zz }d z }||z t)j*|z }t jgd}t jgd}t jgd}||t)j*|z z||z z}t jgd}||j-kr0|.|j/}d|d}t1j2|dn||j/kDr0|.|j-}d|d}t1j2|dn\|Jt j4|t7|d}t)j8t j:||}n| j<n}t?|||}||_ |Scc}w)agThe Anderson-Darling test for k-samples. The k-sample Anderson-Darling test is a modification of the one-sample Anderson-Darling test. It tests the null hypothesis that k-samples are drawn from the same population without having to specify the distribution function of that population. The critical values depend on the number of samples. Parameters ---------- samples : sequence of 1-D array_like Array of sample data in arrays. midrank : bool, optional Type of Anderson-Darling test which is computed. Default (True) is the midrank test applicable to continuous and discrete populations. If False, the right side empirical distribution is used. method : PermutationMethod, optional Defines the method used to compute the p-value. If `method` is an instance of `PermutationMethod`, the p-value is computed using `scipy.stats.permutation_test` with the provided configuration options and other appropriate settings. Otherwise, the p-value is interpolated from tabulated values. Returns ------- res : Anderson_ksampResult An object containing attributes: statistic : float Normalized k-sample Anderson-Darling test statistic. critical_values : array The critical values for significance levels 25%, 10%, 5%, 2.5%, 1%, 0.5%, 0.1%. pvalue : float The approximate p-value of the test. If `method` is not provided, the value is floored / capped at 0.1% / 25%. Raises ------ ValueError If fewer than 2 samples are provided, a sample is empty, or no distinct observations are in the samples. See Also -------- ks_2samp : 2 sample Kolmogorov-Smirnov test anderson : 1 sample Anderson-Darling test Notes ----- [1]_ defines three versions of the k-sample Anderson-Darling test: one for continuous distributions and two for discrete distributions, in which ties between samples may occur. The default of this routine is to compute the version based on the midrank empirical distribution function. This test is applicable to continuous and discrete data. If midrank is set to False, the right side empirical distribution is used for a test for discrete data. 
According to [1]_, the two discrete test statistics differ only slightly if a few collisions due to round-off errors occur in the test not adjusted for ties between samples. The critical values corresponding to the significance levels from 0.01 to 0.25 are taken from [1]_. p-values are floored / capped at 0.1% / 25%. Since the range of critical values might be extended in future releases, it is recommended not to test ``p == 0.25``, but rather ``p >= 0.25`` (analogously for the lower bound). .. versionadded:: 0.14.0 References ---------- .. [1] Scholz, F. W and Stephens, M. A. (1987), K-Sample Anderson-Darling Tests, Journal of the American Statistical Association, Vol. 82, pp. 918-924. Examples -------- >>> import numpy as np >>> from scipy import stats >>> rng = np.random.default_rng() >>> res = stats.anderson_ksamp([rng.normal(size=50), ... rng.normal(loc=0.5, size=30)]) >>> res.statistic, res.pvalue (1.974403288713695, 0.04991293614572478) >>> res.critical_values array([0.325, 1.226, 1.961, 2.718, 3.752, 4.592, 6.546]) The null hypothesis that the two random samples come from the same distribution can be rejected at the 5% level because the returned test value is greater than the critical value for 5% (1.961) but not at the 2.5% level. The interpolation gives an approximate p-value of 4.99%. >>> samples = [rng.normal(size=50), rng.normal(size=30), ... rng.normal(size=20)] >>> res = stats.anderson_ksamp(samples) >>> res.statistic, res.pvalue (-0.29103725200789504, 0.25) >>> res.critical_values array([ 0.44925884, 1.3052767 , 1.9434184 , 2.57696569, 3.41634856, 4.07210043, 5.56419101]) The null hypothesis cannot be rejected for three samples from an identical distribution. The reported p-value (25%) has been capped and may not be very accurate (since it corresponds to the value 0.449 whereas the statistic is -0.291). In such cases where the p-value is capped or when sample sizes are small, a permutation test may be more accurate. >>> method = stats.PermutationMethod(n_resamples=9999, random_state=rng) >>> res = stats.anderson_ksamp(samples, method=method) >>> res.pvalue 0.5254 rdz)anderson_ksamp needs at least two samplesz7anderson_ksamp needs more than one distinct observationrz6anderson_ksamp encountered sample without observationsc|Sr{r|)rA2kN_funrrrrrqs rarQz!anderson_ksamp..statistic sE1a33rb alternativegreaterrr!rrrrrgr)g?g"~?gRQ?g\(\?gS㥛@g/$@gGz@)g\(\ϿrgV-?gMb?gx&?gx@gQ @)gzGếgQӿg^I +׿g/$ٿgMbXٿgGzֿgʡEÿ)rr皙?g?rg{Gzt?gMbP?z'p-value capped: true value larger than zI. Consider specifying `method` (e.g. `method=stats.PermutationMethod()`.)rRz)p-value floored: true value smaller than )!rirVlistmaprr r hstackrrrr.rrrpermutation_test_asdictrr rrlrr]rYr_r`polyfitrrpolyvalrrr) rmidrankr+samplerrQrHhs_cshgrrrdsigmasqr[rb0b1b2rrprpfr rrrrrqs @@@@@@rarIrI{ s l G A ADEE3rzz7+,G  '"#A A IIaLE zzA~'( ( G4&&++45A vva1f~() )*( GQq!Q /D44$$Wi<6>>;K<1:< a A &Q2& & . . 0E b A A 1 ""$A 1qQUrAaCxl*A 1q!Q$1Q!A#1*q.!!33ac9AaC?!CA 1qsQ1!ac A q00AaC!GQ;>1DA 1q!Q$1QAAv!Q$1$q(a"fR-@AF-KLG AA (dii( (B B CB C DB J KBB1%%Q.H ((? @C HLLNv~ GGI8<<<  ca( hlln  GGI:1#><<  ca(  ZZ#c(A . HHRZZB' ( ,CJJ! 
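# --- Illustrative sketch (not part of the library code) ---------------------
# When no `PermutationMethod` is supplied, `anderson_ksamp` reports a p-value
# interpolated from the tabulated critical values and capped/floored at
# 25% / 0.1%, as the docstring above explains.  The sketch below reproduces
# that kind of interpolation from the public result attributes; it is an
# approximation of the internal procedure, not a byte-for-byte copy of it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(12345)
res = stats.anderson_ksamp([rng.normal(size=50), rng.normal(loc=0.5, size=30)])

sig = np.array([0.25, 0.10, 0.05, 0.025, 0.01, 0.005, 0.001])   # tabulated levels
coef = np.polyfit(res.critical_values, np.log(sig), 2)          # quadratic in the statistic
p_interp = np.exp(np.polyval(coef, res.statistic))
p_interp = float(np.clip(p_interp, 0.001, 0.25))                # cap/floor as documented

print(p_interp, res.pvalue)   # comparable when the p-value is not capped or floored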
r8Q /CC J}5s;P AnsariResultc.eZdZdZdZdZdZdZdZy)_ABWzEDistribution of Ansari-Bradley W-statistic under the null hypothesis.cJd|_d|_d|_d|_d|_y)zMinimal initializer.N)r[rqastarttotalfreqsr8s ra__init__z _ABW.__init__G s%   rbc||jk7s||jk7rj||c|_|_t||\}}}||_|j t j |_|jj|_ yy)z/When necessary, recalculate exact distribution.N) rqr[r"r)rrrr+rr*)r9rqr[r)a1rs ra_recalcz _ABW._recalcO sm ;!tvv+NDFDF#1aLMFB DK2::.DJ)DJ&rbc|j||tj||jz j t }|j ||jz S)zProbability mass function.)r/rfloorr)rrr+r*r9rrqr[inds rapmfz_ABW.pmf^ sJ Qhhq4;;'..s3zz#++rbc|j||tj||jz j t }|j d|dzj|jz S)z!Cumulative distribution function.Nr!) r/rceilr)rrr+rr*r2s racdfz_ABW.cdff sZ Qgga$++o&--c2zz&3q5!%%'$**44rbc|j||tj||jz j t }|j |dj|jz S)zSurvival function.N) r/rr1r)rrr+rr*r2s rasfz_ABW.sfn sV Qhhq4;;'..s3zz#$##% 22rbN) r;r<r=__doc__r,r/r4r7r9r|rbrar'r'A sO  *,53rbr')rc |dvr tdttdstt_t |t |}}t |}t |}|dkr td|dkr td||z}t||f}tj|}tt|||z dzfd}tj|d|d } t|} t | t |k7} |d kxr |d kxr| } | r!|d ks|d krtj d d | r|dk(rXdtj"tjj%| ||tjj'| ||z} nH|dk(r"tjj%| ||} n!tjj'| ||} t)| t+d| S|d zr/||dzd zzdz |z }||z|dzzd|d zzzd|d zzz }n%||dzzdz }||z|d zz|dz zdz |dz z }| ritj|d zd }|d zr'||zd|z|z|dzdzz zd|d zz|dz zz }n#||zd|z||d zd zzz zd|z|dz zz }|| z t-|z }t/|t1|t}t)| d|dS)aPerform the Ansari-Bradley test for equal scale parameters. The Ansari-Bradley test ([1]_, [2]_) is a non-parametric test for the equality of the scale parameter of the distributions from which two samples were drawn. The null hypothesis states that the ratio of the scale of the distribution underlying `x` to the scale of the distribution underlying `y` is 1. Parameters ---------- x, y : array_like Arrays of sample data. alternative : {'two-sided', 'less', 'greater'}, optional Defines the alternative hypothesis. Default is 'two-sided'. The following options are available: * 'two-sided': the ratio of scales is not equal to 1. * 'less': the ratio of scales is less than 1. * 'greater': the ratio of scales is greater than 1. .. versionadded:: 1.7.0 Returns ------- statistic : float The Ansari-Bradley test statistic. pvalue : float The p-value of the hypothesis test. See Also -------- fligner : A non-parametric test for the equality of k variances mood : A non-parametric test for the equality of two scale parameters Notes ----- The p-value given is exact when the sample sizes are both less than 55 and there are no ties, otherwise a normal approximation for the p-value is used. References ---------- .. [1] Ansari, A. R. and Bradley, R. A. (1960) Rank-sum tests for dispersions, Annals of Mathematical Statistics, 31, 1174-1189. .. [2] Sprent, Peter and N.C. Smeeton. Applied nonparametric statistical methods. 3rd ed. Chapman and Hall/CRC. 2001. Section 5.8.2. .. [3] Nathaniel E. Helwig "Nonparametric Dispersion and Equality Tests" at http://users.stat.umn.edu/~helwig/notes/npde-Notes.pdf Examples -------- >>> import numpy as np >>> from scipy.stats import ansari >>> rng = np.random.default_rng() For these examples, we'll create three random data sets. The first two, with sizes 35 and 25, are drawn from a normal distribution with mean 0 and standard deviation 2. The third data set has size 25 and is drawn from a normal distribution with standard deviation 1.25. >>> x1 = rng.normal(loc=0, scale=2, size=35) >>> x2 = rng.normal(loc=0, scale=2, size=25) >>> x3 = rng.normal(loc=0, scale=1.25, size=25) First we apply `ansari` to `x1` and `x2`. 
These samples are drawn from the same distribution, so we expect the Ansari-Bradley test should not lead us to conclude that the scales of the distributions are different. >>> ansari(x1, x2) AnsariResult(statistic=541.0, pvalue=0.9762532927399098) With a p-value close to 1, we cannot conclude that there is a significant difference in the scales (as expected). Now apply the test to `x1` and `x3`: >>> ansari(x1, x3) AnsariResult(statistic=425.0, pvalue=0.0003087020407974518) The probability of observing such an extreme value of the statistic under the null hypothesis of equal scales is only 0.03087%. We take this as evidence against the null hypothesis in favor of the alternative: the scales of the distributions from which the samples were drawn are not equal. We can use the `alternative` parameter to perform a one-tailed test. In the above example, the scale of `x1` is greater than `x3` and so the ratio of scales of `x1` and `x3` is greater than 1. This means that the p-value when ``alternative='greater'`` should be near 0 and hence we should be able to reject the null hypothesis: >>> ansari(x1, x3, alternative='greater') AnsariResult(statistic=425.0, pvalue=0.0001543510203987259) As we can see, the p-value is indeed quite low. Use of ``alternative='less'`` should thus yield a large p-value: >>> ansari(x1, x3, alternative='less') AnsariResult(statistic=425.0, pvalue=0.9998643258449039) >lessr  two-sidedz8'alternative' must be 'two-sided', 'greater', or 'less'.rr!zNot enough other observations.zNot enough test observations.rNr7z%Ties preclude use of exact statistic.rdrRr=rgr rrrgH@0rg0@rr|)rVr _abw_stater'rr rirr$rankdatar rrrrr_r`minimumr7r9r%r]rr(r*)rpr/r rqr[rxyranksymrankABuxyrepeatsexactpvalmnABvarABrxrors rar?r?} sR::23 3 :s #v 1:wqzqA AA AA1u9::1u899 AA AqDB   b !D5$D1 -.2G  ! $B *C3x3r7"G"f 21r6 27{EAFa"f =!L + %JLL$4$4RA$>$.LLOOB1$=??D I %<<##B1-D<>> import numpy as np >>> from scipy import stats >>> a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08, 9.01, 8.85, 9.06, 8.99] >>> b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58, 8.36, 9.18, 8.67, 9.05] >>> c = [8.95, 9.12, 8.95, 8.85, 9.03, 8.84, 9.07, 8.98, 8.86, 8.98] >>> stat, p = stats.bartlett(a, b, c) >>> p 1.1254782518834628e-05 The very small p-value suggests that the populations do not have equal variances. This is not surprising, given that the sample variance of `b` is much larger than that of `a` and `c`: >>> [np.var(x, ddof=1) for x in [a, b, c]] [0.007054444444444413, 0.13073888888888888, 0.008890000000000002] For a more detailed example, see :ref:`hypothesis_bartlett`. rd-Must enter at least two input sample vectors.)rrrrrNr!) correctionr.r)rrrr Fr  symmetricrr)r]rYr|)rrirVr1moveaxisr rrrrjnewaxisconcatrrr+r(cliprrrN)rrrrrNirssqarrrNtotspsqnumerdenomTr rs rar@r@% sV ' "B G A1uHIIdr:G;BCr{{64,CGCIP Qv"**V\\"%V\\* : QB Q=? @"//!WQZ--cr2 3 @B @?F GV266&QR6 0 GC G*, -3#bjj#o  -B -+. /C3rzz3  /C / 2A B ))Ca) C HHE 66"16 D 66263,Qe6 4q ADQh"&&, &vvrAvrvvc{*%v@AE Aq1uI26 +q$(|;==E  A rzz!A# 'D Di5R PF rrvv&A1"!A!;;!+VBZF !V $$3D Q @ G - /s#I41I9 )I><JJ;J  LeveneResultmedianr)centerproportiontocutc|dvr tdt|}|dkr tdtj|}tj|d}|dk(rd}n |dk(rd }nt fd |D}d }t |D]!}t||||<|||||<#tj |d }dg|z} t |D]"} tt|| || z | | <$tj|d} d} t |D]-} tj| | d | | <| | | || zz } /| |z} ||z tj || | z dzzd z} d}t |D](} |tj | | | | z dzd z }*|dz |z}| |z }tjj||dz ||z }t||S)a Perform Levene test for equal variances. The Levene test tests the null hypothesis that all input samples are from populations with equal variances. 
Levene's test is an alternative to Bartlett's test `bartlett` in the case where there are significant deviations from normality. Parameters ---------- sample1, sample2, ... : array_like The sample data, possibly with different lengths. Only one-dimensional samples are accepted. center : {'mean', 'median', 'trimmed'}, optional Which function of the data to use in the test. The default is 'median'. proportiontocut : float, optional When `center` is 'trimmed', this gives the proportion of data points to cut from each end. (See `scipy.stats.trim_mean`.) Default is 0.05. Returns ------- statistic : float The test statistic. pvalue : float The p-value for the test. See Also -------- fligner : A non-parametric test for the equality of k variances bartlett : A parametric test for equality of k variances in normal samples :ref:`hypothesis_levene` : Extended example Notes ----- Three variations of Levene's test are possible. The possibilities and their recommended usages are: * 'median' : Recommended for skewed (non-normal) distributions> * 'mean' : Recommended for symmetric, moderate-tailed distributions. * 'trimmed' : Recommended for heavy-tailed distributions. The test version using the mean was proposed in the original article of Levene ([2]_) while the median and trimmed mean have been studied by Brown and Forsythe ([3]_), sometimes also referred to as Brown-Forsythe test. References ---------- .. [1] https://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm .. [2] Levene, H. (1960). In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, I. Olkin et al. eds., Stanford University Press, pp. 278-292. .. [3] Brown, M. B. and Forsythe, A. B. (1974), Journal of the American Statistical Association, 69, 364-367 Examples -------- Test whether the lists `a`, `b` and `c` come from populations with equal variances. >>> import numpy as np >>> from scipy import stats >>> a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08, 9.01, 8.85, 9.06, 8.99] >>> b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58, 8.36, 9.18, 8.67, 9.05] >>> c = [8.95, 9.12, 8.95, 8.85, 9.03, 8.84, 9.07, 8.98, 8.86, 8.98] >>> stat, p = stats.levene(a, b, c) >>> p 0.002431505967249681 The small p-value suggests that the populations do not have equal variances. This is not surprising, given that the sample variance of `b` is much larger than that of `a` and `c`: >>> [np.var(x, ddof=1) for x in [a, b, c]] [0.007054444444444413, 0.13073888888888888, 0.008890000000000002] For a more detailed example, see :ref:`hypothesis_levene`. rWratrimmed-center must be 'mean', 'median' or 'trimmed'.rdrPrrac0tj|dSNrrrrar}s rarzlevene..func 99QQ' 'rbrWc0tj|dSrirrWr}s rarzlevene..func 7711% %rbc3pK|]-}tjtj|/ywr{)r$trimbothrr .0rrcs ra zlevene.. s./""**2776?OL/s36c0tj|dSrirmr}s rarzlevene..func rnrbrrNrrr!)rVrirrrrrrr rWr.fr9r`)rbrcrrrXYcirjr[ZijrZbariZbarr]dvarr^WrKs ` rarArA s*h22HII G A1uHII !B ((1c C  ( 6  &/&-// &1X"GAJ1gaj!A" 66"1 D &1*C 1X3WWQZ(3q612A3 HHQ E D 1X!773q6*a a2a5  ! DLD AXedlQ%6 6Q? ?E D 1X7 Aq)A-A667W E  A ??  a1d1f -D 4  rbc ttd|t|f}tt|dz Dcgc]}||||||dz}}t |Scc}w)Nrr!)rrrirr )rprrroutputs ra _apply_funcr% sc r!QA, A,1#a&1*,= >qd1QqT!AaC&>" >F > 6??sA  FlignerResultc |dvr tdt|}|dkr td|D]'}|jdk(st|}t ||cS|dk(rd}n |dk(rd }nt fd |D}d }t t|Dcgc]}t||c}}t t|Dcgc] }|||c}} tj|d } t|D cgc]} tt || | | z !} } g} dg}t|D]9} | jt| | |jt| ;tj| }t j"j%|d| d zzz dz}t'||tj|z }tj(|d }tj*|dd}tj|t ||z dzzd |z }t-|dz }t/||ddt}t ||Scc}wcc}wcc} w)a Perform Fligner-Killeen test for equality of variance. 
Fligner's test tests the null hypothesis that all input samples are from populations with equal variances. Fligner-Killeen's test is distribution free when populations are identical [2]_. Parameters ---------- sample1, sample2, ... : array_like Arrays of sample data. Need not be the same length. center : {'mean', 'median', 'trimmed'}, optional Keyword argument controlling which function of the data is used in computing the test statistic. The default is 'median'. proportiontocut : float, optional When `center` is 'trimmed', this gives the proportion of data points to cut from each end. (See `scipy.stats.trim_mean`.) Default is 0.05. Returns ------- statistic : float The test statistic. pvalue : float The p-value for the hypothesis test. See Also -------- bartlett : A parametric test for equality of k variances in normal samples levene : A robust parametric test for equality of k variances :ref:`hypothesis_fligner` : Extended example Notes ----- As with Levene's test there are three variants of Fligner's test that differ by the measure of central tendency used in the test. See `levene` for more information. Conover et al. (1981) examine many of the existing parametric and nonparametric tests by extensive simulations and they conclude that the tests proposed by Fligner and Killeen (1976) and Levene (1960) appear to be superior in terms of robustness of departures from normality and power [3]_. References ---------- .. [1] Park, C. and Lindsay, B. G. (1999). Robust Scale Estimation and Hypothesis Testing based on Quadratic Inference Function. Technical Report #99-03, Center for Likelihood Studies, Pennsylvania State University. https://cecas.clemson.edu/~cspark/cv/paper/qif/draftqif2.pdf .. [2] Fligner, M.A. and Killeen, T.J. (1976). Distribution-free two-sample tests for scale. Journal of the American Statistical Association. 71(353), 210-213. .. [3] Park, C. and Lindsay, B. G. (1999). Robust Scale Estimation and Hypothesis Testing based on Quadratic Inference Function. Technical Report #99-03, Center for Likelihood Studies, Pennsylvania State University. .. [4] Conover, W. J., Johnson, M. E. and Johnson M. M. (1981). A comparative study of tests for homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics, 23(4), 351-361. Examples -------- >>> import numpy as np >>> from scipy import stats Test whether the lists `a`, `b` and `c` come from populations with equal variances. >>> a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08, 9.01, 8.85, 9.06, 8.99] >>> b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58, 8.36, 9.18, 8.67, 9.05] >>> c = [8.95, 9.12, 8.95, 8.85, 9.03, 8.84, 9.07, 8.98, 8.86, 8.98] >>> stat, p = stats.fligner(a, b, c) >>> p 0.00450826080004775 The small p-value suggests that the populations do not have equal variances. This is not surprising, given that the sample variance of `b` is much larger than that of `a` and `c`: >>> [np.var(x, ddof=1) for x in [a, b, c]] [0.007054444444444413, 0.13073888888888888, 0.008890000000000002] For a more detailed example, see :ref:`hypothesis_fligner`. rergrdrPrrac0tj|dSrirjr}s rarzfligner..func rkrbrWc0tj|dSrirmr}s rarzfligner..func rnrbc3JK|]}tj|ywr{)r$rprqs rarszfligner.. s&/""**6?C/s #c0tj|dSrirmr}s rarzfligner..func rnrbrrrr!)rrrgr FrR)rVrirrrrr rrrrextendrappendr$rBr.rkrrrWrjr+r()rbrcrrrNaNrrwrXrvr[rrxallZijrranksAibaranbarvarsqrQr rKs ` rarBrB2 sFv22HII G A1uHII+ ;;! G$C c* *+  ( 6  &/&-// & 584a#gaj/4 5B U1X64 #6 7C 66"1 D6;Ah ?3wwqz"SV+ , ?C ? 
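# --- Illustrative usage sketch (not part of the library code) ---------------
# The three equality-of-variance tests discussed above differ in their
# assumptions: `bartlett` is a parametric test for normal samples, `levene`
# is robust to moderate departures from normality, and `fligner` is
# distribution free when the populations are identical.  Running them side
# by side on the samples from the docstring examples makes the contrast in
# p-values easy to inspect.
from scipy import stats

a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08, 9.01, 8.85, 9.06, 8.99]
b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58, 8.36, 9.18, 8.67, 9.05]
c = [8.95, 9.12, 8.95, 8.85, 9.03, 8.84, 9.07, 8.98, 8.86, 8.98]

for name, test in (("bartlett", stats.bartlett),
                   ("levene", stats.levene),
                   ("fligner", stats.fligner)):
    stat, p = test(a, b, c)
    print(f"{name:8s}  statistic={stat:8.4f}  p-value={p:.6f}")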
F A 1X d3q6l# V   v &E    # #EQs ^$$I!c|fSr{r|)x1s rar~r~ sbUrbr)rrreturnc ltjdg|f}||dk7}tjtjtj|dk7t dd} t |} tjd| dzt } tjtj||f} tj| }tjdg|f}tj|dk7t } tjtj| dd}|| z }tjdg| f} tjdg|f}tj| }tjdg|ddf}fd}|| dz dz}|| dz}t| Dcgc]}tj|||| }}|Dcgc]}tj||c}| | z }t||| z}|zdz zdz }||zdzzdzd z zd z ||zd zdz zz tj| | dzdz z| dzd z d |z |z dzzzzzz }||z tj|z fScc}wcc}w) Nr!rrrc |dzdz z dzS)Nr!rdr|)rgrs rapsiz_mood_inner_lc..psi sQUAI%))rbrrrdrr) r concatenatebincountrr rrir r diffrrr)rDrpdiffs sorted_xyrqr[r diffs_prepuniquesrmrjs sorted_xyx diff_is_zero xyx_countsrrS_i_m1rs_lowers_upperidxphi_JI_jphisr_E_0_TvarMs ` ra_mood_inner_lcr s!e -J a(G BIIbjjqDEFqrJA G A 1a!e3 'B Q01J GGJ E!e -J::jAoS9LRYY|45ab9JQA Qx A Qx A ! A^^aS!CR&M *F * Qi!mGeaiG>CAh GsRYYws|WS\ 2 GE G ). .BFF3s8  .2 6D D1R5LA QOb E EQW a! ,s 2 ES1WA& '"&&QTAX!Q$(bAEFNq3H.H"IJ+  D Y"''$- ' ))) H /s =#J,&"J1c\|\}}|j|}|j|}||z}|dkS)Nr)r)rr rrpr/rqr[rs ra_mood_too_smallr s7 DAq  A  A AA q5Lrb)rrc tj|t}tj|t}|dkr|j|z}t t t |jDcgc]}||k7s |j|c}}|t t t |jDcgc]}||k7r|j|c}k(s td|j|}|j|}||z}|dkr tdtj||f|} tj| |} tj| |} d| vr'tjt| || | ||||} n|dk7rtj| |d} | j| jdd} tj| } t | jdD]%}t!j"| d d |f| d d |f<'| d |}tj$||d zd z z d zd}|||zd z zd z }||z|d zz|d zz|d z zd z }||z t'|z } t)| t+|t}|dk(r | d} |d}n|| _||_t-| d|dScc}wcc}w)a Perform Mood's test for equal scale parameters. Mood's two-sample test for scale parameters is a non-parametric test for the null hypothesis that two samples are drawn from the same distribution with the same scale parameter. Parameters ---------- x, y : array_like Arrays of sample data. There must be at least three observations total. axis : int, optional The axis along which the samples are tested. `x` and `y` can be of different length along `axis`. If `axis` is None, `x` and `y` are flattened and the test is done on all values in the flattened arrays. alternative : {'two-sided', 'less', 'greater'}, optional Defines the alternative hypothesis. Default is 'two-sided'. The following options are available: * 'two-sided': the scales of the distributions underlying `x` and `y` are different. * 'less': the scale of the distribution underlying `x` is less than the scale of the distribution underlying `y`. * 'greater': the scale of the distribution underlying `x` is greater than the scale of the distribution underlying `y`. .. versionadded:: 1.7.0 Returns ------- res : SignificanceResult An object containing attributes: statistic : scalar or ndarray The z-score for the hypothesis test. For 1-D inputs a scalar is returned. pvalue : scalar ndarray The p-value for the hypothesis test. See Also -------- fligner : A non-parametric test for the equality of k variances ansari : A non-parametric test for the equality of 2 variances bartlett : A parametric test for equality of k variances in normal samples levene : A parametric test for equality of k variances Notes ----- The data are assumed to be drawn from probability distributions ``f(x)`` and ``f(x/s) / s`` respectively, for some probability density function f. The null hypothesis is that ``s == 1``. For multi-dimensional arrays, if the inputs are of shapes ``(n0, n1, n2, n3)`` and ``(n0, m1, n2, n3)``, then if ``axis=1``, the resulting z and p values will have shape ``(n0, n2, n3)``. Note that ``n1`` and ``m1`` don't have to be equal, but the other dimensions do. References ---------- [1] Mielke, Paul W. "Note on Some Squared Rank Tests with Existing Ties." Technometrics, vol. 9, no. 2, 1967, pp. 312-14. JSTOR, https://doi.org/10.2307/1266427. Accessed 18 May 2022. 
Examples -------- >>> import numpy as np >>> from scipy import stats >>> rng = np.random.default_rng() >>> x2 = rng.standard_normal((2, 45, 6, 7)) >>> x1 = rng.standard_normal((2, 30, 6, 7)) >>> res = stats.mood(x1, x2, axis=1) >>> res.pvalue.shape (2, 6, 7) Find the number of points where the difference in scale is not significant: >>> (res.pvalue > 0.1).sum() 78 Perform the test with different scales: >>> x1 = rng.standard_normal((2, 30)) >>> x2 = rng.standard_normal((2, 35)) * 10.0 >>> stats.mood(x1, x2, axis=1) SignificanceResult(statistic=array([-5.76174136, -6.12650783]), pvalue=array([8.32505043e-09, 8.98287869e-10])) rrzDB aDAw aeT"X ..]P ,s0 K ;K 9K WilcoxonResultrQrct|dr#|j|j|jfS|j|jfS)N zstatistic)rrQrr)rrs rawilcoxon_result_unpackerr s8sL!}}cjj#..88}}cjj((rbc0t||}|||_|Sr{)rr)rQrrrs rawilcoxon_result_objectr s F +C# Jrbc4|jdd}|dk(ryy)Nr+auto asymptoticrrdget)kwdsr+s rawilcoxon_outputsr s XXh 'F  rbmoder+c.|jdddSdS)Nr/rdr!r)rs rar~r~ sd 3 ?1Qrb)pairedrrrc F|dk(rd}tj|||||||S)a)Calculate the Wilcoxon signed-rank test. The Wilcoxon signed-rank test tests the null hypothesis that two related paired samples come from the same distribution. In particular, it tests whether the distribution of the differences ``x - y`` is symmetric about zero. It is a non-parametric version of the paired T-test. Parameters ---------- x : array_like Either the first set of measurements (in which case ``y`` is the second set of measurements), or the differences between two sets of measurements (in which case ``y`` is not to be specified.) Must be one-dimensional. y : array_like, optional Either the second set of measurements (if ``x`` is the first set of measurements), or not specified (if ``x`` is the differences between two sets of measurements.) Must be one-dimensional. .. warning:: When `y` is provided, `wilcoxon` calculates the test statistic based on the ranks of the absolute values of ``d = x - y``. Roundoff error in the subtraction can result in elements of ``d`` being assigned different ranks even when they would be tied with exact arithmetic. Rather than passing `x` and `y` separately, consider computing the difference ``x - y``, rounding as needed to ensure that only truly unique elements are numerically distinct, and passing the result as `x`, leaving `y` at the default (None). zero_method : {"wilcox", "pratt", "zsplit"}, optional There are different conventions for handling pairs of observations with equal values ("zero-differences", or "zeros"). * "wilcox": Discards all zero-differences (default); see [4]_. * "pratt": Includes zero-differences in the ranking process, but drops the ranks of the zeros (more conservative); see [3]_. In this case, the normal approximation is adjusted as in [5]_. * "zsplit": Includes zero-differences in the ranking process and splits the zero rank between positive and negative ones. correction : bool, optional If True, apply continuity correction by adjusting the Wilcoxon rank statistic by 0.5 towards the mean value when computing the z-statistic if a normal approximation is used. Default is False. alternative : {"two-sided", "greater", "less"}, optional Defines the alternative hypothesis. Default is 'two-sided'. In the following, let ``d`` represent the difference between the paired samples: ``d = x - y`` if both ``x`` and ``y`` are provided, or ``d = x`` otherwise. * 'two-sided': the distribution underlying ``d`` is not symmetric about zero. * 'less': the distribution underlying ``d`` is stochastically less than a distribution symmetric about zero. 
* 'greater': the distribution underlying ``d`` is stochastically greater than a distribution symmetric about zero. method : {"auto", "exact", "asymptotic"} or `PermutationMethod` instance, optional Method to calculate the p-value, see Notes. Default is "auto". axis : int or None, default: 0 If an int, the axis of the input along which to compute the statistic. The statistic of each axis-slice (e.g. row) of the input will appear in a corresponding element of the output. If ``None``, the input will be raveled before computing the statistic. Returns ------- An object with the following attributes. statistic : array_like If `alternative` is "two-sided", the sum of the ranks of the differences above or below zero, whichever is smaller. Otherwise the sum of the ranks of the differences above zero. pvalue : array_like The p-value for the test depending on `alternative` and `method`. zstatistic : array_like When ``method = 'asymptotic'``, this is the normalized z-statistic:: z = (T - mn - d) / se where ``T`` is `statistic` as defined above, ``mn`` is the mean of the distribution under the null hypothesis, ``d`` is a continuity correction, and ``se`` is the standard error. When ``method != 'asymptotic'``, this attribute is not available. See Also -------- kruskal, mannwhitneyu Notes ----- In the following, let ``d`` represent the difference between the paired samples: ``d = x - y`` if both ``x`` and ``y`` are provided, or ``d = x`` otherwise. Assume that all elements of ``d`` are independent and identically distributed observations, and all are distinct and nonzero. - When ``len(d)`` is sufficiently large, the null distribution of the normalized test statistic (`zstatistic` above) is approximately normal, and ``method = 'asymptotic'`` can be used to compute the p-value. - When ``len(d)`` is small, the normal approximation may not be accurate, and ``method='exact'`` is preferred (at the cost of additional execution time). - The default, ``method='auto'``, selects between the two: ``method='exact'`` is used when ``len(d) <= 50``, and ``method='asymptotic'`` is used otherwise. The presence of "ties" (i.e. not all elements of ``d`` are unique) or "zeros" (i.e. elements of ``d`` are zero) changes the null distribution of the test statistic, and ``method='exact'`` no longer calculates the exact p-value. If ``method='asymptotic'``, the z-statistic is adjusted for more accurate comparison against the standard normal, but still, for finite sample sizes, the standard normal is only an approximation of the true null distribution of the z-statistic. For such situations, the `method` parameter also accepts instances of `PermutationMethod`. In this case, the p-value is computed using `permutation_test` with the provided configuration options and other appropriate settings. The presence of ties and zeros affects the resolution of ``method='auto'`` accordingly: exhasutive permutations are performed when ``len(d) <= 13``, and the asymptotic method is used otherwise. Note that they asymptotic method may not be very accurate even for ``len(d) > 14``; the threshold was chosen as a compromise between execution time and accuracy under the constraint that the results must be deterministic. Consider providing an instance of `PermutationMethod` method manually, choosing the ``n_resamples`` parameter to balance time constraints and accuracy requirements. 
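# --- Illustrative sketch (not part of the library code) ---------------------
# As suggested in the paragraph above, a `PermutationMethod` instance can be
# passed as `method` when ties or zeros make the exact and asymptotic
# p-values questionable.  The height-difference data from the Examples
# section further below is reused here; the resample count is an arbitrary
# choice made for this sketch.
import numpy as np
from scipy import stats

d = [6, 8, 14, 16, 23, 24, 28, 29, 41, -48, 49, 56, 60, -67, 75]

method = stats.PermutationMethod(n_resamples=9999,
                                 random_state=np.random.default_rng())
res = stats.wilcoxon(d, method=method)
print(res.statistic, res.pvalue)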
Please also note that in the edge case that all elements of ``d`` are zero, the p-value relying on the normal approximaton cannot be computed (NaN) if ``zero_method='wilcox'`` or ``zero_method='pratt'``. References ---------- .. [1] https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test .. [2] Conover, W.J., Practical Nonparametric Statistics, 1971. .. [3] Pratt, J.W., Remarks on Zeros and Ties in the Wilcoxon Signed Rank Procedures, Journal of the American Statistical Association, Vol. 54, 1959, pp. 655-667. :doi:`10.1080/01621459.1959.10501526` .. [4] Wilcoxon, F., Individual Comparisons by Ranking Methods, Biometrics Bulletin, Vol. 1, 1945, pp. 80-83. :doi:`10.2307/3001968` .. [5] Cureton, E.E., The Normal Approximation to the Signed-Rank Sampling Distribution When Zero Differences are Present, Journal of the American Statistical Association, Vol. 62, 1967, pp. 1068-1069. :doi:`10.1080/01621459.1967.10500917` Examples -------- In [4]_, the differences in height between cross- and self-fertilized corn plants is given as follows: >>> d = [6, 8, 14, 16, 23, 24, 28, 29, 41, -48, 49, 56, 60, -67, 75] Cross-fertilized plants appear to be higher. To test the null hypothesis that there is no height difference, we can apply the two-sided test: >>> from scipy.stats import wilcoxon >>> res = wilcoxon(d) >>> res.statistic, res.pvalue (24.0, 0.041259765625) Hence, we would reject the null hypothesis at a confidence level of 5%, concluding that there is a difference in height between the groups. To confirm that the median of the differences can be assumed to be positive, we use: >>> res = wilcoxon(d, alternative='greater') >>> res.statistic, res.pvalue (96.0, 0.0206298828125) This shows that the null hypothesis that the median is negative can be rejected at a confidence level of 5% in favor of the alternative that the median is greater than zero. The p-values above are exact. Using the normal approximation gives very similar values: >>> res = wilcoxon(d, method='asymptotic') >>> res.statistic, res.pvalue (24.0, 0.04088813291185591) Note that the statistic changed to 96 in the one-sided case (the sum of ranks of positive differences) whereas it is 24 in the two-sided case (the minimum of sum of ranks above and below zero). In the example above, the differences in height between paired plants are provided to `wilcoxon` directly. Alternatively, `wilcoxon` accepts two samples of equal length, calculates the differences between paired elements, then performs the test. Consider the samples ``x`` and ``y``: >>> import numpy as np >>> x = np.array([0.5, 0.825, 0.375, 0.5]) >>> y = np.array([0.525, 0.775, 0.325, 0.55]) >>> res = wilcoxon(x, y, alternative='greater') >>> res WilcoxonResult(statistic=5.0, pvalue=0.5625) Note that had we calculated the differences by hand, the test would have produced different results: >>> d = [-0.025, 0.05, 0.05, -0.05] >>> ref = wilcoxon(d, alternative='greater') >>> ref WilcoxonResult(statistic=6.0, pvalue=0.5) The substantial difference is due to roundoff error in the results of ``x-y``: >>> d - (x-y) array([2.08166817e-17, 6.93889390e-17, 1.38777878e-17, 4.16333634e-17]) Even though we expected all the elements of ``(x-y)[1:]`` to have the same magnitude ``0.05``, they have slightly different magnitudes in practice, and therefore are assigned different ranks in the test. Before performing the test, consider calculating ``d`` and adjusting it as necessary to ensure that theoretically identically values are not numerically distinct. 
For example: >>> d2 = np.around(x - y, decimals=3) >>> wilcoxon(d2, alternative='greater') WilcoxonResult(statistic=6.0, pvalue=0.5) approxr)r% _wilcoxon_nd)rpr/ zero_methodrQr r+rs rarDrD s4R  ! !!Q Z"($ 00rbMedianTestResult)rQrratablebelow)tiesrQlambda_rct|dkr tdgd}||vrtd|dt|dd|Dcgc]}tj|}}t |D]T\}} | j dk(rtd |dzd | jdk7s7td |dzd | jd tj|} t| |} |d k(r:| r8ttjtjtjdS| r-tj| tj| } ntj| } tjdt|ftj} t |D]\}}|tj|}t!|| kD}t!|| k}|j ||zz }| d|fxx|z cc<| d|fxx|z cc<|dk(r| d|fxx|z cc<|dk(s| d|fxx|z cc<| j#d}|ddk(rtd| d|ddk(rtd| d|dk(rQtj$| dk(j'dd}t|dkDrtd|ddzd| dt)| ||\}}}}t||| | Scc}w)a Perform a Mood's median test. Test that two or more samples come from populations with the same median. Let ``n = len(samples)`` be the number of samples. The "grand median" of all the data is computed, and a contingency table is formed by classifying the values in each sample as being above or below the grand median. The contingency table, along with `correction` and `lambda_`, are passed to `scipy.stats.chi2_contingency` to compute the test statistic and p-value. Parameters ---------- sample1, sample2, ... : array_like The set of samples. There must be at least two samples. Each sample must be a one-dimensional sequence containing at least one value. The samples are not required to have the same length. ties : str, optional Determines how values equal to the grand median are classified in the contingency table. The string must be one of:: "below": Values equal to the grand median are counted as "below". "above": Values equal to the grand median are counted as "above". "ignore": Values equal to the grand median are not counted. The default is "below". correction : bool, optional If True, *and* there are just two samples, apply Yates' correction for continuity when computing the test statistic associated with the contingency table. Default is True. lambda_ : float or str, optional By default, the statistic computed in this test is Pearson's chi-squared statistic. `lambda_` allows a statistic from the Cressie-Read power divergence family to be used instead. See `power_divergence` for details. Default is 1 (Pearson's chi-squared statistic). nan_policy : {'propagate', 'raise', 'omit'}, optional Defines how to handle when input contains nan. 'propagate' returns nan, 'raise' throws an error, 'omit' performs the calculations ignoring nan values. Default is 'propagate'. Returns ------- res : MedianTestResult An object containing attributes: statistic : float The test statistic. The statistic that is returned is determined by `lambda_`. The default is Pearson's chi-squared statistic. pvalue : float The p-value of the test. median : float The grand median. table : ndarray The contingency table. The shape of the table is (2, n), where n is the number of samples. The first row holds the counts of the values above the grand median, and the second row holds the counts of the values below the grand median. The table allows further analysis with, for example, `scipy.stats.chi2_contingency`, or with `scipy.stats.fisher_exact` if there are two samples, without having to recompute the table. If ``nan_policy`` is "propagate" and there are nans in the input, the return value for ``table`` is ``None``. See Also -------- kruskal : Compute the Kruskal-Wallis H-test for independent samples. mannwhitneyu : Computes the Mann-Whitney rank test on samples x and y. Notes ----- .. versionadded:: 0.15.0 References ---------- .. [1] Mood, A. M., Introduction to the Theory of Statistics. McGraw-Hill (1950), pp. 394-399. .. [2] Zar, J. 
H., Biostatistical Analysis, 5th ed. Prentice Hall (2010). See Sections 8.12 and 10.15. Examples -------- A biologist runs an experiment in which there are three groups of plants. Group 1 has 16 plants, group 2 has 15 plants, and group 3 has 17 plants. Each plant produces a number of seeds. The seed counts for each group are:: Group 1: 10 14 14 18 20 22 24 25 31 31 32 39 43 43 48 49 Group 2: 28 30 31 33 34 35 36 40 44 55 57 61 91 92 99 Group 3: 0 3 9 22 23 25 25 33 34 34 40 45 46 48 62 67 84 The following code applies Mood's median test to these samples. >>> g1 = [10, 14, 14, 18, 20, 22, 24, 25, 31, 31, 32, 39, 43, 43, 48, 49] >>> g2 = [28, 30, 31, 33, 34, 35, 36, 40, 44, 55, 57, 61, 91, 92, 99] >>> g3 = [0, 3, 9, 22, 23, 25, 25, 33, 34, 34, 40, 45, 46, 48, 62, 67, 84] >>> from scipy.stats import median_test >>> res = median_test(g1, g2, g3) The median is >>> res.median 34.0 and the contingency table is >>> res.table array([[ 5, 10, 7], [11, 5, 10]]) `p` is too large to conclude that the medians are not the same: >>> res.pvalue 0.12609082774093244 The "G-test" can be performed by passing ``lambda_="log-likelihood"`` to `median_test`. >>> res = median_test(g1, g2, g3, lambda_="log-likelihood") >>> res.pvalue 0.12224779737117837 The median occurs several times in the data, so we'll get a different result if, for example, ``ties="above"`` is used: >>> res = median_test(g1, g2, g3, ties="above") >>> res.pvalue 0.063873276069553273 >>> res.table array([[ 5, 11, 9], [11, 4, 8]]) This example demonstrates that if the data set is not large and there are values equal to the median, the p-value can be sensitive to the choice of `ties`. rdz)median_test requires two or more samples.)raboverzinvalid 'ties' option 'z'; 'ties' must be one of: r!rrzSample z7 is empty. All samples must contain at least one value.z has z; dimensions. All samples must be one-dimensional sequences.rNrrrrz'All values are below the grand median (z).z'All values are above the grand median (rzAll values in sample z are equal to the grand median (z5), so they are ignored, resulting in an empty sample.)rrQ)rirVrrr rrrrrrrraisnanr int64rrnonzeror-r-)rrQrrr ties_optionsrrYrrcdata contains_nan grand_medianrnabovenbelownequalrowsums zero_colsstatr#dofexpecteds rarErEs2\ 7|aDEE/L <24&9 #L 1!B 78:; ;.5 56BJJv  5D 5$P1 66Q;wq1ug.;<= = 66Q;wq1ugU166(;NOP P P NN4 E  3L[ \==yy'7!89 yy' HHaT^288 4Et_ " 6&))*v 45v 450 ad v  ad v 7? !Q$K6 !K W_ !Q$K6 !K " iiQiGqzQB<.PRSTTqzQB<.PRSTT x JJ //Q/78; y>A ' ! q(8'9:'*+ !  .eW9CED!S( D!\5 99} 6sK0c| t|n|}t|d|}|dtz|z z}|j|}|j |}|||fS)NTrrg)rr rsincos)rperiodrscaled_samplessin_sampcos_samps ra_circfuncs_commonrs^%'Z !RB"=GrV 34Nvvn%Hvvn%H Hh &&rbc|Sr{r|r}s rar~r~rrbc|fSr{r|rs rar~r~!rbcTt|}t|dk(r|j||S||z }t|||\}}}|j ||} |j ||} |j | | } | j dk(r| dn| } | |dtzz z|z |z|zS)az Compute the circular mean of a sample of angle observations. Given :math:`n` angle observations :math:`x_1, \cdots, x_n` measured in radians, their *circular mean* is defined by ([1]_, Eq. 2.2.4) .. math:: \mathrm{Arg} \left( \frac{1}{n} \sum_{k=1}^n e^{i x_k} \right) where :math:`i` is the imaginary unit and :math:`\mathop{\mathrm{Arg}} z` gives the principal value of the argument of complex number :math:`z`, restricted to the range :math:`[0,2\pi]` by default. :math:`z` in the above expression is known as the `mean resultant vector`. Parameters ---------- samples : array_like Input array of angle observations. The value of a full angle is equal to ``(high - low)``. 
high : float, optional Upper boundary of the principal value of an angle. Default is ``2*pi``. low : float, optional Lower boundary of the principal value of an angle. Default is ``0``. Returns ------- circmean : float Circular mean, restricted to the range ``[low, high]``. If the mean resultant vector is zero, an input-dependent, implementation-defined number between ``[low, high]`` is returned. If the input array is empty, ``np.nan`` is returned. See Also -------- circstd : Circular standard deviation. circvar : Circular variance. References ---------- .. [1] Mardia, K. V. and Jupp, P. E. *Directional Statistics*. John Wiley & Sons, 1999. Examples -------- For readability, all angles are printed out in degrees. >>> import numpy as np >>> from scipy.stats import circmean >>> import matplotlib.pyplot as plt >>> angles = np.deg2rad(np.array([20, 30, 330])) >>> circmean = circmean(angles) >>> np.rad2deg(circmean) 7.294976657784009 >>> mean = angles.mean() >>> np.rad2deg(mean) 126.66666666666666 Plot and compare the circular mean against the arithmetic mean. >>> plt.plot(np.cos(np.linspace(0, 2*np.pi, 500)), ... np.sin(np.linspace(0, 2*np.pi, 500)), ... c='k') >>> plt.scatter(np.cos(angles), np.sin(angles), c='k') >>> plt.scatter(np.cos(circmean), np.sin(circmean), c='b', ... label='circmean') >>> plt.scatter(np.cos(mean), np.sin(mean), c='r', label='mean') >>> plt.legend() >>> plt.axis('equal') >>> plt.show() rrrr|rg)rrrWrratan2rr) rhighlowrrrrrrsin_sumcos_sumrs rarFrFs\  !Bw1wwwTw** CZF"3GV"KGXxffXDf)GffXDf)G ((7G $CXX]#b'C 6S2X& '# - 7# ==rbc|Sr{r|r}s rar~r~rrbc|fSr{r|rs rar~r~rrbct|}||z }t|||\}}}|j||} |j||} | dz| dzzdz} |j| d} d| z } | S)a Compute the circular variance of a sample of angle observations. Given :math:`n` angle observations :math:`x_1, \cdots, x_n` measured in radians, their *circular variance* is defined by ([2]_, Eq. 2.3.3) .. math:: 1 - \left| \frac{1}{n} \sum_{k=1}^n e^{i x_k} \right| where :math:`i` is the imaginary unit and :math:`|z|` gives the length of the complex number :math:`z`. :math:`|z|` in the above expression is known as the `mean resultant length`. Parameters ---------- samples : array_like Input array of angle observations. The value of a full angle is equal to ``(high - low)``. high : float, optional Upper boundary of the principal value of an angle. Default is ``2*pi``. low : float, optional Lower boundary of the principal value of an angle. Default is ``0``. Returns ------- circvar : float Circular variance. The returned value is in the range ``[0, 1]``, where ``0`` indicates no variance and ``1`` indicates large variance. If the input array is empty, ``np.nan`` is returned. See Also -------- circmean : Circular mean. circstd : Circular standard deviation. Notes ----- In the limit of small angles, the circular variance is close to half the 'linear' variance if measured in radians. References ---------- .. [1] Fisher, N.I. *Statistical analysis of circular data*. Cambridge University Press, 1993. .. [2] Mardia, K. V. and Jupp, P. E. *Directional Statistics*. John Wiley & Sons, 1999. Examples -------- >>> import numpy as np >>> from scipy.stats import circvar >>> import matplotlib.pyplot as plt >>> samples_1 = np.array([0.072, -0.158, 0.077, 0.108, 0.286, ... 0.133, -0.473, -0.001, -0.348, 0.131]) >>> samples_2 = np.array([0.111, -0.879, 0.078, 0.733, 0.421, ... 0.104, -0.136, -0.867, 0.012, 0.105]) >>> circvar_1 = circvar(samples_1) >>> circvar_2 = circvar(samples_2) Plot the samples. 
>>> fig, (left, right) = plt.subplots(ncols=2) >>> for image in (left, right): ... image.plot(np.cos(np.linspace(0, 2*np.pi, 500)), ... np.sin(np.linspace(0, 2*np.pi, 500)), ... c='k') ... image.axis('equal') ... image.axis('off') >>> left.scatter(np.cos(samples_1), np.sin(samples_1), c='k', s=15) >>> left.set_title(f"circular variance: {np.round(circvar_1, 2)!r}") >>> right.scatter(np.cos(samples_2), np.sin(samples_2), c='k', s=15) >>> right.set_title(f"circular variance: {np.round(circvar_2, 2)!r}") >>> plt.show() rrrgrrrY)rrrWrW)rrrrrrrrrsin_meancos_mean hypotenuseRrs rarGrGsb  !B CZF"3GV"KGXxwwxdw+Hwwxdw+HB,2-3J  #A q&C Jrbc|Sr{r|r}s rar~r~Vrrbc|fSr{r|rs rar~r~Wrrb) normalizec4t|}||z }t|||\}}} |j||} |j| |} | dz| dzzdz} |j| d} d|j | zdzdz}|s|||z dt zz z}|S) a` Compute the circular standard deviation of a sample of angle observations. Given :math:`n` angle observations :math:`x_1, \cdots, x_n` measured in radians, their `circular standard deviation` is defined by ([2]_, Eq. 2.3.11) .. math:: \sqrt{ -2 \log \left| \frac{1}{n} \sum_{k=1}^n e^{i x_k} \right| } where :math:`i` is the imaginary unit and :math:`|z|` gives the length of the complex number :math:`z`. :math:`|z|` in the above expression is known as the `mean resultant length`. Parameters ---------- samples : array_like Input array of angle observations. The value of a full angle is equal to ``(high - low)``. high : float, optional Upper boundary of the principal value of an angle. Default is ``2*pi``. low : float, optional Lower boundary of the principal value of an angle. Default is ``0``. normalize : boolean, optional If ``False`` (the default), the return value is computed from the above formula with the input scaled by ``(2*pi)/(high-low)`` and the output scaled (back) by ``(high-low)/(2*pi)``. If ``True``, the output is not scaled and is returned directly. Returns ------- circstd : float Circular standard deviation, optionally normalized. If the input array is empty, ``np.nan`` is returned. See Also -------- circmean : Circular mean. circvar : Circular variance. Notes ----- In the limit of small angles, the circular standard deviation is close to the 'linear' standard deviation if ``normalize`` is ``False``. References ---------- .. [1] Mardia, K. V. (1972). 2. In *Statistics of Directional Data* (pp. 18-24). Academic Press. :doi:`10.1016/C2013-0-07425-7`. .. [2] Mardia, K. V. and Jupp, P. E. *Directional Statistics*. John Wiley & Sons, 1999. Examples -------- >>> import numpy as np >>> from scipy.stats import circstd >>> import matplotlib.pyplot as plt >>> samples_1 = np.array([0.072, -0.158, 0.077, 0.108, 0.286, ... 0.133, -0.473, -0.001, -0.348, 0.131]) >>> samples_2 = np.array([0.111, -0.879, 0.078, 0.733, 0.421, ... 0.104, -0.136, -0.867, 0.012, 0.105]) >>> circstd_1 = circstd(samples_1) >>> circstd_2 = circstd(samples_2) Plot the samples. >>> fig, (left, right) = plt.subplots(ncols=2) >>> for image in (left, right): ... image.plot(np.cos(np.linspace(0, 2*np.pi, 500)), ... np.sin(np.linspace(0, 2*np.pi, 500)), ... c='k') ... image.axis('equal') ... image.axis('off') >>> left.scatter(np.cos(samples_1), np.sin(samples_1), c='k', s=15) >>> left.set_title(f"circular std: {np.round(circstd_1, 2)!r}") >>> right.plot(np.cos(np.linspace(0, 2*np.pi, 500)), ... np.sin(np.linspace(0, 2*np.pi, 500)), ... 
c='k') >>> right.scatter(np.cos(samples_2), np.sin(samples_2), c='k', s=15) >>> right.set_title(f"circular std: {np.round(circstd_2, 2)!r}") >>> plt.show() rrrgrrrrhr)rrrWrWrr)rrrrrrrrrrrrrrrs rarHrHUsv  !B CZF"3GV"KGXxwwxdw+Hwwxdw+HB,2-3J  #A bffQi<# c !C  S2b5!! JrbceZdZdZdZy)DirectionalStatsc ||_||_yr{mean_directionmean_resultant_length)r9rrs rar,zDirectionalStats.__init__s,%:"rbc<d|jd|jdS)Nz DirectionalStats(mean_direction=z, mean_resultant_length=)rr8s rar:zDirectionalStats.__repr__s0243F3F2GH**.*D*D)EQH IrbN)r;r<r=r,r:r|rbrarrs ;Irbr)rrct|}|j|}|jdkr!tdt |j |j ||d}|rt|dd|}||z }|j|d}t|dd|}||z }|j|d}|jdk(r|dn|}t||S) a Computes sample statistics for directional data. Computes the directional mean (also called the mean direction vector) and mean resultant length of a sample of vectors. The directional mean is a measure of "preferred direction" of vector data. It is analogous to the sample mean, but it is for use when the length of the data is irrelevant (e.g. unit vectors). The mean resultant length is a value between 0 and 1 used to quantify the dispersion of directional data: the smaller the mean resultant length, the greater the dispersion. Several definitions of directional variance involving the mean resultant length are given in [1]_ and [2]_. Parameters ---------- samples : array_like Input array. Must be at least two-dimensional, and the last axis of the input must correspond with the dimensionality of the vector space. When the input is exactly two dimensional, this means that each row of the data is a vector observation. axis : int, default: 0 Axis along which the directional mean is computed. normalize: boolean, default: True If True, normalize the input to ensure that each observation is a unit vector. It the observations are already unit vectors, consider setting this to False to avoid unnecessary computation. Returns ------- res : DirectionalStats An object containing attributes: mean_direction : ndarray Directional mean. mean_resultant_length : ndarray The mean resultant length [1]_. See Also -------- circmean: circular mean; i.e. directional mean for 2D *angles* circvar: circular variance; i.e. directional variance for 2D *angles* Notes ----- This uses a definition of directional mean from [1]_. Assuming the observations are unit vectors, the calculation is as follows. .. code-block:: python mean = samples.mean(axis=0) mean_resultant_length = np.linalg.norm(mean) mean_direction = mean / mean_resultant_length This definition is appropriate for *directional* data (i.e. vector data for which the magnitude of each observation is irrelevant) but not for *axial* data (i.e. vector data for which the magnitude and *sign* of each observation is irrelevant). Several definitions of directional variance involving the mean resultant length ``R`` have been proposed, including ``1 - R`` [1]_, ``1 - R**2`` [2]_, and ``2 * (1 - R)`` [2]_. Rather than choosing one, this function returns ``R`` as attribute `mean_resultant_length` so the user can compute their preferred measure of dispersion. References ---------- .. [1] Mardia, Jupp. (2000). *Directional Statistics* (p. 163). Wiley. .. [2] https://en.wikipedia.org/wiki/Directional_statistics Examples -------- >>> import numpy as np >>> from scipy.stats import directional_stats >>> data = np.array([[3, 4], # first observation, 2D vector space ... 
[6, -8]]) # second observation >>> dirstats = directional_stats(data) >>> dirstats.mean_direction array([1., 0.]) In contrast, the regular sample mean of the vectors would be influenced by the magnitude of each observation. Furthermore, the result would not be a unit vector. >>> data.mean(axis=0) array([4.5, -2.]) An exemplary use case for `directional_stats` is to find a *meaningful* center for a set of observations on a sphere, e.g. geographical locations. >>> data = np.array([[0.8660254, 0.5, 0.], ... [0.8660254, -0.5, 0.]]) >>> dirstats = directional_stats(data) >>> dirstats.mean_direction array([1., 0., 0.]) The regular sample mean on the other hand yields a result which does not lie on the surface of the sphere. >>> data.mean(axis=0) array([0.8660254, 0., 0.]) The function also returns the mean resultant length, which can be used to calculate a directional variance. For example, using the definition ``Var(z) = 1 - R`` from [2]_ where ``R`` is the mean resultant length, we can calculate the directional variance of the vectors in the above example as: >>> 1 - dirstats.mean_resultant_length 0.13397459716167093 rdzEsamples must at least be two-dimensional. Instead samples has shape: rrT)rrrrr|) rr rrVrrrTrrWsqueezer) rrrr vectornormsrWrrmrls rarNrNsf  !Bjj!G||a77>> ps = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298, 0.0344, ... 0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.000] If the chosen significance level is 0.05, we may be tempted to reject the null hypotheses for the tests corresponding with the first nine p-values, as the first nine p-values fall below the chosen significance level. However, this would ignore the problem of "multiplicity": if we fail to correct for the fact that multiple comparisons are being performed, we are more likely to incorrectly reject true null hypotheses. One approach to the multiplicity problem is to control the family-wise error rate (FWER), that is, the rate at which the null hypothesis is rejected when it is actually true. A common procedure of this kind is the Bonferroni correction [1]_. We begin by multiplying the p-values by the number of hypotheses tested. >>> import numpy as np >>> np.array(ps) * len(ps) array([1.5000e-03, 6.0000e-03, 2.8500e-02, 1.4250e-01, 3.0150e-01, 4.1700e-01, 4.4700e-01, 5.1600e-01, 6.8850e-01, 4.8600e+00, 6.3930e+00, 8.5785e+00, 9.7920e+00, 1.1385e+01, 1.5000e+01]) To control the FWER at 5%, we reject only the hypotheses corresponding with adjusted p-values less than 0.05. In this case, only the hypotheses corresponding with the first three p-values can be rejected. According to [1]_, these three hypotheses concerned "allergic reaction" and "two different aspects of bleeding." An alternative approach is to control the false discovery rate: the expected fraction of rejected null hypotheses that are actually true. The advantage of this approach is that it typically affords greater power: an increased rate of rejecting the null hypothesis when it is indeed false. To control the false discovery rate at 5%, we apply the Benjamini-Hochberg p-value adjustment. >>> from scipy import stats >>> stats.false_discovery_control(ps) array([0.0015 , 0.003 , 0.0095 , 0.035625 , 0.0603 , 0.06385714, 0.06385714, 0.0645 , 0.0765 , 0.486 , 0.58118182, 0.714875 , 0.75323077, 0.81321429, 1. ]) Now, the first *four* adjusted p-values fall below 0.05, so we would reject the null hypotheses corresponding with these *four* p-values. 
Rejection of the fourth null hypothesis was particularly important to the original study as it led to the conclusion that the new treatment had a "substantially lower in-hospital mortality rate." rr!z/`ps` must include only numbers between 0 and 1.rbyzUnrecognized `method` 'z'.Method must be one of r?Nr|z#`axis` must be an integer or `None`rr.)r}r)valuesr)rr rVrnumberr-rWrVrrrvrrrTargsorttake_along_axisr rrC accumulateput_along_axisr)psrr+ ps_in_rangerdr[orderrs rarOrONs\ BB==29957vvbBGGB1$556 JKKTlG ||~W$26(;229!=> > \\^F | XXZ ::d B D ==RZZ 0DIIN>?? ww!|rxx~*"v Rr "B  A JJr #E  BB /B !QqSA!a%KB~ bffQUm JJ"S$B$Y-RTrT ]Db% ; RT "B 772q! rb)r)rd)T)r|rkTNF))rr tukeylambda)rNP)r)NNN)NrN)Nrr{)rk)r=r)rr=)NwilcoxFr=r)rlr_ threading collectionsrnumpyrrrrrrr r r r r rrrrrrrscipyrrrrscipy._lib._bunchrscipy._lib._utilrrrscipy._lib._array_apirrrr _ansari_swilk_statisticsr"r#r$r%_fitr&r'r(r)r*r+r, contingencyr-r._distn_infrastructurer/_axis_nan_policyr0r1__all__rPrSrTr3r2r4r5rrrr6r7r8rrr9r r'r:r3r5rUr;rpr<rKrwrJrLrMrr=rrrr_Avals_weibullr_cvals_weibullinterp1dr_rrrr>rrrrIr%r'localrAr?rNr@r`rArrrBrOrrrCrrrrrDrrErrFrGrHrrNrOr|rbrar's "2222287/GG4"GG)-I &12 j"9 : Y 7 8`FHV!2adc0Tc0c0L!2ad777777t< ~> &`FU=pdNQ#$e m;`+*;==4 FQ!h-   k15`>Q`F'TA3HdN0gTVOrC7L?,CD -1PTUV8VV8x78 89 9: BC KJJJJJJJJJJ L.)QR(&+&&~~7G7G,452@2DF >B##3$:K JDZ(V!H). @D@F.*AB 3333rY__  ,!4a,5a,H,.EF.D9j%:j%Z.*AB ,$7$dK!8K!\?,CD -48%tM*9M*`*a1EF*F*FF*R,_UO/VO/d##3k85LM) 68$4F,8H :?-3f0=>f0 % f0R%. '4&T:n '14%R4QTkV> V>r14%B$AD[W Wt14%B$AD[cc cLII()DBCJ)*$Drb
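# --- Illustrative sketch (not part of the library code) ---------------------
# The Benjamini-Hochberg adjustment described above can be written in a few
# lines of NumPy: scale the sorted p-values by m/rank, enforce monotonicity
# with a reverse cumulative minimum, and clip to [0, 1].  Comparing against
# `false_discovery_control` (default BH method) on the study's p-values
# should reproduce the adjusted values shown in the docstring; this is a
# sketch of the procedure, not the library's implementation.
import numpy as np
from scipy import stats

ps = np.array([0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298, 0.0344,
               0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.000])

m = len(ps)
order = np.argsort(ps)
scaled = ps[order] * m / np.arange(1, m + 1)            # p_(i) * m / i
adjusted = np.minimum.accumulate(scaled[::-1])[::-1]    # reverse cumulative minimum
adjusted = np.clip(adjusted, 0, 1)

out = np.empty_like(adjusted)
out[order] = adjusted                                   # restore the input order

print(np.allclose(out, stats.false_discovery_control(ps)))   # expected: True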