L ioSddlZddlZddlmZmZmZmZddlm Z ddl m Z ddl m Z ddlmZddlmZd gZGd d Zdd ZdZe dgdgZed dddfdZdZdZdZdZy)N)check_random_state MapWrapper rng_integers _contains_nan)_make_tuple_bunchcdist) _measurements)_local_correlations) distributionsmultiscale_graphcorrceZdZdZdZdZy) _ParallelPz.Helper function to calculate parallel p-value.c.||_||_||_yNxy random_states)selfrrrs V/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/scipy/stats/_mgc.py__init__z_ParallelP.__init__s*c|j|j|jjd}|j|dd|f}t |j |d}|S)Nr)r permutationrshape _mgc_statr)rindexorderpermy perm_stats r__call__z_ParallelP.__call__s\""5)55dffll1oFu ah'dffe,Q/ rN)__name__ __module__ __qualname____doc__rr#rrrrs8+ rrc t|}t|Dcgc]<}tjj t |ddtj >}}t|||}t|5} tjt| |t|} dddd |k\jzd|zz } | | fScc}w#1swY.xYw)aHelper function that calculates the p-value. See below for uses. Parameters ---------- x, y : ndarray `x` and `y` have shapes ``(n, p)`` and ``(n, q)``. stat : float The sample test statistic. reps : int, optional The number of replications used to estimate the null when using the permutation test. The default is 1000 replications. workers : int or map-like callable, optional If `workers` is an int the population is subdivided into `workers` sections and evaluated in parallel (uses `multiprocessing.Pool `). Supply `-1` to use all cores available to the Process. Alternatively supply a map-like callable, such as `multiprocessing.Pool.map` for evaluating the population in parallel. This evaluation is carried out as `workers(func, iterable)`. Requires that `func` be pickleable. random_state : {None, int, `numpy.random.Generator`, `numpy.random.RandomState`}, optional If `seed` is None (or `np.random`), the `numpy.random.RandomState` singleton is used. If `seed` is an int, a new ``RandomState`` instance is used, seeded with `seed`. If `seed` is already a ``Generator`` or ``RandomState`` instance then that instance is used. Returns ------- pvalue : float The sample test p-value. 
null_dist : list The approximated null distribution. l)sizedtyperNr ) rrangenprandom RandomStateruint32rrarraylistsum) rrstatrepsworkers random_state_r parallelp mapwrapper null_distpvalues r _perm_testr?#sP&l3L8=d E34YY**< g299,./EMEQ!=AI G G HHT*Yd "DEF G9$))++D 9F 9 E GGsAC 5/CCct||Srr)rs r_euclidean_distrAZs A;r MGCResult) statisticr>mgc_dictFcht|tjrt|tjs td|jdk(r|ddtj f}n'|jdk7rtd|j |jdk(r|ddtj f}n'|jdk7rtd|j |j \}}|j \} } t|dt|dtjtj|d kDs+tjtj|d kDr td || k7r|| k(rd }n td |d ks| d kr td|jtj}|jtj}t|s | tdt|tr|d kr td|dkrd} tj| t d|r| tdt#||\}}|||}||}t%||\} } | d}| d}t'||| |||\}}|||d}t)| ||}| |_|S)a#Computes the Multiscale Graph Correlation (MGC) test statistic. Specifically, for each point, MGC finds the :math:`k`-nearest neighbors for one property (e.g. cloud density), and the :math:`l`-nearest neighbors for the other property (e.g. grass wetness) [1]_. This pair :math:`(k, l)` is called the "scale". A priori, however, it is not know which scales will be most informative. So, MGC computes all distance pairs, and then efficiently computes the distance correlations for all scales. The local correlations illustrate which scales are relatively informative about the relationship. The key, therefore, to successfully discover and decipher relationships between disparate data modalities is to adaptively determine which scales are the most informative, and the geometric implication for the most informative scales. Doing so not only provides an estimate of whether the modalities are related, but also provides insight into how the determination was made. This is especially important in high-dimensional data, where simple visualizations do not reveal relationships to the unaided human eye. Characterizations of this implementation in particular have been derived from and benchmarked within in [2]_. 
    Parameters
    ----------
    x, y : ndarray
        If ``x`` and ``y`` have shapes ``(n, p)`` and ``(n, q)`` where `n` is
        the number of samples and `p` and `q` are the number of dimensions,
        then the MGC independence test will be run. Alternatively, ``x`` and
        ``y`` can have shapes ``(n, n)`` if they are distance or similarity
        matrices, and ``compute_distance`` must be set to ``None``. If ``x``
        and ``y`` have shapes ``(n, p)`` and ``(m, p)``, an unpaired
        two-sample MGC test will be run.
    compute_distance : callable, optional
        A function that computes the distance or similarity among the samples
        within each data matrix. Set to ``None`` if ``x`` and ``y`` are
        already distance matrices. The default uses the Euclidean norm metric.
        If you are calling a custom function, either create the distance
        matrix beforehand or create a function of the form
        ``compute_distance(x)`` where `x` is the data matrix for which
        pairwise distances are calculated.
    reps : int, optional
        The number of replications used to estimate the null when using the
        permutation test. The default is ``1000``.
    workers : int or map-like callable, optional
        If ``workers`` is an int the population is subdivided into ``workers``
        sections and evaluated in parallel (uses ``multiprocessing.Pool``).
        Supply ``-1`` to use all cores available to the Process.
        Alternatively supply a map-like callable, such as
        ``multiprocessing.Pool.map`` for evaluating the p-value in parallel.
        This evaluation is carried out as ``workers(func, iterable)``.
        Requires that `func` be pickleable. The default is ``1``.
    is_twosamp : bool, optional
        If `True`, a two-sample test will be run. If ``x`` and ``y`` have
        shapes ``(n, p)`` and ``(m, p)``, this option will be overridden and
        set to ``True``. Set to ``True`` if ``x`` and ``y`` both have shapes
        ``(n, p)`` and a two-sample test is desired. The default is
        ``False``. Note that this will not run if inputs are distance
        matrices.
    random_state : {None, int, `numpy.random.Generator`,
                    `numpy.random.RandomState`}, optional

        If `random_state` is None (or `np.random`), the
        `numpy.random.RandomState` singleton is used.
        If `random_state` is an int, a new ``RandomState`` instance is used,
        seeded with `random_state`.
        If `random_state` is already a ``Generator`` or ``RandomState``
        instance then that instance is used.

    Returns
    -------
    res : MGCResult
        An object containing attributes:

        statistic : float
            The sample MGC test statistic within ``[-1, 1]``.
        pvalue : float
            The p-value obtained via permutation.
        mgc_dict : dict
            Contains additional useful results:

                - mgc_map : ndarray
                    A 2D representation of the latent geometry of the
                    relationship.
                - opt_scale : (int, int)
                    The estimated optimal scale as a ``(x, y)`` pair.
                - null_dist : list
                    The null distribution derived from the permuted matrices.

    See Also
    --------
    pearsonr : Pearson correlation coefficient and p-value for testing
               non-correlation.
    kendalltau : Calculates Kendall's tau.
    spearmanr : Calculates a Spearman rank-order correlation coefficient.

    Notes
    -----
    A description of the process of MGC and applications on neuroscience data
    can be found in [1]_. It is performed using the following steps:

    #. Two distance matrices :math:`D^X` and :math:`D^Y` are computed and
       modified to be mean zero columnwise. This results in two
       :math:`n \times n` distance matrices :math:`A` and :math:`B` (the
       centering and unbiased modification) [3]_.

    #. For all values :math:`k` and :math:`l` from :math:`1, ..., n`,

       * The :math:`k`-nearest neighbor and :math:`l`-nearest neighbor graphs
         are calculated for each property. Here, :math:`G_k (i, j)` indicates
         the :math:`k`-smallest values of the :math:`i`-th row of :math:`A`
         and :math:`H_l (i, j)` indicates the :math:`l`-smallest values of
         the :math:`i`-th row of :math:`B`.

       * Let :math:`\circ` denote the entry-wise matrix product; then local
         correlations are summed and normalized using the following
         statistic:

    .. math::

        c^{kl} = \frac{\sum_{ij} A G_k B H_l}
                      {\sqrt{\sum_{ij} A^2 G_k \times \sum_{ij} B^2 H_l}}

    #. The MGC test statistic is the smoothed optimal local correlation of
       :math:`\{ c^{kl} \}`. Denote the smoothing operation as
       :math:`R(\cdot)` (which essentially sets all isolated large
       correlations to 0 and keeps connected large correlations the same as
       before, see [3]_). MGC is,

    .. math::

        MGC_n (x, y) = \max_{(k, l)} R \left(c^{kl} \left( x_n, y_n \right)
                                                    \right)

    The test statistic lies within :math:`[-1, 1]` since it is normalized.

    The p-value returned is calculated using a permutation test. This process
    is completed by first randomly permuting :math:`y` to estimate the null
    distribution and then calculating the probability of observing a test
    statistic, under the null, at least as extreme as the observed test
    statistic.

    MGC requires at least 5 samples to run with reliable results. It can also
    handle high-dimensional data sets.
    In addition, by manipulating the input data matrices, the two-sample
    testing problem can be reduced to the independence testing problem [4]_.
    Given sample data :math:`U` and :math:`V` of sizes :math:`p \times n` and
    :math:`p \times m`, data matrices :math:`X` and :math:`Y` can be created
    as follows:

    .. math::

        X = [U | V] \in \mathcal{R}^{p \times (n + m)}

        Y = [0_{1 \times n} | 1_{1 \times m}] \in \mathcal{R}^{(n + m)}

    Then, the MGC statistic can be calculated as normal. This methodology can
    be extended to similar tests such as distance correlation [4]_.

    .. versionadded:: 1.4.0

    References
    ----------
    .. [1] Vogelstein, J. T., Bridgeford, E. W., Wang, Q., Priebe, C. E.,
           Maggioni, M., & Shen, C. (2019). Discovering and deciphering
           relationships across disparate data modalities. ELife.
    .. [2] Panda, S., Palaniappan, S., Xiong, J., Swaminathan, A.,
           Ramachandran, S., Bridgeford, E. W., ... Vogelstein, J. T. (2019).
           mgcpy: A Comprehensive High Dimensional Independence Testing Python
           Package. :arXiv:`1907.02088`
    .. 
[3] Shen, C., Priebe, C.E., & Vogelstein, J. T. (2019). From distance correlation to multiscale graph correlation. Journal of the American Statistical Association. .. [4] Shen, C. & Vogelstein, J. T. (2018). The Exact Equivalence of Distance and Kernel Methods for Hypothesis Testing. :arXiv:`1806.05514` Examples -------- >>> import numpy as np >>> from scipy.stats import multiscale_graphcorr >>> x = np.arange(100) >>> y = x >>> res = multiscale_graphcorr(x, y) >>> res.statistic, res.pvalue (1.0, 0.001) To run an unpaired two-sample test, >>> x = np.arange(100) >>> y = np.arange(79) >>> res = multiscale_graphcorr(x, y) >>> res.statistic, res.pvalue # doctest: +SKIP (0.033258146255703246, 0.023) or, if shape of the inputs are the same, >>> x = np.arange(100) >>> y = x >>> res = multiscale_graphcorr(x, y, is_twosamp=True) >>> res.statistic, res.pvalue # doctest: +SKIP (-0.008021809890200488, 1.0) zx and y must be ndarraysr Nz&Expected a 2-D array `x`, found shape z&Expected a 2-D array `y`, found shape raise) nan_policyrzInputs contain infinitiesTzZShape mismatch, x and y must have shape [n, p] and [n, q] or have shape [n, p] and [m, p].z;MGC requires at least 5 samples to give reasonable results.z$Compute_distance must be a function.z1Number of reps must be an integer greater than 0.r)zThe number of replications is low (under 1000), and p-value calculations may be unreliable. Use the p-value result, with caution!) 
stacklevelz*Cannot run if inputs are distance matrices stat_mgc_map opt_scale)r7r8r9)mgc_maprLr=) isinstancer/ndarray ValueErrorndimnewaxisrrr5isinfastypefloat64callableintwarningswarnRuntimeWarning_two_sample_transformrr?rBr6)rrcompute_distancer7r8 is_twosampr9nxpxnypymsgr6 stat_dictrKrLr>r=rDress rrrbsN a $Jq"**,E344 vv{ am  1A!''KLLvv{ am  1A!''KLL WWFB WWFB!(!( vvbhhqkQ"&&!"5"9455 Rx 8JKL L Ava$% % A A $ %*:*F?@@ dC D1HLMM   c>a8  #IJ J$Q*1# Q  Q  1oOD)^,L+&I#1aD'0<>FI(&&(H D&( +CCH Jrct||d}|j\}}|dk(s|dk(r||dz |dz }||z}n)t|dz }t||}t ||\}}||d} || fS)aHelper function that calculates the MGC stat. See above for use. Parameters ---------- distx, disty : ndarray `distx` and `disty` have shapes ``(n, p)`` and ``(n, q)`` or ``(n, n)`` and ``(n, n)`` if distance matrices. Returns ------- stat : float The sample MGC test statistic within ``[-1, 1]``. stat_dict : dict Contains additional useful additional returns containing the following keys: - stat_mgc_map : ndarray MGC-map of the statistics. - opt_scale : (float, float) The estimated optimal scale as a ``(x, y)`` pair. mgc) global_corrr )rKrL)r rlen_threshold_mgc_map_smooth_mgc_map) distxdistyrKnmr6rL samp_size sig_connectrcs rrr}s2'ueGL   DAqAvaAE"1q5)E JN )yA *+|Di!-')I ?rc|j\}}dd|z z }||dz zdz dz }tjj|||dzdz }t |||dz |dz }||kD}t j |dkDrTtj|\}}t j|d \}}t j|dd dz} || k(}|St jd gg}|S) at Finds a connected region of significance in the MGC-map by thresholding. Parameters ---------- stat_mgc_map : ndarray All local correlations within ``[-1,1]``. samp_size : int The sample size of original data. Returns ------- sig_connect : ndarray A binary matrix with 1's indicating the significant region. r {Gz?r+g?rFrT) return_countsNF) rr betappfmaxr/r5r labeluniqueargmaxr3) rKrornrmper_sig thresholdrpr: label_counts max_labels rriris "   DAq 4)#$GY]+A-3I""&&w 9EIAMII|AE21q59:I*K vvkQ&,,[9 Q))KtD<IIl12./!3 !Y.  
hhy) rc|j\}}||dz |dz }||g}tjj|dk7rtj|tj dt ||zt||zk\rwt ||}tj||k\|z}||k\rI|}|\}} ||z| z} tj | |z}tj | |z} |dz| dzg}||fS)aRFinds the smoothed maximal within the significant region R. If area of R is too small it returns the last local correlation. Otherwise, returns the maximum within significant_connected_region. Parameters ---------- sig_connect : ndarray A binary matrix with 1's indicating the significant region. stat_mgc_map : ndarray All local correlations within ``[-1, 1]``. Returns ------- stat : float The sample MGC statistic within ``[-1, 1]``. opt_scale: (float, float) The estimated optimal scale as an ``(x, y)`` pair. r rrr) rr/linalgnormr5ceilrwminwhere) rprKrnrmr6rLmax_corrmax_corr_indexkl one_d_indicess rrjrjs *   DAq A q1u %DAI yy~~k"a' 66+ "''$Q*:";c!Qi"G G< 45H XX|x'?;&NON4%1 !A FF=)Q.FF=)A-qS!A#J ?rc|jd}|jd}tj||gd}tjtj|tj|gdj dd}||fS)aHelper function that concatenates x and y for two sample MGC stat. See above for use. Parameters ---------- u, v : ndarray `u` and `v` have shapes ``(n, p)`` and ``(m, p)``. Returns ------- x : ndarray Concatenate `u` and `v` along the ``axis = 0``. `x` thus has shape ``(2n, p)``. y : ndarray Label matrix for `x` where 0 refers to samples that comes from `u` and 1 refers to samples that come from `v`. `y` thus has shape ``(2n, 1)``. r)axisr )rr/ concatenatezerosonesreshape)uvr^r`rrs rr[r[sm( B B 1vA&A  bggbk2;CCBJA a4Kr)r)rN)rXnumpyr/scipy._lib._utilrrrrscipy._lib._bunchrscipy.spatial.distancer scipy.ndimager _statsr r __all__rr?rArBrrrirjr[r(rrrs~XX/('' ! " $4n kA2 G 1@d!"u4Xv.b*Z0fr
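The two-sample reduction described in the Notes of `multiscale_graphcorr` (stack the two samples into `X`, label their origin in `Y`) can be sketched in plain NumPy. This is a minimal illustration only, assuming both samples are stored row-wise with shapes `(n, p)` and `(m, p)`. The helper name `two_sample_to_independence` is hypothetical and is not part of SciPy.

```python
import numpy as np


def two_sample_to_independence(u, v):
    """Stack two samples and build a 0/1 label column (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    if u.ndim != 2 or v.ndim != 2 or u.shape[1] != v.shape[1]:
        raise ValueError("u and v must be 2-D with the same number of columns")
    # Rows of u come first, followed by rows of v: shape (n + m, p).
    x = np.concatenate([u, v], axis=0)
    # Label 0 marks rows taken from u, label 1 marks rows taken from v.
    labels = np.concatenate([np.zeros(u.shape[0]), np.ones(v.shape[0])])
    y = labels.reshape(-1, 1)
    return x, y


u = np.arange(6.0).reshape(3, 2)  # n = 3 samples, p = 2 dimensions
v = np.arange(4.0).reshape(2, 2)  # m = 2 samples, p = 2 dimensions
x, y = two_sample_to_independence(u, v)
```

With unequal sample sizes the stacked matrix has `n + m` rows rather than `2n`, and the resulting pair `(x, y)` can then be passed to an independence test in the usual way.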