`L iU.dZddlZddlmZddlmZmZddlmZm Z m Z ddl m Z m Z ddlZddlZddlmZddlmZdd lmZdd lmZmZmZd d lmZd d lmZm Z m!Z!m"Z"d dl#m$Z$edddedddedddedddedddfZ%edd d!Z&ejNe(Z)ee*edgehd"gd#gd$gd#gd#geed dd%&geed'dd(&gd)d*+dd,d*dd-d-d.d/d)d0Z+d1Z,d2Z-y)3zhRCV1 dataset. The dataset page is available at http://jmlr.csail.mit.edu/papers/volume5/lewis04a/ N)GzipFile)IntegralReal)PathLikemakedirsremove)existsjoin)Bunch)shuffle)Interval StrOptionsvalidate_params) get_data_home)RemoteFileMetadata _fetch_remote _pkl_filepath load_descr)load_svmlight_filesz.https://ndownloader.figshare.com/files/5976069@ed40f7e418d10484091b059703eeb95ae3199fe042891dcec4be6696b9968374z lyrl2004_vectors_test_pt0.dat.gz)urlchecksumfilenamez.https://ndownloader.figshare.com/files/5976066@87700668ae45d45d5ca1ef6ae9bd81ab0f5ec88cc95dcef9ae7838f727a13aa6z lyrl2004_vectors_test_pt1.dat.gzz.https://ndownloader.figshare.com/files/5976063@48143ac703cbe33299f7ae9f4995db49a258690f60e5debbff8995c34841c7f5z lyrl2004_vectors_test_pt2.dat.gzz.https://ndownloader.figshare.com/files/5976060@dfcb0d658311481523c6e6ca0c3f5a3e1d3d12cde5d7a8ce629a9006ec7dbb39z lyrl2004_vectors_test_pt3.dat.gzz.https://ndownloader.figshare.com/files/5976057@5468f656d0ba7a83afc7ad44841cf9a53048a5c083eedc005dcdb5cc768924aezlyrl2004_vectors_train.dat.gzz.https://ndownloader.figshare.com/files/5976048@2a98e5e5d8b770bded93afc8930d88299474317fe14181aee1466cc754d0d1c1zrcv1v2.topics.qrels.gz>alltesttrainboolean random_stateleft)closedgneither) data_homesubsetdownload_if_missingr%r return_X_y n_retriesdelayT)prefer_skip_nested_validationr!Fg?cB d}d} d} d} t|}t|d} |rt| s t| t | d} t | d}t | d }t | d }|r[t| r t|sDg}t D]N}t jd |jzt|| || }|jt| Pt|| }tj|d|d|d|d|dgj}t!j"|d|d|d|d|df}|j%t j&d}t)j*|| dt)j*||d|D]'}|j-t/|j0)n*t)j2| }t)j2|}|rt|r t|st jd t4jztt4| || }d}d}d}t!j6|| ft j8}t!j6|t j:}i}t|d5}|D]k}|j=d j?d!} tA| dk(s2| \}!}"}#|!|vr |dz }|||!<tC|"}"|"|k7r |"}|dz }|"||<d||||!f<m d"d"d"t/|tE||}$||$d"d"f}t!jF| tH}%|jKD] }&|&|%||&< t!jL|%}'|%|'}%tjN|d"d"|'f}t)j*||dt)j*|%|dn*t)j2|}t)j2|}%|d#k(rnP|d$k(r|d"| d"d"f}|d"| d"d"f}|d"| }n/|d%k(r|| d"d"d"f}|| d"d"d"f}|| d"}ntQd&|z|rtS||||'\}}}tUd(}(|r||fStW||||%|()S#1swYxYw)*a- Load the RCV1 multilabel dataset (classification). Download it if necessary. Version: RCV1-v2, vectors, full sets, topics multilabels. ================= ===================== Classes 103 Samples total 804414 Dimensionality 47236 Features real, between 0 and 1 ================= ===================== Read more in the :ref:`User Guide `. .. versionadded:: 0.17 Parameters ---------- data_home : str or path-like, default=None Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in '~/scikit_learn_data' subfolders. subset : {'train', 'test', 'all'}, default='all' Select the dataset to load: 'train' for the training set (23149 samples), 'test' for the test set (781265 samples), 'all' for both, with the training samples first if shuffle is False. This follows the official LYRL2004 chronological split. download_if_missing : bool, default=True If False, raise an OSError if the data is not locally available instead of trying to download the data from the source site. random_state : int, RandomState instance or None, default=None Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls. See :term:`Glossary `. shuffle : bool, default=False Whether to shuffle dataset. return_X_y : bool, default=False If True, returns ``(dataset.data, dataset.target)`` instead of a Bunch object. See below for more information about the `dataset.data` and `dataset.target` object. .. versionadded:: 0.20 n_retries : int, default=3 Number of retries when HTTP errors are encountered. .. versionadded:: 1.5 delay : float, default=1.0 Number of seconds between retries. .. versionadded:: 1.5 Returns ------- dataset : :class:`~sklearn.utils.Bunch` Dictionary-like object. Returned only if `return_X_y` is False. `dataset` has the following attributes: - data : sparse matrix of shape (804414, 47236), dtype=np.float64 The array has 0.16% of non zero values. Will be of CSR format. - target : sparse matrix of shape (804414, 103), dtype=np.uint8 Each sample has a value of 1 in its categories, and 0 in others. The array has 3.15% of non zero values. Will be of CSR format. - sample_id : ndarray of shape (804414,), dtype=np.uint32, Identification number of each sample, as ordered in dataset.data. - target_names : ndarray of shape (103,), dtype=object Names of each target (RCV1 topics), as ordered in dataset.target. - DESCR : str Description of the RCV1 dataset. (data, target) : tuple A tuple consisting of `dataset.data` and `dataset.target`, as described above. Returned only if `return_X_y` is True. .. versionadded:: 0.20 Examples -------- >>> from sklearn.datasets import fetch_rcv1 >>> rcv1 = fetch_rcv1() >>> rcv1.data.shape (804414, 47236) >>> rcv1.target.shape (804414, 103) i>F igimZ)r)RCV1z samples.pklz sample_id.pklzsample_topics.pklztopics_names.pklzDownloading %s)dirnamer-r.)r) n_featuresrr  rr0F)copy)compressdtyperb)rmodeascii Nr!r#r"zLUnknown subset parameter. Got '%s' instead of one of ('all', 'train', test'))r%zrcv1.rst)datatarget sample_id target_namesDESCR),rr r rr XY_METADATAloggerinforrappendrrspvstacktocsrnphstackastypeuint32joblibdumpclosernameloadTOPICS_METADATAzerosuint8int32decodesplitlenint_find_permutationemptyobjectkeysargsort csr_matrix ValueErrorshuffle_rr ))r)r*r+r%r r,r-r. N_SAMPLES N_FEATURES N_CATEGORIESN_TRAINrcv1_dir samples_pathsample_id_pathsample_topics_path topics_pathfileseach file_pathXyXrGftopics_archive_pathn_catn_doc doc_previousy sample_id_biscategory_nameslineline_componentscatdoc_ permutation categorieskorderfdescrs) \/mnt/ssd/data/python-lab/Trading/venv/lib/python3.12/site-packages/sklearn/datasets/_rcv1.py fetch_rcv1rLsfIJLG 2IIv&Hh X  =9L"8_=N&x1DE*<=KF<$8~@V 7D KK(4883 4%h)5I LL95 6  7!: > IIr!ubeRUBqE2a59 : @ @ BIIr!ubeRUBqE2a5AB $$RYYU$;  A|a0 I~: A GGI 166N  KK %KK/  % &f[.A $':'::;+ X%   HHi.bhh ?"((;  2 > 6! 6"&++g"6"<"rse")) 'KKOO4 <S3  <S3  <S3  <S3  <S0+ <%8 O %   8 $8T*678 ){'(; kxD@A4d9=> #'    d dN r