"""Caching loader for the 20 newsgroups text classification dataset.

The description of the dataset is available on the official website at:

    http://people.csail.mit.edu/jrennie/20Newsgroups/

Quoting the introduction:

    The 20 Newsgroups data set is a collection of approximately 20,000
    newsgroup documents, partitioned (nearly) evenly across 20 different
    newsgroups. To the best of my knowledge, it was originally collected
    by Ken Lang, probably for his Newsweeder: Learning to filter netnews
    paper, though he does not explicitly mention this collection.

The 20 newsgroups collection has become a popular data set for
experiments in text applications of machine learning techniques, such as
text classification and text clustering.

This dataset loader will download the recommended "by date" variant of the
dataset, which features a point-in-time split between the train and test
sets. The compressed dataset size is around 14 MB. Once uncompressed, the
train set is 52 MB and the test set is 34 MB.
"""

import codecs
import logging
import os
import pickle
import re
import shutil
import tarfile
from contextlib import suppress
from numbers import Integral, Real

import joblib
import numpy as np
import scipy.sparse as sp

from .. import preprocessing
from ..feature_extraction.text import CountVectorizer
from ..utils import Bunch, check_random_state
from ..utils._param_validation import Interval, StrOptions, validate_params
from ..utils.fixes import tarfile_extractall
from . import get_data_home, load_files
from ._base import (
    RemoteFileMetadata,
    _convert_data_dataframe,
    _fetch_remote,
    _pkl_filepath,
    load_descr,
)

logger = logging.getLogger(__name__)
# The original data can be found at:
# https://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
ARCHIVE = RemoteFileMetadata(
    filename="20news-bydate.tar.gz",
    url="https://ndownloader.figshare.com/files/5975967",
    checksum="8f1b2514ca22a5ade8fbb9cfa5727df95fa587f4c87b786e15c759fa66d95610",
)

CACHE_NAME = "20news-bydate.pkz"
TRAIN_FOLDER = "20news-bydate-train"
TEST_FOLDER = "20news-bydate-test"


def _download_20newsgroups(target_dir, cache_path, n_retries, delay):
    """Download the 20 newsgroups data and store it as a zipped pickle."""
    train_path = os.path.join(target_dir, TRAIN_FOLDER)
    test_path = os.path.join(target_dir, TEST_FOLDER)

    os.makedirs(target_dir, exist_ok=True)

    logger.info("Downloading dataset from %s (14 MB)", ARCHIVE.url)
    archive_path = _fetch_remote(
        ARCHIVE, dirname=target_dir, n_retries=n_retries, delay=delay
    )

    logger.debug("Decompressing %s", archive_path)
    with tarfile.open(archive_path, "r:gz") as fp:
        tarfile_extractall(fp, path=target_dir)

    with suppress(FileNotFoundError):
        os.remove(archive_path)

    # Store a zipped pickle
    cache = dict(
        train=load_files(train_path, encoding="latin1"),
        test=load_files(test_path, encoding="latin1"),
    )
    compressed_content = codecs.encode(pickle.dumps(cache), "zlib_codec")
    with open(cache_path, "wb") as f:
        f.write(compressed_content)

    shutil.rmtree(target_dir)
    return cache


def strip_newsgroup_header(text):
    """
    Given text in "news" format, strip the headers, by removing everything
    before the first blank line.

    Parameters
    ----------
    text : str
        The text from which to remove the headers.
    """
    _before, _blankline, after = text.partition("\n\n")
    return after


_QUOTE_RE = re.compile(
    r"(writes in|writes:|wrote:|says:|said:|^In article|^Quoted from|^\||^>)"
)


def strip_newsgroup_quoting(text):
    """
    Given text in "news" format, strip lines beginning with the quote
    characters > or |, plus lines that often introduce a quoted section
    (for example, because they contain the string 'writes:').

    Parameters
    ----------
    text : str
        The text from which to remove quoted lines.
    """
    good_lines = [line for line in text.split("\n") if not _QUOTE_RE.search(line)]
    return "\n".join(good_lines)
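# The ".pkz" cache written above by _download_20newsgroups is nothing more
# than a pickle run through Python's zlib bytes-to-bytes codec. A minimal
# standalone round-trip sketch (toy payload and a temporary path standing in
# for the real Bunch cache and cache_path):
#
#     import codecs
#     import os
#     import pickle
#     import tempfile
#
#     # Toy payload standing in for the {"train": ..., "test": ...} cache.
#     cache = {"train": ["doc one", "doc two"], "test": ["doc three"]}
#
#     # Write: pickle, then compress with the zlib codec (".pkz" convention).
#     cache_path = os.path.join(tempfile.mkdtemp(), "demo.pkz")
#     with open(cache_path, "wb") as f:
#         f.write(codecs.encode(pickle.dumps(cache), "zlib_codec"))
#
#     # Read back: decompress, then unpickle (mirrors fetch_20newsgroups).
#     with open(cache_path, "rb") as f:
#         restored = pickle.loads(codecs.decode(f.read(), "zlib_codec"))

```python
import codecs
import os
import pickle
import tempfile

# Toy payload standing in for the {"train": ..., "test": ...} Bunch cache.
cache = {"train": ["doc one", "doc two"], "test": ["doc three"]}

# Write: pickle, then compress with the zlib bytes-to-bytes codec (".pkz").
cache_path = os.path.join(tempfile.mkdtemp(), "demo.pkz")
with open(cache_path, "wb") as f:
    f.write(codecs.encode(pickle.dumps(cache), "zlib_codec"))

# Read back: decompress, then unpickle, as fetch_20newsgroups does.
with open(cache_path, "rb") as f:
    restored = pickle.loads(codecs.decode(f.read(), "zlib_codec"))
```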
def strip_newsgroup_footer(text):
    """
    Given text in "news" format, attempt to remove a signature block.

    As a rough heuristic, we assume that signatures are set apart by either
    a blank line or a line made of hyphens, and that it is the last such
    line in the file (disregarding blank lines at the end).

    Parameters
    ----------
    text : str
        The text from which to remove the signature block.
    """
    lines = text.strip().split("\n")
    for line_num in range(len(lines) - 1, -1, -1):
        line = lines[line_num]
        if line.strip().strip("-") == "":
            break

    if line_num > 0:
        return "\n".join(lines[:line_num])
    else:
        return text


@validate_params(
    {
        "data_home": [str, os.PathLike, None],
        "subset": [StrOptions({"train", "test", "all"})],
        "categories": ["array-like", None],
        "shuffle": ["boolean"],
        "random_state": ["random_state"],
        "remove": [tuple],
        "download_if_missing": ["boolean"],
        "return_X_y": ["boolean"],
        "n_retries": [Interval(Integral, 1, None, closed="left")],
        "delay": [Interval(Real, 0.0, None, closed="neither")],
    },
    prefer_skip_nested_validation=True,
)
def fetch_20newsgroups(
    *,
    data_home=None,
    subset="train",
    categories=None,
    shuffle=True,
    random_state=42,
    remove=(),
    download_if_missing=True,
    return_X_y=False,
    n_retries=3,
    delay=1.0,
):
    """Load the filenames and data from the 20 newsgroups dataset (classification).

    Download it if necessary.

    =================   ==========
    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features                  text
    =================   ==========

    Read more in the :ref:`User Guide <20newsgroups_dataset>`.

    Parameters
    ----------
    data_home : str or path-like, default=None
        Specify a download and cache folder for the datasets. If None,
        all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

    subset : {'train', 'test', 'all'}, default='train'
        Select the dataset to load: 'train' for the training set, 'test'
        for the test set, 'all' for both, with shuffled ordering.

    categories : array-like, dtype=str, default=None
        If None (default), load all the categories.
        If not None, list of category names to load (other categories
        ignored).

    shuffle : bool, default=True
        Whether or not to shuffle the data: might be important for models that
        make the assumption that the samples are independent and identically
        distributed (i.i.d.), such as stochastic gradient descent.

    random_state : int, RandomState instance or None, default=42
        Determines random number generation for dataset shuffling. Pass an int
        for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    remove : tuple, default=()
        May contain any subset of ('headers', 'footers', 'quotes'). Each of
        these are kinds of text that will be detected and removed from the
        newsgroup posts, preventing classifiers from overfitting on
        metadata.

        'headers' removes newsgroup headers, 'footers' removes blocks at the
        ends of posts that look like signatures, and 'quotes' removes lines
        that appear to be quoting another post.

        'headers' follows an exact standard; the other filters are not always
        correct.

    download_if_missing : bool, default=True
        If False, raise an OSError if the data is not locally available
        instead of trying to download the data from the source site.

    return_X_y : bool, default=False
        If True, returns `(data.data, data.target)` instead of a Bunch
        object.

        .. versionadded:: 0.22

    n_retries : int, default=3
        Number of retries when HTTP errors are encountered.

        .. versionadded:: 1.5

    delay : float, default=1.0
        Number of seconds between retries.

        .. versionadded:: 1.5

    Returns
    -------
    bunch : :class:`~sklearn.utils.Bunch`
        Dictionary-like object, with the following attributes.

        data : list of shape (n_samples,)
            The data list to learn.
        target : ndarray of shape (n_samples,)
            The target labels.
        filenames : list of shape (n_samples,)
            The path to the location of the data.
        DESCR : str
            The full description of the dataset.
        target_names : list of shape (n_classes,)
            The names of target classes.

    (data, target) : tuple if `return_X_y=True`
        A tuple of two elements: the first is the list of raw texts of shape
        (n_samples,), the second is the ndarray of shape (n_samples,)
        containing the target labels.

        .. versionadded:: 0.22

    Examples
    --------
    >>> from sklearn.datasets import fetch_20newsgroups
    >>> cats = ['alt.atheism', 'sci.space']
    >>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
    >>> list(newsgroups_train.target_names)
    ['alt.atheism', 'sci.space']
    >>> newsgroups_train.filenames.shape
    (1073,)
    >>> newsgroups_train.target.shape
    (1073,)
    >>> newsgroups_train.target[:10]
    array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])
    """
    data_home = get_data_home(data_home=data_home)
    cache_path = _pkl_filepath(data_home, CACHE_NAME)
    twenty_home = os.path.join(data_home, "20news_home")
    cache = None
    if os.path.exists(cache_path):
        try:
            with open(cache_path, "rb") as f:
                compressed_content = f.read()
            uncompressed_content = codecs.decode(compressed_content, "zlib_codec")
            cache = pickle.loads(uncompressed_content)
        except Exception as e:
            print(80 * "_")
            print("Cache loading failed")
            print(80 * "_")
            print(e)

    if cache is None:
        if download_if_missing:
            logger.info("Downloading 20news dataset. This may take a few minutes.")
            cache = _download_20newsgroups(
                target_dir=twenty_home,
                cache_path=cache_path,
                n_retries=n_retries,
                delay=delay,
            )
        else:
            raise OSError("20Newsgroups dataset not found")

    if subset in ("train", "test"):
        data = cache[subset]
    elif subset == "all":
        data_lst = list()
        target = list()
        filenames = list()
        for subset in ("train", "test"):
            data = cache[subset]
            data_lst.extend(data.data)
            target.extend(data.target)
            filenames.extend(data.filenames)

        data.data = data_lst
        data.target = np.array(target)
        data.filenames = np.array(filenames)

    fdescr = load_descr("twenty_newsgroups.rst")

    data.DESCR = fdescr

    if "headers" in remove:
        data.data = [strip_newsgroup_header(text) for text in data.data]
    if "footers" in remove:
        data.data = [strip_newsgroup_footer(text) for text in data.data]
    if "quotes" in remove:
        data.data = [strip_newsgroup_quoting(text) for text in data.data]

    if categories is not None:
        labels = [(data.target_names.index(cat), cat) for cat in categories]
        # Sort the categories to have the ordering of the labels
        labels.sort()
        labels, categories = zip(*labels)
        mask = np.isin(data.target, labels)
        data.filenames = data.filenames[mask]
        data.target = data.target[mask]
        # searchsorted to have continuous labels
        data.target = np.searchsorted(labels, data.target)
        data.target_names = list(categories)
        # Use an object array to shuffle: avoids memory copy
        data_lst = np.array(data.data, dtype=object)
        data_lst = data_lst[mask]
        data.data = data_lst.tolist()

    if shuffle:
        random_state = check_random_state(random_state)
        indices = np.arange(data.target.shape[0])
        random_state.shuffle(indices)
        data.filenames = data.filenames[indices]
        data.target = data.target[indices]
        # Use an object array to shuffle: avoids memory copy
        data_lst = np.array(data.data, dtype=object)
        data_lst = data_lst[indices]
        data.data = data_lst.tolist()

    if return_X_y:
        return data.data, data.target

    return data


@validate_params(
    {
        "subset": [StrOptions({"train", "test", "all"})],
        "remove": [tuple],
        "data_home": [str, os.PathLike, None],
        "download_if_missing": ["boolean"],
        "return_X_y": ["boolean"],
        "normalize": ["boolean"],
        "as_frame": ["boolean"],
        "n_retries": [Interval(Integral, 1, None, closed="left")],
        "delay": [Interval(Real, 0.0, None, closed="neither")],
    },
    prefer_skip_nested_validation=True,
)
def fetch_20newsgroups_vectorized(
    *,
    subset="train",
    remove=(),
    data_home=None,
    download_if_missing=True,
    return_X_y=False,
    normalize=True,
    as_frame=False,
    n_retries=3,
    delay=1.0,
):
    """Load and vectorize the 20 newsgroups dataset (classification).

    Download it if necessary.

    This is a convenience function; the transformation is done using the
    default settings for
    :class:`~sklearn.feature_extraction.text.CountVectorizer`. For more
    advanced usage (stopword filtering, n-gram extraction, etc.), combine
    fetch_20newsgroups with a custom
    :class:`~sklearn.feature_extraction.text.CountVectorizer`,
    :class:`~sklearn.feature_extraction.text.HashingVectorizer`,
    :class:`~sklearn.feature_extraction.text.TfidfTransformer` or
    :class:`~sklearn.feature_extraction.text.TfidfVectorizer`.

    The resulting counts are normalized using
    :func:`sklearn.preprocessing.normalize` unless normalize is set to False.

    =================   ==========
    Classes                     20
    Samples total            18846
    Dimensionality          130107
    Features                  real
    =================   ==========

    Read more in the :ref:`User Guide <20newsgroups_dataset>`.

    Parameters
    ----------
    subset : {'train', 'test', 'all'}, default='train'
        Select the dataset to load: 'train' for the training set, 'test'
        for the test set, 'all' for both, with shuffled ordering.

    remove : tuple, default=()
        May contain any subset of ('headers', 'footers', 'quotes'). Each of
        these are kinds of text that will be detected and removed from the
        newsgroup posts, preventing classifiers from overfitting on
        metadata.

        'headers' removes newsgroup headers, 'footers' removes blocks at the
        ends of posts that look like signatures, and 'quotes' removes lines
        that appear to be quoting another post.

    data_home : str or path-like, default=None
        Specify a download and cache folder for the datasets. If None,
        all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

    download_if_missing : bool, default=True
        If False, raise an OSError if the data is not locally available
        instead of trying to download the data from the source site.

    return_X_y : bool, default=False
        If True, returns ``(data.data, data.target)`` instead of a Bunch
        object.

        .. versionadded:: 0.20

    normalize : bool, default=True
        If True, normalizes each document's feature vector to unit norm using
        :func:`sklearn.preprocessing.normalize`.

        .. versionadded:: 0.22

    as_frame : bool, default=False
        If True, the data is a pandas DataFrame including columns with
        appropriate dtypes (numeric, string, or categorical). The target is a
        pandas DataFrame or Series depending on the number of
        `target_columns`.

        .. versionadded:: 0.24

    n_retries : int, default=3
        Number of retries when HTTP errors are encountered.

        .. versionadded:: 1.5

    delay : float, default=1.0
        Number of seconds between retries.

        .. versionadded:: 1.5

    Returns
    -------
    bunch : :class:`~sklearn.utils.Bunch`
        Dictionary-like object, with the following attributes.

        data : {sparse matrix, dataframe} of shape (n_samples, n_features)
            The input data matrix. If ``as_frame`` is `True`, ``data`` is
            a pandas DataFrame with sparse columns.
        target : {ndarray, series} of shape (n_samples,)
            The target labels. If ``as_frame`` is `True`, ``target`` is a
            pandas Series.
        target_names : list of shape (n_classes,)
            The names of target classes.
        DESCR : str
            The full description of the dataset.
        frame : dataframe of shape (n_samples, n_features + 1)
            Only present when `as_frame=True`. Pandas DataFrame with ``data``
            and ``target``.

            .. versionadded:: 0.24

    (data, target) : tuple if ``return_X_y`` is True
        `data` and `target` would be of the format defined in the `Bunch`
        description above.

        .. versionadded:: 0.20

    Examples
    --------
    >>> from sklearn.datasets import fetch_20newsgroups_vectorized
    >>> newsgroups_vectorized = fetch_20newsgroups_vectorized(subset='test')
    >>> newsgroups_vectorized.data.shape
    (7532, 130107)
    >>> newsgroups_vectorized.target.shape
    (7532,)
    """
    data_home = get_data_home(data_home=data_home)
    filebase = "20newsgroup_vectorized"
    if remove:
        filebase += "remove-" + "-".join(remove)
    target_file = _pkl_filepath(data_home, filebase + ".pkl")

    # we shuffle but use a fixed seed for the memoization
    data_train = fetch_20newsgroups(
        data_home=data_home,
        subset="train",
        categories=None,
        shuffle=True,
        random_state=12,
        remove=remove,
        download_if_missing=download_if_missing,
        n_retries=n_retries,
        delay=delay,
    )

    data_test = fetch_20newsgroups(
        data_home=data_home,
        subset="test",
        categories=None,
        shuffle=True,
        random_state=12,
        remove=remove,
        download_if_missing=download_if_missing,
        n_retries=n_retries,
        delay=delay,
    )

    if os.path.exists(target_file):
        try:
            X_train, X_test, feature_names = joblib.load(target_file)
        except ValueError as e:
            raise ValueError(
                f"The cached dataset located in {target_file} was fetched "
                "with an older scikit-learn version and it is not compatible "
                "with the scikit-learn version imported. You need to "
                f"manually delete the file: {target_file}."
            ) from e
    else:
        vectorizer = CountVectorizer(dtype=np.int16)
        X_train = vectorizer.fit_transform(data_train.data).tocsr()
        X_test = vectorizer.transform(data_test.data).tocsr()
        feature_names = vectorizer.get_feature_names_out()

        joblib.dump((X_train, X_test, feature_names), target_file, compress=9)

    # the data is stored as int16 for compactness, but normalize needs floats
    if normalize:
        X_train = X_train.astype(np.float64)
        X_test = X_test.astype(np.float64)
        preprocessing.normalize(X_train, copy=False)
        preprocessing.normalize(X_test, copy=False)

    target_names = data_train.target_names

    if subset == "train":
        data = X_train
        target = data_train.target
    elif subset == "test":
        data = X_test
        target = data_test.target
    elif subset == "all":
        data = sp.vstack((X_train, X_test)).tocsr()
        target = np.concatenate((data_train.target, data_test.target))

    fdescr = load_descr("twenty_newsgroups.rst")

    frame = None
    target_name = ["category_class"]

    if as_frame:
        frame, data, target = _convert_data_dataframe(
            "fetch_20newsgroups_vectorized",
            data,
            target,
            feature_names,
            target_names=target_name,
            sparse_data=True,
        )

    if return_X_y:
        return data, target

    return Bunch(
        data=data,
        target=target,
        frame=frame,
        target_names=target_names,
        feature_names=feature_names,
        DESCR=fdescr,
    )