"""Dataset is currently unstable. APIs subject to change without notice."""

import pyarrow as pa
from pyarrow.util import _is_iterable, _stringify_path, _is_path_like

try:
    from pyarrow._dataset import (  # noqa
        CsvFileFormat,
        CsvFragmentScanOptions,
        JsonFileFormat,
        JsonFragmentScanOptions,
        Dataset,
        DatasetFactory,
        DirectoryPartitioning,
        FeatherFileFormat,
        FilenamePartitioning,
        FileFormat,
        FileFragment,
        FileSystemDataset,
        FileSystemDatasetFactory,
        FileSystemFactoryOptions,
        FileWriteOptions,
        Fragment,
        FragmentScanOptions,
        HivePartitioning,
        IpcFileFormat,
        IpcFileWriteOptions,
        InMemoryDataset,
        Partitioning,
        PartitioningFactory,
        Scanner,
        TaggedRecordBatch,
        UnionDataset,
        UnionDatasetFactory,
        WrittenFile,
        get_partition_keys,
        get_partition_keys as _get_partition_keys,  # backwards compatibility
        _filesystemdataset_write,
    )
except ImportError as exc:
    raise ImportError(
        f"The pyarrow installation is not built with support for "
        f"'dataset' ({str(exc)})"
    ) from None

# keep Expression functionality exposed here for backwards compatibility
from pyarrow.compute import Expression, scalar, field  # noqa


_orc_available = False
_orc_msg = (
    "The pyarrow installation is not built with support for the ORC file "
    "format."
)

try:
    from pyarrow._dataset_orc import OrcFileFormat
    _orc_available = True
except ImportError:
    pass

_parquet_available = False
_parquet_msg = (
    "The pyarrow installation is not built with support for the Parquet "
    "file format."
)

try:
    from pyarrow._dataset_parquet import (  # noqa
        ParquetDatasetFactory,
        ParquetFactoryOptions,
        ParquetFileFormat,
        ParquetFileFragment,
        ParquetFileWriteOptions,
        ParquetFragmentScanOptions,
        ParquetReadOptions,
        RowGroupInfo,
    )
    _parquet_available = True
except ImportError:
    pass

try:
    from pyarrow._dataset_parquet_encryption import (  # noqa
        ParquetDecryptionConfig,
        ParquetEncryptionConfig,
    )
except ImportError:
    pass


def __getattr__(name):
    if name == "OrcFileFormat" and not _orc_available:
        raise ImportError(_orc_msg)

    if name == "ParquetFileFormat" and not _parquet_available:
        raise ImportError(_parquet_msg)

    raise AttributeError(
        f"module 'pyarrow.dataset' has no attribute '{name}'"
    )
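# Illustrative sketch (comment only, not part of the module source): the
# module-level __getattr__ above is what turns a missing optional build
# into a readable ImportError rather than a bare AttributeError:
#
#   import pyarrow.dataset as ds
#   ds.OrcFileFormat     # raises ImportError(_orc_msg) if ORC support
#                        # was not compiled into this pyarrow build
#   ds.NoSuchName        # raises AttributeError for genuinely unknown names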
def partitioning(schema=None, field_names=None, flavor=None,
                 dictionaries=None):
    """
    Specify a partitioning scheme.

    The supported schemes include:

    - "DirectoryPartitioning": this scheme expects one segment in the file
      path for each field in the specified schema (all fields are required
      to be present). For example given schema<year:int16, month:int8> the
      path "/2009/11" would be parsed to ("year"_ == 2009 and
      "month"_ == 11).
    - "HivePartitioning": a scheme for "/$key=$value/" nested directories as
      found in Apache Hive. This is a multi-level, directory based
      partitioning scheme. Data is partitioned by static values of a
      particular column in the schema. Partition keys are represented in the
      form $key=$value in directory names. Field order is ignored, as are
      missing or unrecognized field names.
      For example, given schema<year:int16, month:int8, day:int8>, a possible
      path would be "/year=2009/month=11/day=15" (but the field order does
      not need to match).
    - "FilenamePartitioning": this scheme expects the partitions will have
      filenames containing the field values separated by "_".
      For example, given schema<year:int16, month:int8, day:int8>, a possible
      partition filename "2009_11_part-0.parquet" would be parsed to
      ("year"_ == 2009 and "month"_ == 11).

    Parameters
    ----------
    schema : pyarrow.Schema, default None
        The schema that describes the partitions present in the file path.
        If not specified, and `field_names` and/or `flavor` are specified,
        the schema will be inferred from the file path (and a
        PartitioningFactory is returned).
    field_names : list of str, default None
        A list of strings (field names). If specified, the schema's types are
        inferred from the file paths (only valid for DirectoryPartitioning).
    flavor : str, default None
        The default is DirectoryPartitioning. Specify ``flavor="hive"`` for
        a HivePartitioning, and ``flavor="filename"`` for a
        FilenamePartitioning.
    dictionaries : dict[str, Array]
        If the type of any field of `schema` is a dictionary type, the
        corresponding entry of `dictionaries` must be an array containing
        every value which may be taken by the corresponding column or an
        error will be raised in parsing. Alternatively, pass `infer` to have
        Arrow discover the dictionary values, in which case a
        PartitioningFactory is returned.

    Returns
    -------
    Partitioning or PartitioningFactory
        The partitioning scheme

    Examples
    --------
    Specify the Schema for paths like "/2009/June":

    >>> import pyarrow as pa
    >>> import pyarrow.dataset as ds
    >>> part = ds.partitioning(pa.schema([("year", pa.int16()),
    ...                                   ("month", pa.string())]))

    or let the types be inferred by only specifying the field names:

    >>> part = ds.partitioning(field_names=["year", "month"])

    For paths like "/2009/June", the year will be inferred as int32 while
    month will be inferred as string.

    Specify a Schema with dictionary encoding, providing dictionary values:

    >>> part = ds.partitioning(
    ...     pa.schema([
    ...         ("year", pa.int16()),
    ...         ("month", pa.dictionary(pa.int8(), pa.string()))
    ...     ]),
    ...     dictionaries={
    ...         "month": pa.array(["January", "February", "March"]),
    ...     })

    Alternatively, specify a Schema with dictionary encoding, but have Arrow
    infer the dictionary values:

    >>> part = ds.partitioning(
    ...     pa.schema([
    ...         ("year", pa.int16()),
    ...         ("month", pa.dictionary(pa.int8(), pa.string()))
    ...     ]),
    ...     dictionaries="infer")

    Create a Hive scheme for a path like "/year=2009/month=11":

    >>> part = ds.partitioning(
    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]),
    ...     flavor="hive")

    A Hive scheme can also be discovered from the directory structure (and
    types will be inferred):

    >>> part = ds.partitioning(flavor="hive")
    """
    if flavor is None:
        # default flavor
        if schema is not None:
            if field_names is not None:
                raise ValueError(
                    "Cannot specify both 'schema' and 'field_names'")
            if dictionaries == 'infer':
                return DirectoryPartitioning.discover(schema=schema)
            return DirectoryPartitioning(schema, dictionaries)
        elif field_names is not None:
            if isinstance(field_names, list):
                return DirectoryPartitioning.discover(field_names)
            else:
                raise TypeError(
                    "Expected list of field names, got {}".format(
                        type(field_names)))
        else:
            raise ValueError(
                "For the default directory flavor, need to specify "
                "a Schema or a list of field names")
    if flavor == "filename":
        if schema is not None:
            if field_names is not None:
                raise ValueError(
                    "Cannot specify both 'schema' and 'field_names'")
            if dictionaries == 'infer':
                return FilenamePartitioning.discover(schema=schema)
            return FilenamePartitioning(schema, dictionaries)
        elif field_names is not None:
            if isinstance(field_names, list):
                return FilenamePartitioning.discover(field_names)
            else:
                raise TypeError(
                    "Expected list of field names, got {}".format(
                        type(field_names)))
        else:
            raise ValueError(
                "For the filename flavor, need to specify "
                "a Schema or a list of field names")
    elif flavor == 'hive':
        if field_names is not None:
            raise ValueError(
                "Cannot specify 'field_names' for flavor 'hive'")
        elif schema is not None:
            if isinstance(schema, pa.Schema):
                if dictionaries == 'infer':
                    return HivePartitioning.discover(schema=schema)
                return HivePartitioning(schema, dictionaries)
            else:
                raise TypeError(
                    "Expected Schema for 'schema', got {}".format(
                        type(schema)))
        else:
            return HivePartitioning.discover()
    else:
        raise ValueError("Unsupported flavor")


def _ensure_partitioning(scheme):
    """
    Validate input and return a Partitioning(Factory).

    It passes None through if no partitioning scheme is defined.
    """
    if scheme is None:
        pass
    elif isinstance(scheme, str):
        scheme = partitioning(flavor=scheme)
    elif isinstance(scheme, list):
        scheme = partitioning(field_names=scheme)
    elif isinstance(scheme, (Partitioning, PartitioningFactory)):
        pass
    else:
        raise ValueError(
            "Expected Partitioning or PartitioningFactory, got {}".format(
                type(scheme)))
    return scheme


def _ensure_format(obj):
    if isinstance(obj, FileFormat):
        return obj
    elif obj == "parquet":
        if not _parquet_available:
            raise ValueError(_parquet_msg)
        return ParquetFileFormat()
    elif obj in {"ipc", "arrow"}:
        return IpcFileFormat()
    elif obj == "feather":
        return FeatherFileFormat()
    elif obj == "csv":
        return CsvFileFormat()
    elif obj == "orc":
        if not _orc_available:
            raise ValueError(_orc_msg)
        return OrcFileFormat()
    elif obj == "json":
        return JsonFileFormat()
    else:
        raise ValueError("format '{}' is not supported".format(obj))
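# Usage sketch (comment only, mirroring the doctests above): a discovered
# partitioning factory is typically handed straight to dataset(), which
# infers the partition field types during discovery:
#
#   import pyarrow.dataset as ds
#   part = ds.partitioning(flavor="hive")        # PartitioningFactory
#   data = ds.dataset("path/to/root", format="parquet", partitioning=part)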
def _ensure_multiple_sources(paths, filesystem=None):
    """
    Treat a list of paths as files belonging to a single file system

    If the file system is local then also validates that all paths
    are referencing existing *files*; otherwise any non-file paths will be
    silently skipped (for example on a remote filesystem).

    Parameters
    ----------
    paths : list of path-like
        Note that URIs are not allowed.
    filesystem : FileSystem or str, optional
        If an URI is passed, then its path component will act as a prefix
        for the file paths.

    Returns
    -------
    (FileSystem, list of str)
        File system object and a list of normalized paths.

    Raises
    ------
    TypeError
        If the passed filesystem has wrong type.
    IOError
        If the file system is local and a referenced path is not available
        or not a file.
    """
    from pyarrow.fs import (
        LocalFileSystem, SubTreeFileSystem, _MockFileSystem, FileType,
        _ensure_filesystem
    )

    if filesystem is None:
        # fall back to local file system as the default
        filesystem = LocalFileSystem()
    else:
        # construct a filesystem if it is a valid URI
        filesystem = _ensure_filesystem(filesystem)

    is_local = (
        isinstance(filesystem, (LocalFileSystem, _MockFileSystem)) or
        (isinstance(filesystem, SubTreeFileSystem) and
         isinstance(filesystem.base_fs, LocalFileSystem))
    )

    # allow normalizing irregular paths such as Windows local paths
    paths = [filesystem.normalize_path(_stringify_path(p)) for p in paths]

    # validate that all of the paths are pointing to existing *files*
    if is_local:
        for info in filesystem.get_file_info(paths):
            file_type = info.type
            if file_type == FileType.File:
                continue
            elif file_type == FileType.NotFound:
                raise FileNotFoundError(info.path)
            elif file_type == FileType.Directory:
                raise IsADirectoryError(
                    'Path {} points to a directory, but only file paths are '
                    'supported. To construct a nested or union dataset pass '
                    'a list of dataset objects instead.'.format(info.path)
                )
            else:
                raise IOError(
                    'Path {} exists but its type is unknown (could be a '
                    'special file such as a Unix socket or character device, '
                    'or Windows NUL / CON / ...)'.format(info.path)
                )

    return filesystem, paths


def _ensure_single_source(path, filesystem=None):
    """
    Treat path as either a recursively traversable directory or a single file.

    Parameters
    ----------
    path : path-like
    filesystem : FileSystem or str, optional
        If an URI is passed, then its path component will act as a prefix
        for the file paths.

    Returns
    -------
    (FileSystem, list of str or fs.Selector)
        File system object and either a single item list pointing to a file
        or an fs.Selector object pointing to a directory.

    Raises
    ------
    TypeError
        If the passed filesystem has wrong type.
    FileNotFoundError
        If the referenced file or directory doesn't exist.
    """
    from pyarrow.fs import (
        FileType, FileSelector, _resolve_filesystem_and_path
    )

    # at this point we already checked that `path` is a path-like
    filesystem, path = _resolve_filesystem_and_path(path, filesystem)

    # ensure that the path is normalized before passing to dataset discovery
    path = filesystem.normalize_path(path)

    # retrieve the file descriptor
    file_info = filesystem.get_file_info(path)

    # depending on the path type either return with a recursive
    # directory selector or as a list containing a single file
    if file_info.type == FileType.Directory:
        paths_or_selector = FileSelector(path, recursive=True)
    elif file_info.type == FileType.File:
        paths_or_selector = [path]
    else:
        raise FileNotFoundError(path)

    return filesystem, paths_or_selector


def _filesystem_dataset(source, schema=None, filesystem=None,
                        partitioning=None, format=None,
                        partition_base_dir=None,
                        exclude_invalid_files=None,
                        selector_ignore_prefixes=None):
    """
    Create a FileSystemDataset which can be used to build a Dataset.

    Parameters are documented in the dataset function.

    Returns
    -------
    FileSystemDataset
    """
    from pyarrow.fs import LocalFileSystem, _ensure_filesystem, FileInfo

    format = _ensure_format(format or 'parquet')
    partitioning = _ensure_partitioning(partitioning)

    if isinstance(source, (list, tuple)):
        if source and isinstance(source[0], FileInfo):
            if filesystem is None:
                # fall back to local file system as the default
                fs = LocalFileSystem()
            else:
                # construct a filesystem if it is a valid URI
                fs = _ensure_filesystem(filesystem)
            paths_or_selector = source
        else:
            fs, paths_or_selector = _ensure_multiple_sources(
                source, filesystem)
    else:
        fs, paths_or_selector = _ensure_single_source(source, filesystem)

    options = FileSystemFactoryOptions(
        partitioning=partitioning,
        partition_base_dir=partition_base_dir,
        exclude_invalid_files=exclude_invalid_files,
        selector_ignore_prefixes=selector_ignore_prefixes
    )
    factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)

    return factory.finish(schema)


def _in_memory_dataset(source, schema=None, **kwargs):
    if any(v is not None for v in kwargs.values()):
        raise ValueError(
            "For in-memory datasets, you cannot pass any additional "
            "arguments")
    return InMemoryDataset(source, schema)


def _union_dataset(children, schema=None, **kwargs):
    if any(v is not None for v in kwargs.values()):
        raise ValueError(
            "When passing a list of Datasets, you cannot pass any additional "
            "arguments")

    if schema is None:
        # unify the children datasets' schemas
        schema = pa.unify_schemas([child.schema for child in children])

    for child in children:
        if getattr(child, "_scan_options", None):
            raise ValueError(
                "Creating an UnionDataset from filtered or projected "
                "Datasets is currently not supported. Union the unfiltered "
                "datasets and apply the filter to the resulting union.")

    # create datasets with the requested schema
    children = [child.replace_schema(schema) for child in children]

    return UnionDataset(schema, children)
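# Illustrative comment: the two _ensure_* helpers above account for the
# differing behavior of the public entry point for a directory vs. a list
# of files —
#
#   ds.dataset("data_dir")                    # _ensure_single_source:
#                                             #   recursive FileSelector
#   ds.dataset(["a.parquet", "b.parquet"])    # _ensure_multiple_sources:
#                                             #   explicit files, validated
#                                             #   up front on local systems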
def parquet_dataset(metadata_path, schema=None, filesystem=None, format=None,
                    partitioning=None, partition_base_dir=None):
    """
    Create a FileSystemDataset from a `_metadata` file created via
    `pyarrow.parquet.write_metadata`.

    Parameters
    ----------
    metadata_path : path
        Path pointing to a single file parquet metadata file
    schema : Schema, optional
        Optionally provide the Schema for the Dataset, in which case it will
        not be inferred from the source.
    filesystem : FileSystem or URI string, default None
        If a single path is given as source and filesystem is None, then the
        filesystem will be inferred from the path.
        If an URI string is passed, then a filesystem object is constructed
        using the URI's optional path component as a directory prefix. See
        the examples below.
        Note that the URIs on Windows must follow 'file:///C:...' or
        'file:/C:...' patterns.
    format : ParquetFileFormat
        An instance of a ParquetFileFormat if special options need to be
        passed.
    partitioning : Partitioning, PartitioningFactory, str, list of str
        The partitioning scheme specified with the ``partitioning()``
        function. A flavor string can be used as shortcut, and with a list of
        field names a DirectoryPartitioning will be inferred.
    partition_base_dir : str, optional
        For the purposes of applying the partitioning, paths will be
        stripped of the partition_base_dir. Files not matching the
        partition_base_dir prefix will be skipped for partitioning discovery.
        The ignored files will still be part of the Dataset, but will not
        have partition information.

    Returns
    -------
    FileSystemDataset
        The dataset corresponding to the given metadata
    """
    from pyarrow.fs import LocalFileSystem, _ensure_filesystem

    if format is None:
        format = ParquetFileFormat()
    elif not isinstance(format, ParquetFileFormat):
        raise ValueError("format argument must be a ParquetFileFormat")

    if filesystem is None:
        filesystem = LocalFileSystem()
    else:
        filesystem = _ensure_filesystem(filesystem)

    metadata_path = filesystem.normalize_path(_stringify_path(metadata_path))
    options = ParquetFactoryOptions(
        partition_base_dir=partition_base_dir,
        partitioning=_ensure_partitioning(partitioning)
    )

    factory = ParquetDatasetFactory(
        metadata_path, filesystem, format, options=options)
    return factory.finish(schema)
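# Usage sketch (comment only): reading a dataset back through its Parquet
# metadata sidecar; "_metadata" is the conventional filename written by
# pyarrow.parquet.write_metadata(), and the directory name is hypothetical.
#
#   import pyarrow.dataset as ds
#   data = ds.parquet_dataset("my_dataset/_metadata")
#   table = data.to_table()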
def dataset(source, schema=None, format=None, filesystem=None,
            partitioning=None, partition_base_dir=None,
            exclude_invalid_files=None, ignore_prefixes=None):
    """
    Open a dataset.

    Datasets provides functionality to efficiently work with tabular,
    potentially larger than memory, and multi-file datasets.

    - A unified interface for different sources, like Parquet and Feather
    - Discovery of sources (crawling directories, handling directory-based
      partitioned datasets, basic schema normalization)
    - Optimized reading with predicate pushdown (filtering rows), projection
      (selecting columns), parallel reading or fine-grained managing of
      tasks.

    Note that this is the high-level API, to have more control over the
    dataset construction use the low-level API classes (FileSystemDataset,
    FileSystemDatasetFactory, etc.)

    Parameters
    ----------
    source : path, list of paths, dataset, list of datasets, (list of) RecordBatch or Table, iterable of RecordBatch, RecordBatchReader, or URI
        Path pointing to a single file:
            Open a FileSystemDataset from a single file.
        Path pointing to a directory:
            The directory gets discovered recursively according to a
            partitioning scheme if given.
        List of file paths:
            Create a FileSystemDataset from explicitly given files. The files
            must be located on the same filesystem given by the filesystem
            parameter.
            Note that, in contrast to constructing from a single file,
            passing URIs as paths is not allowed.
        List of datasets:
            A nested UnionDataset gets constructed, it allows arbitrary
            composition of other datasets.
            Note that additional keyword arguments are not allowed.
        (List of) batches or tables, iterable of batches, or RecordBatchReader:
            Create an InMemoryDataset. If an iterable or empty list is given,
            a schema must also be given. If an iterable or RecordBatchReader
            is given, the resulting dataset can only be scanned once; further
            attempts will raise an error.
    schema : Schema, optional
        Optionally provide the Schema for the Dataset, in which case it will
        not be inferred from the source.
    format : FileFormat or str
        Currently "parquet", "ipc"/"arrow"/"feather", "csv", "json", and
        "orc" are supported. For Feather, only version 2 files are supported.
    filesystem : FileSystem or URI string, default None
        If a single path is given as source and filesystem is None, then the
        filesystem will be inferred from the path.
        If an URI string is passed, then a filesystem object is constructed
        using the URI's optional path component as a directory prefix. See
        the examples below.
        Note that the URIs on Windows must follow 'file:///C:...' or
        'file:/C:...' patterns.
    partitioning : Partitioning, PartitioningFactory, str, list of str
        The partitioning scheme specified with the ``partitioning()``
        function. A flavor string can be used as shortcut, and with a list of
        field names a DirectoryPartitioning will be inferred.
    partition_base_dir : str, optional
        For the purposes of applying the partitioning, paths will be
        stripped of the partition_base_dir. Files not matching the
        partition_base_dir prefix will be skipped for partitioning discovery.
        The ignored files will still be part of the Dataset, but will not
        have partition information.
    exclude_invalid_files : bool, optional (default True)
        If True, invalid files will be excluded (file format specific check).
        This will incur IO for each file in a serial and single threaded
        fashion. Disabling this feature will skip the IO, but unsupported
        files may be present in the Dataset (resulting in an error at scan
        time).
    ignore_prefixes : list, optional
        Files matching any of these prefixes will be ignored by the
        discovery process. This is matched to the basename of a path.
        By default this is ['.', '_'].
        Note that discovery happens only if a directory is passed as source.

    Returns
    -------
    dataset : Dataset
        Either a FileSystemDataset or a UnionDataset depending on the source
        parameter.

    Examples
    --------
    Creating an example Table:

    >>> import pyarrow as pa
    >>> import pyarrow.parquet as pq
    >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
    ...                   'n_legs': [2, 2, 4, 4, 5, 100],
    ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
    ...                              "Brittle stars", "Centipede"]})
    >>> pq.write_table(table, "file.parquet")

    Opening a single file:

    >>> import pyarrow.dataset as ds
    >>> dataset = ds.dataset("file.parquet", format="parquet")
    >>> dataset.to_table()
    pyarrow.Table
    year: int64
    n_legs: int64
    animal: string
    ----
    year: [[2020,2022,2021,2022,2019,2021]]
    n_legs: [[2,2,4,4,5,100]]
    animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]

    Opening a single file with an explicit schema:

    >>> myschema = pa.schema([
    ...     ('n_legs', pa.int64()),
    ...     ('animal', pa.string())])
    >>> dataset = ds.dataset("file.parquet", schema=myschema,
    ...                      format="parquet")
    >>> dataset.to_table()
    pyarrow.Table
    n_legs: int64
    animal: string
    ----
    n_legs: [[2,2,4,4,5,100]]
    animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]

    Opening a dataset for a single directory:

    >>> ds.write_dataset(table, "partitioned_dataset", format="parquet",
    ...                  partitioning=['year'])
    >>> dataset = ds.dataset("partitioned_dataset", format="parquet")
    >>> dataset.to_table()
    pyarrow.Table
    n_legs: int64
    animal: string
    ----
    n_legs: [[5],[2],[4,100],[2,4]]
    animal: [["Brittle stars"],["Flamingo"],...["Parrot","Horse"]]

    For a single directory from an S3 bucket:

    >>> ds.dataset("s3://mybucket/nyc-taxi/",
    ...            format="parquet")  # doctest: +SKIP

    Opening a dataset from a list of relative local paths:

    >>> dataset = ds.dataset([
    ...     "partitioned_dataset/2019/part-0.parquet",
    ...     "partitioned_dataset/2020/part-0.parquet",
    ...     "partitioned_dataset/2021/part-0.parquet",
    ... ], format='parquet')
    >>> dataset.to_table()
    pyarrow.Table
    n_legs: int64
    animal: string
    ----
    n_legs: [[5],[2],[4,100]]
    animal: [["Brittle stars"],["Flamingo"],["Dog","Centipede"]]

    With filesystem provided:

    >>> paths = [
    ...     'part0/data.parquet',
    ...     'part1/data.parquet',
    ...     'part3/data.parquet',
    ... ]
    >>> ds.dataset(paths, filesystem='file:///directory/prefix',
    ...            format='parquet')  # doctest: +SKIP

    Which is equivalent to:

    >>> fs = SubTreeFileSystem("/directory/prefix",
    ...                        LocalFileSystem())  # doctest: +SKIP
    >>> ds.dataset(paths, filesystem=fs, format='parquet')  # doctest: +SKIP

    With a remote filesystem URI:

    >>> paths = [
    ...     'nested/directory/part0/data.parquet',
    ...     'nested/directory/part1/data.parquet',
    ...     'nested/directory/part3/data.parquet',
    ... ]
    >>> ds.dataset(paths, filesystem='s3://bucket/',
    ...            format='parquet')  # doctest: +SKIP

    Similarly to the local example, the directory prefix may be included in
    the filesystem URI:

    >>> ds.dataset(paths, filesystem='s3://bucket/nested/directory',
    ...            format='parquet')  # doctest: +SKIP

    Construction of a nested dataset:

    >>> ds.dataset([
    ...     dataset("s3://old-taxi-data", format="parquet"),
    ...     dataset("local/path/to/data", format="ipc")
    ... ])  # doctest: +SKIP
    """
    from pyarrow.fs import FileInfo

    # collect the keyword arguments for later reuse
    kwargs = dict(
        schema=schema,
        filesystem=filesystem,
        partitioning=partitioning,
        format=format,
        partition_base_dir=partition_base_dir,
        exclude_invalid_files=exclude_invalid_files,
        selector_ignore_prefixes=ignore_prefixes
    )

    if _is_path_like(source):
        return _filesystem_dataset(source, **kwargs)
    elif isinstance(source, (tuple, list)):
        if all(_is_path_like(elem) or isinstance(elem, FileInfo)
               for elem in source):
            return _filesystem_dataset(source, **kwargs)
        elif all(isinstance(elem, Dataset) for elem in source):
            return _union_dataset(source, **kwargs)
        elif all(isinstance(elem, (pa.RecordBatch, pa.Table))
                 for elem in source):
            return _in_memory_dataset(source, **kwargs)
        else:
            unique_types = set(type(elem).__name__ for elem in source)
            type_names = ', '.join('{}'.format(t) for t in unique_types)
            raise TypeError(
                'Expected a list of path-like or dataset objects, or a list '
                'of batches or tables. The given list contains the following '
                'types: {}'.format(type_names)
            )
    elif isinstance(source, (pa.RecordBatch, pa.Table,
                             pa.RecordBatchReader)) or _is_iterable(source):
        return _in_memory_dataset(source, **kwargs)
    else:
        raise TypeError(
            'Expected a path-like, list of path-likes or a list of Datasets '
            'instead of the given type: {}'.format(type(source).__name__)
        )
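# Usage sketch (comment only): in-memory sources are routed to
# _in_memory_dataset() above; a Table is re-scannable, while a
# RecordBatchReader or iterable of batches can only be scanned once.
#
#   import pyarrow as pa
#   import pyarrow.dataset as ds
#   table = pa.table({"x": [1, 2, 3]})
#   dset = ds.dataset(table)     # InMemoryDataset, scannable repeatedly
#   dset.to_table()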
def _ensure_write_partitioning(part, schema, flavor):
    if isinstance(part, PartitioningFactory):
        raise ValueError("A PartitioningFactory cannot be used. "
                         "Did you call the partitioning function "
                         "without supplying a schema?")

    if isinstance(part, Partitioning) and flavor:
        raise ValueError(
            "Providing a partitioning_flavor with "
            "a Partitioning object is not supported"
        )
    elif isinstance(part, (tuple, list)):
        # Names of fields were provided instead of a partitioning object.
        # Create a partitioning factory with those field names.
        part = partitioning(
            schema=pa.schema([schema.field(f) for f in part]),
            flavor=flavor
        )
    elif part is None:
        part = partitioning(pa.schema([]), flavor=flavor)

    if not isinstance(part, Partitioning):
        raise ValueError(
            "partitioning must be a Partitioning object or "
            "a list of column names"
        )

    return part


def write_dataset(data, base_dir, *, basename_template=None, format=None,
                  partitioning=None, partitioning_flavor=None, schema=None,
                  filesystem=None, file_options=None, use_threads=True,
                  max_partitions=None, max_open_files=None,
                  max_rows_per_file=None, min_rows_per_group=None,
                  max_rows_per_group=None, file_visitor=None,
                  existing_data_behavior='error', create_dir=True):
    """
    Write a dataset to a given format and partitioning.

    Parameters
    ----------
    data : Dataset, Table/RecordBatch, RecordBatchReader, list of Table/RecordBatch, or iterable of RecordBatch
        The data to write. If an iterable of batches is given, `schema`
        must also be given.
    base_dir : str
        The root directory where to write the dataset.
    basename_template : str, optional
        A template string used to generate basenames of written data files.
        The token '{i}' will be replaced with an automatically incremented
        integer. If not specified, it defaults to
        "part-{i}." + format.default_extname.
    format : FileFormat or str
        The format in which to write the dataset. When writing a
        FileSystemDataset and `format` is not specified, it defaults to the
        format of that dataset; otherwise this keyword is required.
    partitioning : Partitioning or list[str], optional
        The partitioning scheme specified with the ``partitioning()``
        function or a list of field names.
    partitioning_flavor : str, optional
        One of the flavors supported by ``partitioning()``; only used when
        `partitioning` is a list of field names.
    schema : Schema, optional
    filesystem : FileSystem, optional
    file_options : FileWriteOptions, optional
        FileFormat-specific write options, created with
        ``FileFormat.make_write_options()``.
    use_threads : bool, default True
        Write files in parallel.
    max_partitions : int, default 1024
        Maximum number of partitions any batch may be written into.
    max_open_files : int, default 1024
        Limit on the number of files kept open at once; the least recently
        used file is closed when the limit is exceeded.
    max_rows_per_file : int, default 0 (no limit)
        Maximum number of rows per written file.
    min_rows_per_group : int, default 0
        Batch incoming data and only write row groups once this many rows
        have accumulated.
    max_rows_per_group : int, default 1024 * 1024
        Split large incoming batches into multiple row groups.
    file_visitor : callable, optional
        Called with a WrittenFile (path and metadata attributes) for each
        file created during the call.
    existing_data_behavior : 'error' | 'overwrite_or_ignore' | 'delete_matching'
        How to handle data that already exists in the destination; the
        default ('error') raises if the destination is not empty.
    create_dir : bool, default True
        If False, directories will not be created (useful for filesystems
        that do not require them).
    """
    from pyarrow.fs import _resolve_filesystem_and_path

    if isinstance(data, (list, tuple)):
        schema = schema or data[0].schema
        data = InMemoryDataset(data, schema=schema)
    elif isinstance(data, (pa.RecordBatch, pa.Table)):
        schema = schema or data.schema
        data = InMemoryDataset(data, schema=schema)
    elif isinstance(data, pa.ipc.RecordBatchReader) or _is_iterable(data):
        data = Scanner.from_batches(data, schema=schema)
        schema = None
    elif not isinstance(data, (Dataset, Scanner)):
        raise ValueError(
            "Only Dataset, Scanner, Table/RecordBatch, RecordBatchReader, "
            "a list of Tables/RecordBatches, or an iterable of batches are "
            "supported."
        )

    if format is None and isinstance(data, FileSystemDataset):
        format = data.format
    else:
        format = _ensure_format(format)

    if file_options is None:
        file_options = format.make_write_options()

    if format != file_options.format:
        raise TypeError("Supplied FileWriteOptions have format {}, "
                        "which doesn't match supplied FileFormat {}".format(
                            format, file_options))

    if basename_template is None:
        basename_template = "part-{i}." + format.default_extname

    if max_partitions is None:
        max_partitions = 1024

    if max_open_files is None:
        max_open_files = 1024

    if max_rows_per_file is None:
        max_rows_per_file = 0

    if max_rows_per_group is None:
        max_rows_per_group = 1 << 20

    if min_rows_per_group is None:
        min_rows_per_group = 0

    if isinstance(data, Dataset):
        scanner = data.scanner(use_threads=use_threads)
    else:
        scanner = data

    partitioning = _ensure_write_partitioning(
        partitioning, schema=scanner.projected_schema,
        flavor=partitioning_flavor)

    filesystem, base_dir = _resolve_filesystem_and_path(base_dir, filesystem)

    _filesystemdataset_write(
        scanner, base_dir, basename_template, filesystem, partitioning,
        file_options, max_partitions, file_visitor, existing_data_behavior,
        max_open_files, max_rows_per_file,
        min_rows_per_group, max_rows_per_group, create_dir
    )
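# Usage sketch (comment only): round-tripping a table through
# write_dataset() and dataset() with a Hive-style partition on "year";
# the output directory name "out" is hypothetical.
#
#   import pyarrow as pa
#   import pyarrow.dataset as ds
#   table = pa.table({"year": [2020, 2021], "n": [1, 2]})
#   ds.write_dataset(table, "out", format="parquet",
#                    partitioning=["year"], partitioning_flavor="hive",
#                    existing_data_behavior="overwrite_or_ignore")
#   ds.dataset("out", format="parquet", partitioning="hive").to_table()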