arrow-user mailing list archives

From Josh Mayer <joshuaama...@gmail.com>
Subject Re: [Python] Filtering _metadata by file path
Date Mon, 08 Feb 2021 14:11:13 GMT
Hi Joris,

The subset method on row groups would work fine for me. I'd be happy to
help expose this in Python if needed.

Regarding the dataset partitioning, that route would also work (and is
separately useful), assuming I can attach manual partitioning information
to a dataset created from a metadata file. I would like to pass something
like the partitions argument to ds.FileSystemDataset.from_paths (
https://arrow.apache.org/docs/python/dataset.html#manual-specification-of-the-dataset),
for each row group (or file path) in the metadata_file, e.g.

dataset = ds.parquet_dataset(
    metadata_file,
    partitions=[ds.field("foo") == 1, ds.field("foo") == 2, ...],
)

Thanks for the help,

Josh

On Mon, Feb 8, 2021 at 6:56 AM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Hi Josh,
>
> As far as I know, the Python bindings for Parquet FileMetaData (and
> constituent parts) don't expose any methods to construct those objects
> (apart from reading it from a file). For example, creating a FileMetaData
> object from a list of RowGroupMetaData is not possible.
>
> So I don't think what you describe is currently possible (apart from
> reading the metadata from the files you want again and appending them, as
> done in the docs you linked to).
>
> Note that if you use pyarrow to read the dataset using the metadata file,
> filtering on the file path can be equivalent to filtering on one of the
> partition columns (depending on what subset you wanted to take). And
> letting the dataset API do this filtering can be quite efficient (it
> will filter the file paths on read), so it might not be necessary to do
> this in advance.
>
> In the C++ layer, there is a "FileMetaData::Subset" method added recently
> (for purposes of the datasets API) which can create a new FileMetaData
> object with a subset of the row groups based on row group index (position
> in the vector of row groups). But this is a) not exposed in Python (but
> could be) and b) doesn't directly allow filtering on file path.
>
> Joris
>
> On Sat, 6 Feb 2021 at 16:58, Josh Mayer <joshuaamayer@gmail.com> wrote:
>
>> After writing a _metadata file as done here
>> https://arrow.apache.org/docs/python/parquet.html?highlight=write_metadata#writing-metadata-and-common-medata-files,
>> I'm wondering if it is possible to read that _metadata file (e.g. using
>> pyarrow.parquet.read_metadata), filter out some paths, and write it back to
>> disk. I can see that file path info is available, e.g.
>>
>> meta = pq.read_metadata(...)
>> meta.row_group(0).column(0).file_path
>>
>> But I cannot figure out how to filter or create a FileMetaData object
>> (since that is what the metadata_collector param of
>>  pyarrow.parquet.write_metadata expects) from either a set of
>> RowGroupMetaData or ColumnChunkMetaData objects. Is this possible? I'm
>> trying to avoid needing to reread the FileMetaData from each file in the
>> dataset directly.
>>
>
