arrow-user mailing list archives

From Joris Van den Bossche <jorisvandenboss...@gmail.com>
Subject Re: [Python] Filtering _metadata by file path
Date Mon, 08 Feb 2021 11:55:58 GMT
Hi Josh,

As far as I know, the Python bindings for Parquet FileMetaData (and its
constituent parts) don't expose any methods to construct those objects
(apart from reading them from a file). For example, creating a FileMetaData
object from a list of RowGroupMetaData is not possible.

So I don't think what you describe is currently possible (apart from
reading the metadata from the files you want again and appending them, as
done in the docs you linked to).

Note that if you use pyarrow to read the dataset using the metadata file,
filtering on the file path can be equivalent to filtering on one of the
partition columns (depending on which subset you want to take). Letting
the dataset API do this filtering can be quite efficient (it will filter
the file paths on read), so it might not be necessary to do this in
advance.

In the C++ layer, there is a "FileMetaData::Subset" method added recently
(for purposes of the datasets API) which can create a new FileMetaData
object with a subset of the row groups based on row group index (position
in the vector of row groups). But this a) is not exposed in Python (but
could be) and b) doesn't directly allow filtering on file path.

Joris

On Sat, 6 Feb 2021 at 16:58, Josh Mayer <joshuaamayer@gmail.com> wrote:

> After writing a _metadata file as done here
> https://arrow.apache.org/docs/python/parquet.html?highlight=write_metadata#writing-metadata-and-common-medata-files,
> I'm wondering if it is possible to read that _metadata file (e.g. using
> pyarrow.parquet.read_metadata), filter out some paths, and write it back to
> disk. I can see that file path info is available, e.g.
>
> meta = pq.read_metadata(...)
> meta.row_group(0).column(0).file_path
>
> But I cannot figure out how to filter or create a FileMetaData object
> (since that is what the metadata_collector param of
>  pyarrow.parquet.write_metadata expects) from either a set of
> RowGroupMetaData or ColumnChunkMetaData objects. Is this possible? I'm
> trying to avoid needing to reread the FileMetaData from each file in the
> dataset directly.
>
