Hi Josh,

As far as I know, the Python bindings for Parquet FileMetaData (and its constituent parts) don't expose any way to construct those objects, apart from reading them from a file. For example, creating a FileMetaData object from a list of RowGroupMetaData objects is not possible.

So I don't think what you describe is currently possible (apart from reading the metadata from the files you want again and appending them, as done in the docs you linked to).

Note that if you use pyarrow to read the dataset through the metadata file, filtering on the file path can be equivalent to filtering on one of the partition columns (depending on which subset you want to take). Letting the dataset API do this filtering can be quite efficient (it filters the file paths on read), so it may not be necessary to do this in advance.

In the C++ layer, a "FileMetaData::Subset" method was added recently (for the purposes of the datasets API). It creates a new FileMetaData object containing a subset of the row groups, selected by row group index (position in the vector of row groups). However, this method a) is not exposed in Python (but could be) and b) doesn't directly allow filtering on file path.


On Sat, 6 Feb 2021 at 16:58, Josh Mayer <joshuaamayer@gmail.com> wrote:
After writing a _metadata file as done here https://arrow.apache.org/docs/python/parquet.html?highlight=write_metadata#writing-metadata-and-common-medata-files, I'm wondering if it is possible to read that _metadata file (e.g. using pyarrow.parquet.read_metadata), filter out some paths, and write it back to disk. I can see that file path info is available, e.g.

meta = pq.read_metadata(...)

But I cannot figure out how to filter or create a FileMetaData object (since that is what the metadata_collector param of  pyarrow.parquet.write_metadata expects) from either a set of RowGroupMetaData or ColumnChunkMetaData objects. Is this possible? I'm trying to avoid needing to reread the FileMetaData from each file in the dataset directly.