From: Josh Mayer <joshuaamayer@gmail.com>
Date: Mon, 8 Feb 2021 09:11:13 -0500
Subject: Re: [Python] Filtering _metadata by file path
To: user@arrow.apache.org

Hi Joris,

The Subset method on row groups would work fine for me. I'd be happy to help expose this in Python if needed.

Regarding the dataset partitioning, that route would also work (and is separately useful), assuming I can attach manual partitioning information to a dataset created from a metadata file. I would like to pass something like the partitions argument of ds.FileSystemDataset.from_paths (https://arrow.apache.org/docs/python/dataset.html#manual-specification-of-the-dataset), one expression per row group (or file path) in the metadata file, e.g.

dataset = ds.parquet_dataset(metadata_file, partitions=[ds.field("foo") == 1, ds.field("foo") == 2, ...])

Thanks for the help,
Josh

On Mon, Feb 8, 2021 at 6:56 AM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:

> Hi Josh,
>
> As far as I know, the Python bindings for Parquet FileMetaData (and its
> constituent parts) don't expose any methods to construct those objects
> (apart from reading them from a file). For example, creating a FileMetaData
> object from a list of RowGroupMetaData objects is not possible.
>
> So I don't think what you describe is currently possible, apart from
> re-reading the metadata from the files you want to keep and appending it,
> as done in the docs you linked to.
>
> Note that if you use pyarrow to read the dataset using the metadata file,
> filtering on the file path can be equivalent to filtering on one of the
> partition columns (depending on which subset you want to take). And letting
> the dataset API do this filtering can be quite efficient (it will filter
> the file paths on read), so it may not be necessary to do this in advance.
> In the C++ layer, there is a "FileMetaData::Subset" method, added recently
> for the purposes of the datasets API, which can create a new FileMetaData
> object with a subset of the row groups, selected by row group index
> (position in the vector of row groups). But this is a) not exposed in
> Python (though it could be) and b) doesn't directly allow filtering on
> file path.
>
> Joris
>
> On Sat, 6 Feb 2021 at 16:58, Josh Mayer wrote:
>
>> After writing a _metadata file as done here
>> https://arrow.apache.org/docs/python/parquet.html?highlight=write_metadata#writing-metadata-and-common-medata-files,
>> I'm wondering if it is possible to read that _metadata file (e.g. using
>> pyarrow.parquet.read_metadata), filter out some paths, and write it back
>> to disk. I can see that the file path info is available, e.g.
>>
>> meta = pq.read_metadata(...)
>> meta.row_group(0).column(0).file_path
>>
>> But I cannot figure out how to filter, or how to create a FileMetaData
>> object (since that is what the metadata_collector param of
>> pyarrow.parquet.write_metadata expects) from a set of RowGroupMetaData
>> or ColumnChunkMetaData objects. Is this possible? I'm trying to avoid
>> having to re-read the FileMetaData from each file in the dataset.