From: Josh Mayer <joshuaamayer@gmail.com>
Date: Mon, 8 Feb 2021 09:11:13 -0500
Subject: Re: [Python] Filtering _metadata by file path
To: user@arrow.apache.org

Hi Joris,

The Subset method on row groups would work fine for me. I'd be happy to help expose this in Python if needed.

Regarding the dataset partitioning, that route would also work (and is separately useful), assuming I can attach manual partitioning information to a dataset created from a metadata file. I would like to pass something like the partitions argument of ds.FileSystemDataset.from_paths (https://arrow.apache.org/docs/python/dataset.html#manual-specification-of-the-dataset), one expression per row group (or file path) in the metadata file, e.g.

dataset = ds.parquet_dataset(metadata_file, partitions=[ds.field("foo") == 1, ds.field("foo") == 2, ...])

Thanks for the help,
Josh

On Mon, Feb 8, 2021 at 6:56 AM Joris Van den Bossche <jorisvandenbossche@gmail.com> wrote:

> Hi Josh,
>
> As far as I know, the Python bindings for Parquet FileMetaData (and its
> constituent parts) don't expose any methods to construct those objects
> (apart from reading them from a file). For example, creating a FileMetaData
> object from a list of RowGroupMetaData objects is not possible.
>
> So I don't think what you describe is currently possible, apart from
> re-reading the metadata from the files you want to keep and appending it,
> as done in the docs you linked to.
>
> Note that if you use pyarrow to read the dataset using the metadata file,
> filtering on the file path can be equivalent to filtering on one of the
> partition columns (depending on which subset you want to take). And letting
> the dataset API do this filtering can be quite efficient (it will filter
> the file paths on read), so it may not be necessary to do this in advance.
> In the C++ layer, there is a "FileMetaData::Subset" method, added recently
> for the purposes of the datasets API, which can create a new FileMetaData
> object with a subset of the row groups, selected by row group index
> (position in the vector of row groups). But this is a) not exposed in
> Python (though it could be) and b) doesn't directly allow filtering on
> file path.
>
> Joris
>
> On Sat, 6 Feb 2021 at 16:58, Josh Mayer wrote:
>
>> After writing a _metadata file as done here
>> https://arrow.apache.org/docs/python/parquet.html?highlight=write_metadata#writing-metadata-and-common-medata-files,
>> I'm wondering if it is possible to read that _metadata file (e.g. using
>> pyarrow.parquet.read_metadata), filter out some paths, and write it back
>> to disk. I can see that the file path info is available, e.g.
>>
>> meta = pq.read_metadata(...)
>> meta.row_group(0).column(0).file_path
>>
>> But I cannot figure out how to filter, or how to create a FileMetaData
>> object (since that is what the metadata_collector param of
>> pyarrow.parquet.write_metadata expects) from a set of RowGroupMetaData
>> or ColumnChunkMetaData objects. Is this possible? I'm trying to avoid
>> having to re-read the FileMetaData from each file in the dataset.