From: Wes McKinney
Date: Tue, 2 Apr 2019 12:40:48 -0500
Subject: Re: Dask and Arrow Parquet Rewrite
To: dev@arrow.apache.org

hi Matt,

Thanks for this summary. It's important for the Dask user base to be well
supported when dealing with Parquet files, since it represents a sizable
fraction of Arrow users at the moment. I also appreciate your effort to
establish cleaner boundaries between the projects to make ongoing
maintenance less painful. We should try to limit the exposure of internal
Parquet serialization details to external consumers.

Would anyone be able to volunteer to help with these items in the next few
weeks? I'm about to head out on vacation and won't be able to look in
detail until the last week of April at the earliest.

Please keep in mind that, per my February discussion document, I intend for
the "Parquet multi-file dataset" logic to be ported to C++ so that it can
be consumed equally well from R, C GLib, Ruby, and MATLAB [1]. So I think
it's OK to continue building certain things in pure Python in
pyarrow/parquet.py, but with the awareness that some of it may have to be
rewritten in C++ in the relatively near future.

- Wes

[1]: https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing

On Tue, Mar 26, 2019 at 7:32 PM Matthew Rocklin wrote:
>
> Hi All,
>
> A few months ago I started a rewrite of how Dask manages Parquet
> readers/writers in an effort to simplify the system. This work is here:
> https://github.com/dask/dask/pull/4336
>
> To summarize, Dask uses Parquet reader libraries like pyarrow.parquet to
> provide scalable reading of Parquet datasets in parallel. This requires
> not only information about how to encode and decode bytes, but also
> about how to select row groups, grab data from S3/GCS/..., apply
> filters, find sorted index columns, and so on, all of which become
> critical in a distributed setting. Previously the relationship between
> the two libraries was somewhat messy, with this logic spread across both
> in a haphazard way.
>
> This PR tries to draw pretty strict lines between the two libraries and
> establish a contract that hopefully we can stick to more easily in the
> future. For more information about that contract, I'd like to point
> people to the GitHub issue.
>
> Things are looking pretty good so far, but there are a few missing
> features in Arrow that would be really nice to have in order to complete
> this rewrite. In particular, two things have come up so far (though I'm
> sure more will arise):
>
> 1. The ability to write a metadata file, given metadata collected from
> writing each row group. https://issues.apache.org/jira/browse/ARROW-1983
> 2. Getting statistics from types like unicode and datetime that may be
> stored differently from how users interpret them.
> https://issues.apache.org/jira/browse/ARROW-4139
>
> My hope is that if we can resolve a few issues like this then we'll be
> able to simplify the relationship between the projects on both sides,
> reduce maintenance burden, and hopefully improve the overall experience
> as well.
>
> Best,
> -matt
>
> (this came up again in
> https://github.com/apache/arrow/pull/3988#issuecomment-476696143)
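
To make the division of labor described above concrete, here is a minimal
sketch (not Dask's actual implementation) of the reading pattern: the
Parquet library decodes individual row groups, while the caller enumerates
row groups and decides how to fan the reads out across tasks. The file name
used is a placeholder.

    import pyarrow.parquet as pq

    def plan_row_group_reads(path):
        # One (path, row_group_index) task per row group; in Dask these
        # tasks would be scheduled across workers.
        metadata = pq.ParquetFile(path).metadata
        return [(path, i) for i in range(metadata.num_row_groups)]

    def read_one_row_group(path, row_group_index, columns=None):
        # Decode a single row group into an Arrow table.
        return pq.ParquetFile(path).read_row_group(row_group_index,
                                                   columns=columns)

    # Example usage:
    # tasks = plan_row_group_reads("example.parquet")
    # tables = [read_one_row_group(p, i, columns=["x"]) for p, i in tasks]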
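
For point 1 above (ARROW-1983), the sketch below shows one possible shape
for the requested feature: collect the footer metadata of each written
piece, then consolidate it into a dataset-level _metadata file. The
metadata_collector keyword and pq.write_metadata helper are assumptions
about how such an API could look; the thread itself does not settle on a
design.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    collector = []
    # Collect the FileMetaData of each piece as it is written.
    pq.write_to_dataset(table, "dataset_root", metadata_collector=collector)

    # Consolidate the collected footers into a single _metadata file that a
    # reader such as Dask could use to plan reads without opening every file.
    pq.write_metadata(table.schema, "dataset_root/_metadata",
                      metadata_collector=collector)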
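
For point 2 above (ARROW-4139), this is roughly where the statistics in
question surface when planning reads; the request is that min/max for
string and timestamp columns come back as logical values rather than the
raw physical encodings stored in the footer. The file name is a
placeholder.

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("part-0.parquet").metadata
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            stats = meta.row_group(rg).column(col).statistics
            if stats is not None and stats.has_min_max:
                # For a sorted index column, these values let a scheduler
                # skip row groups without reading any data pages.
                print(rg, col, stats.min, stats.max, stats.null_count)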