From: Wes McKinney
Date: Tue, 2 Apr 2019 12:40:48 -0500
Subject: Re: Dask and Arrow Parquet Rewrite
To: dev@arrow.apache.org

hi Matt,

Thanks for this summary. It's important for the Dask user base to be well
supported when dealing with Parquet files, since it represents a sizable
fraction of Arrow users at the moment. I also appreciate your effort to
establish cleaner boundaries between the projects to make ongoing
maintenance less painful. We should try to limit the exposure of internal
Parquet serialization details to external consumers.

Would anyone be able to volunteer to help with these items in the next few
weeks? I'm about to head out on vacation and won't be able to look in
detail until the last week of April at the earliest.

Please keep in mind that, per my February discussion document, I intend for
the "Parquet multi-file dataset" logic to be ported to C++ so that it can
be consumed equally well from R, C GLib, Ruby, and MATLAB [1]. So I think
it's OK to continue building certain things in pure Python in
pyarrow/parquet.py, but with the awareness that some of it may have to be
rewritten in C++ in the relatively near future.

- Wes

[1]: https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing

On Tue, Mar 26, 2019 at 7:32 PM Matthew Rocklin wrote:
>
> Hi All,
>
> A few months ago I started a rewrite of how Dask manages Parquet
> readers/writers in an effort to simplify the system. This work is here:
> https://github.com/dask/dask/pull/4336
>
> To summarize, Dask uses Parquet reader libraries like pyarrow.parquet to
> provide scalable reading of Parquet datasets in parallel. This requires
> not only information about how to encode and decode bytes, but also
> about how to select row groups, grab data from S3/GCS/..., apply
> filters, find sorted index columns, and so on, all of which become
> critical in a distributed setting. Previously the relationship between
> the two libraries was somewhat messy, with this logic spread across both
> in a haphazard way.
>
> This PR tries to draw pretty strict lines between the two libraries and
> establish a contract that hopefully we can stick to more easily in the
> future. For more information about that contract, I'd like to point
> people to the GitHub issue.
>
> Things are looking pretty good so far, but there are a few missing
> features in Arrow that would be really nice to have in order to complete
> this rewrite. In particular, two things have come up so far (though I'm
> sure more will arise):
>
> 1. The ability to write a metadata file, given metadata collected from
> writing each row group. https://issues.apache.org/jira/browse/ARROW-1983
> 2. Getting statistics from types like unicode and datetime that may be
> stored differently from how users interpret them.
> https://issues.apache.org/jira/browse/ARROW-4139
>
> My hope is that if we can resolve a few issues like this then we'll be
> able to simplify the relationship between the projects on both sides,
> reduce maintenance burden, and hopefully improve the overall experience
> as well.
>
> Best,
> -matt
>
> (this came up again in
> https://github.com/apache/arrow/pull/3988#issuecomment-476696143)
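
To make the division of labor described above concrete, here is a minimal
sketch (not Dask's actual implementation) of the reading pattern: the
Parquet library decodes individual row groups, while the caller enumerates
row groups and decides how to fan the reads out across tasks. The file name
used is a placeholder.

    import pyarrow.parquet as pq

    def plan_row_group_reads(path):
        # One (path, row_group_index) task per row group; in Dask these
        # tasks would be scheduled across workers.
        metadata = pq.ParquetFile(path).metadata
        return [(path, i) for i in range(metadata.num_row_groups)]

    def read_one_row_group(path, row_group_index, columns=None):
        # Decode a single row group into an Arrow table.
        return pq.ParquetFile(path).read_row_group(row_group_index,
                                                   columns=columns)

    # Example usage:
    # tasks = plan_row_group_reads("example.parquet")
    # tables = [read_one_row_group(p, i, columns=["x"]) for p, i in tasks]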
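
For point 1 above (ARROW-1983), the sketch below shows one possible shape
for the requested feature: collect the footer metadata of each written
piece, then consolidate it into a dataset-level _metadata file. The
metadata_collector keyword and pq.write_metadata helper are assumptions
about how such an API could look; the thread itself does not settle on a
design.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    collector = []
    # Collect the FileMetaData of each piece as it is written.
    pq.write_to_dataset(table, "dataset_root", metadata_collector=collector)

    # Consolidate the collected footers into a single _metadata file that a
    # reader such as Dask could use to plan reads without opening every file.
    pq.write_metadata(table.schema, "dataset_root/_metadata",
                      metadata_collector=collector)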
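
For point 2 above (ARROW-4139), this is roughly where the statistics in
question surface when planning reads; the request is that min/max for
string and timestamp columns come back as logical values rather than the
raw physical encodings stored in the footer. The file name is a
placeholder.

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("part-0.parquet").metadata
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            stats = meta.row_group(rg).column(col).statistics
            if stats is not None and stats.has_min_max:
                # For a sorted index column, these values let a scheduler
                # skip row groups without reading any data pages.
                print(rg, col, stats.min, stats.max, stats.null_count)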