From: Jacek Pliszka
Date: Thu, 22 Oct 2020 20:46:47 +0200
Subject: Re: Does Arrow Support Larger-than-Memory Handling?
To: user@arrow.apache.org

I believe it would help if you defined your use case.

I do handle larger-than-memory datasets with pyarrow using dataset.scan, but my use case is very specific: I am repartitioning and doing some light cleaning of large datasets.

BR,

Jacek

On Thu, 22 Oct 2020 at 20:39, Jacob Zelko wrote:
>
> Hi all,
>
> Very basic question, as I have seen conflicting sources. I come from the Julia community and was wondering if Arrow can handle larger-than-memory datasets? I saw this post by Wes McKinney here discussing that the tooling is being laid down:
>
> Table columns in Arrow C++ can be chunked, so that appending to a table is a zero copy operation, requiring no non-trivial computation or memory allocation. By designing up front for streaming, chunked tables, appending to existing in-memory tables is computationally inexpensive relative to pandas now. Designing for chunked or streaming data is also essential for implementing out-of-core algorithms, so we are also laying the foundation for processing larger-than-memory datasets.
>
> ~ Apache Arrow and the “10 Things I Hate About pandas”
>
> And then in the docs I saw this:
>
> The pyarrow.dataset module provides functionality to efficiently work with tabular, potentially larger than memory and multi-file datasets:
>
> A unified interface for different sources: supporting different sources and file formats (Parquet, Feather files) and different file systems (local, cloud).
> Discovery of sources (crawling directories, handle directory-based partitioned datasets, basic schema normalization, ..)
> Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading or fine-grained managing of tasks.
>
> Currently, only Parquet and Feather / Arrow IPC files are supported. The goal is to expand this in the future to other file formats and data sources (e.g. database connections).
>
> ~ Tabular Datasets
>
> The article from Wes was from 2017 and the snippet on Tabular Datasets is from the current documentation for pyarrow.
>
> Could anyone answer this question or at least clear up my confusion for me? Thank you!
>
> --
> Jacob Zelko
> Georgia Institute of Technology - Biomedical Engineering B.S. '20
> Corning Community College - Engineering Science A.S. '17
> Cell Number: (607) 846-8947