From: Jacob Quinn <karbarcca@gmail.com>
Date: Thu, 22 Oct 2020 16:58:15 -0600
To: user@arrow.apache.org
Subject: Re: Does Arrow Support Larger-than-Memory Handling?

Hi Jacob,

Yes, the arrow format allows for larger-than-memory datasets. I can describe a little what this looks like on the Julia side of things, which should be pretty similar in other languages.

When you write a dataset to the arrow format, either on disk or in memory, you're laying the data + metadata down in a specific, (mostly) self-describing format, with the actual data in particular written in pre-determined binary formats per supported type. Provisions are made even for writing a long table in chunks, and for writing columns with a dictionary encoding (which can "compress" a column with low cardinality).

What this allows when _reading_ is that you can "memory map" the entire region of arrow memory to make it available to a running program: the OS will "give you" the memory without necessarily loading the entire region into RAM at once; instead, it "swaps" requested regions into RAM as necessary. For the Arrow.jl Julia package, reading a table or stream starts with getting access to an arrow memory region: for a file, it calls `Mmap.mmap(file)`, or you can pass a `Vector{UInt8}` directly. `Arrow.Stream` only reads the initial schema metadata, and then you can iterate the stream to get the group of columns for each record batch (chunk). `Arrow.Table` processes all record batches, "chaining" the chunks together using a ChainedVector type for each column. When the columns are "read", they're really just custom types that wrap a specific region of the arrow memory along with the metadata type information and length. This means no new memory (well, practically none) is allocated when creating one of these ArrowVectors or chaining them together, yet they still satisfy the AbstractArray interface, which allows all sorts of operations on the data.

The Arrow.jl package also defines the Tables.jl interface for `Arrow.Table`, which means, for example, you can operate on an arrow-memory-backed DataFrame just by doing `df = DataFrame(Arrow.Table(file))`. This builds the `Arrow.Table` as described above, and then the `DataFrame` constructor uses the memory-mapped columns directly in its construction. You can then use all of the useful functionality of DataFrames directly on these arrow columns. Other Tables.jl-compatible packages are just as accessible: `SQLite.load!(db, "arrow_data", arrow_table)` to load arrow data into an SQLite database, `CSV.write("arrow.csv", arrow_table)` to write arrow data out as a CSV file, `MySQL.load!(db, "arrow_data", arrow_table)` to load data into a MySQL database table, and so on.
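To make that concrete, here is a minimal sketch of the round trip described above. It assumes the Arrow.jl, Tables.jl, DataFrames.jl, and CSV.jl packages are installed; the file name "data.arrow" and the toy columns are illustrative placeholders, not anything from a real dataset:

```julia
using Arrow, Tables, DataFrames, CSV

# Write a table as two record batches (chunks). Tables.partitioner treats
# each element as one partition, and dictencode=true requests dictionary
# encoding for the columns (useful for low-cardinality columns like `label`).
chunk1 = (id = 1:3, label = ["a", "b", "a"])
chunk2 = (id = 4:6, label = ["b", "a", "b"])
Arrow.write("data.arrow", Tables.partitioner([chunk1, chunk2]); dictencode = true)

# Arrow.Table memory-maps the file and chains the record batches per column;
# the columns are lazy views over the mapped bytes, not fresh allocations.
tbl = Arrow.Table("data.arrow")

# Arrow.Stream reads only the schema up front; iterating yields one batch
# (chunk) of columns at a time, so the whole file never has to be in RAM.
for batch in Arrow.Stream("data.arrow")
    @show length(batch.id)   # process one chunk, then move on
end

# Because Arrow.Table implements the Tables.jl interface, other packages
# can consume the memory-mapped columns directly.
df = DataFrame(tbl)          # operate on the arrow columns with DataFrames
CSV.write("data.csv", tbl)   # stream the arrow data out as CSV
```

Each partition becomes its own record batch in the file, which is exactly what lets `Arrow.Stream` hand you one chunk at a time instead of materializing everything at once.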
Sorry for the diatribe, but I've actually been meaning to write a bunch of this down as enhanced documentation for the Arrow.jl package, so consider this a teaser!

Hope that helps!

-Jacob

On Thu, Oct 22, 2020 at 12:39 PM Jacob Zelko <jacobszelko@gmail.com> wrote:

> Hi all,
>
> Very basic question, as I have seen conflicting sources. I come from the
> Julia community and was wondering if Arrow can handle larger-than-memory
> datasets? I saw this post by Wes McKinney here discussing that the tooling
> is being laid down:
>
> Table columns in Arrow C++ can be chunked, so that appending to a table is
> a zero copy operation, requiring no non-trivial computation or memory
> allocation. By designing up front for streaming, chunked tables, appending
> to existing in-memory tables is computationally inexpensive relative to
> pandas now. Designing for chunked or streaming data is also essential for
> implementing out-of-core algorithms, so we are also laying the foundation
> for processing larger-than-memory datasets.
>
> ~ *Apache Arrow and the "10 Things I Hate About pandas"*
>
> And then in the docs I saw this:
>
> The pyarrow.dataset module provides functionality to efficiently work with
> tabular, potentially larger than memory and multi-file datasets:
>
> - A unified interface for different sources: supporting different
>   sources and file formats (Parquet, Feather files) and different file
>   systems (local, cloud).
> - Discovery of sources (crawling directories, handling directory-based
>   partitioned datasets, basic schema normalization, ..)
> - Optimized reading with predicate pushdown (filtering rows),
>   projection (selecting columns), parallel reading, or fine-grained
>   managing of tasks.
>
> Currently, only Parquet and Feather / Arrow IPC files are supported. The
> goal is to expand this in the future to other file formats and data sources
> (e.g. database connections).
>
> ~ *Tabular Datasets*
>
> The article from Wes was from 2017 and the snippet on Tabular Datasets is
> from the current documentation for pyarrow.
>
> Could anyone answer this question or at least clear up my confusion for
> me? Thank you!
> --
> Jacob Zelko
> Georgia Institute of Technology - Biomedical Engineering B.S. '20
> Corning Community College - Engineering Science A.S. '17
> Cell Number: (607) 846-8947