From: Micah Kornfield <emkornfield@gmail.com>
Reply-To: emkornfield@gmail.com
Date: Wed, 17 Feb 2021 19:57:24 -0800
Subject: Re: [Python] Saving ChunkedArray to disk and reading with flight
To: user@arrow.apache.org

Hi Sam,

Could you elaborate on what advantages you were hoping to get from Arrow? It seems like the process you describe is probably close to optimal (I have limited knowledge of np.memmap), and there could be alternative suggestions based on the exact shape of your data and how you want to process it. I added some more comments inline below.

> The current solution is to flatten the array, keep a list of the
> lengths/offsets, store the flattened array in `np.memmap`, then have each
> process slice into the memmap at the right index.
> It seems that with arrow, we can at least delete the list of
> lengths/offsets.

In Arrow it seems like the natural fit here is to use a ListArray wrapped around the numpy arrays. This would add back in the indices/offsets.

> padding each entry in the list to a fixed length, and saving pa.Table to
> pa.NativeFile. Each process reads its own pa.Table. This is slower and
> less memory efficient than `memmap` by about 15%.

How are you reading back the file? Are you using MemoryMappedFile [1]?

> 1) Are there any examples online that do this sort of operation? I can't
> find how to save chunked array to disk, or a python Flight example after a
> few googles.

ChunkedArrays aren't a first-class citizen in the Arrow file format specification. Working through Tables, which get converted to RecordBatches when saving, is all that is supported.

> 2) Is it unreasonable to think this will use less memory than np.memmap?

I'm not familiar with np.memmap, so I can't really say.
[1] https://arrow.apache.org/docs/python/generated/pyarrow

On Wed, Feb 17, 2021 at 7:11 PM Sam Shleifer <sshleifer@gmail.com> wrote:

> *My goal*
> I have a list of numpy arrays of uneven length. From the docs, I guess the
> right format for this is ChunkedArray.
> I want to save my list to disk in one process, and then start many new
> processes (a pytorch dataloader) that are able to read chunks from the file
> with low memory overhead.
> The current solution is to flatten the array, keep a list of the
> lengths/offsets, store the flattened array in `np.memmap`, then have each
> process slice into the memmap at the right index.
> It seems that with arrow, we can at least delete the list of
> lengths/offsets.
>
> *What I have tried:*
> padding each entry in the list to a fixed length, and saving pa.Table to
> pa.NativeFile. Each process reads its own pa.Table. This is slower and
> less memory efficient than `memmap` by about 15%.
>
> *My questions:*
> 1) Are there any examples online that do this sort of operation? I can't
> find how to save chunked array to disk, or a python Flight example after a
> few googles.
> 2) Is it unreasonable to think this will use less memory than np.memmap?
>
> Thanks in advance!
> Sam
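For comparison, the flatten-plus-offsets baseline Sam describes can be sketched as follows. This is a minimal illustration under assumed names (`flat.dat`, float32 data), not his actual code; it just makes the manual bookkeeping concrete that a ListArray would absorb.

```python
import numpy as np

# A list of unevenly sized numpy arrays.
arrays = [np.arange(3, dtype=np.float32), np.arange(5, dtype=np.float32)]

# Flatten and record the offsets by hand -- the bookkeeping the
# writer process must persist alongside the data file.
flat = np.concatenate(arrays)
offsets = np.cumsum([0] + [len(a) for a in arrays])
flat.tofile("flat.dat")

# Each worker process memory-maps the flat file and slices by offset.
mm = np.memmap("flat.dat", dtype=np.float32, mode="r", shape=(offsets[-1],))
second = mm[offsets[1]:offsets[2]]  # the second original array
print(second)  # -> [0. 1. 2. 3. 4.]
```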