From: Micah Kornfield <emkornfield@gmail.com>
Reply-To: emkornfield@gmail.com
Date: Wed, 17 Feb 2021 19:57:24 -0800
Subject: Re: [Python] Saving ChunkedArray to disk and reading with flight
To: user@arrow.apache.org

Hi Sam,

Could you elaborate on what advantages you were hoping to get from Arrow? It seems like the process you describe is probably close to optimal (I have limited knowledge of np.memmap), and there could be alternative suggestions based on the exact shape of your data and how you want to process it. I added some more comments inline below.

> The current solution is to flatten the array, keep a list of the
> lengths/offsets, store the flattened array in `np.memmap`, then have each
> process slice into the memmap at the right index.
> It seems that with arrow, we can at least delete the list of
> lengths/offsets.

In Arrow it seems like the natural fit here is to use a ListArray wrapped around the numpy arrays. This would add back in the indices/offsets.

> padding each entry in the list to a fixed length, and saving pa.Table to
> pa.NativeFile. Each process reads its own pa.Table. This is slower and
> less memory efficient than `memmap` by about 15%.

How are you reading back the file? Are you using MemoryMappedFile [1]?

> 1) Are there any examples online that do this sort of operation? I can't
> find how to save chunked array to disk, or a python Flight example after a
> few googles.

ChunkedArrays aren't a first-class citizen in the Arrow file format specification. Working through Tables, which get converted to RecordBatches when saving, is all that is supported.

> 2) Is it unreasonable to think this will use less memory than np.memmap?

I'm not familiar with np.memmap, so I can't really say.
[1] https://arrow.apache.org/docs/python/generated/pyarrow

On Wed, Feb 17, 2021 at 7:11 PM Sam Shleifer <sshleifer@gmail.com> wrote:

> *My goal*
> I have a list of numpy arrays of uneven length. From the docs, I guess the
> right format for this is ChunkedArray.
> I want to save my list to disk in one process, and then start many new
> processes (a pytorch dataloader) that are able to read chunks from the file
> with low memory overhead.
> The current solution is to flatten the array, keep a list of the
> lengths/offsets, store the flattened array in `np.memmap`, then have each
> process slice into the memmap at the right index.
> It seems that with arrow, we can at least delete the list of
> lengths/offsets.
>
> *What I have tried:*
> padding each entry in the list to a fixed length, and saving pa.Table to
> pa.NativeFile. Each process reads its own pa.Table. This is slower and
> less memory efficient than `memmap` by about 15%.
>
> *My questions:*
> 1) Are there any examples online that do this sort of operation? I can't
> find how to save chunked array to disk, or a python Flight example after a
> few googles.
> 2) Is it unreasonable to think this will use less memory than np.memmap?
>
> Thanks in advance!
> Sam
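For comparison, the flatten-plus-offsets baseline Sam describes can be sketched as follows. This is a minimal illustration under assumed names (`flat.dat`, float32 data), not his actual code; it just makes the manual bookkeeping concrete that a ListArray would absorb.

```python
import numpy as np

# A list of unevenly sized numpy arrays.
arrays = [np.arange(3, dtype=np.float32), np.arange(5, dtype=np.float32)]

# Flatten and record the offsets by hand -- the bookkeeping the
# writer process must persist alongside the data file.
flat = np.concatenate(arrays)
offsets = np.cumsum([0] + [len(a) for a in arrays])
flat.tofile("flat.dat")

# Each worker process memory-maps the flat file and slices by offset.
mm = np.memmap("flat.dat", dtype=np.float32, mode="r", shape=(offsets[-1],))
second = mm[offsets[1]:offsets[2]]  # the second original array
print(second)  # -> [0. 1. 2. 3. 4.]
```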