arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [Python] Saving ChunkedArray to disk and reading with flight
Date Thu, 18 Feb 2021 14:35:56 GMT
On the "This is slower and less memory efficient than `memmap` by about
15%" -- if you can show us more precisely what code you have written, that
will help us advise you. In principle, if you are using pyarrow.memory_map,
the performance / memory use shouldn't be significantly different.

On Wed, Feb 17, 2021 at 9:57 PM Micah Kornfield <emkornfield@gmail.com>
wrote:

> Hi Sam,
> Could you elaborate on what advantages you were hoping to get from
> Arrow?  It seems like the process you describe is probably close to optimal
> (I have limited knowledge of np.memmap). And there could be alternative
> suggestions based on the exact shape of your data and how you want to
> process it.  I added some more comments inline below.
>
> The current solution is to flatten the array, keep a list of the
>> lengths/offsets, store the flattened array in `np.memmap`, then have each
>> process slice into the memmap at the right index.
>> It seems that with arrow, we can at least delete the list of
>> lengths/offsets.
>
> In Arrow it seems like the natural fit here is to use a ListArray wrapped
> around the numpy arrays. This would add back in the indices/offsets.
>
> padding each entry in the list to a fixed length, and saving pa.Table to
>> pa.NativeFile. Each process reads its own pa.Table. This is slower and
>> less memory efficient than `memmap` by about 15%.
>
> How are you reading back the file?  Are you using MemoryMappedFile [1]?
>
> 1) Are there any examples online that do this sort of operation? I can't
>> find how to save a chunked array to disk, or a Python Flight example after a
>> few googles.
>
> ChunkedArrays aren't a first-class citizen in the Arrow file format
> specification.  What is supported is working through Tables, whose chunks
> get converted to RecordBatches when saving.
>
>
> 2) Is it unreasonable to think this will use less memory than np.memmap?
>
> I'm not familiar with np.memmap, so I can't really say.
>
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow
>
>
>
> On Wed, Feb 17, 2021 at 7:11 PM Sam Shleifer <sshleifer@gmail.com> wrote:
>
>> *My goal*
>> I have a list of numpy arrays of uneven length. From the docs, I guess
>> the right format for this is ChunkedArray.
>> I want to save my list to disk in one process, and then start many new
>> processes (a pytorch dataloader) that are able to read chunks from the file
>> with low memory overhead.
>> The current solution is to flatten the array, keep a list of the
>> lengths/offsets, store the flattened array in `np.memmap`, then have each
>> process slice into the memmap at the right index.
>> It seems that with arrow, we can at least delete the list of
>> lengths/offsets.
>>
>> *What I have tried:*
>> padding each entry in the list to a fixed length, and saving pa.Table to
>> pa.NativeFile. Each process reads its own pa.Table. This is slower and
>> less memory efficient than `memmap` by about 15%.
>>
>> *My questions:*
>> 1) Are there any examples online that do this sort of operation? I can't
>> find how to save a chunked array to disk, or a Python Flight example after a
>> few googles.
>> 2) Is it unreasonable to think this will use less memory than np.memmap?
>>
>> Thanks in advance!
>> Sam
>>
>>
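For reference, the np.memmap baseline Sam describes (flatten, keep offsets, slice the map in each worker) can be sketched as follows; file names and data are invented for illustration:

```python
import os
import tempfile

import numpy as np

# Invented stand-in for the list of uneven-length arrays.
arrays = [np.arange(3, dtype=np.float32), np.arange(5, dtype=np.float32)]

# Side list of offsets that the Arrow ListArray approach would absorb.
offsets = np.cumsum([0] + [len(a) for a in arrays])

path = os.path.join(tempfile.mkdtemp(), "flat.npy")
np.concatenate(arrays).tofile(path)

# In each worker process: map the file and slice by offset; pages are
# loaded lazily, so no full copy of the file is made.
mm = np.memmap(path, dtype=np.float32, mode="r")
second = mm[offsets[1]:offsets[2]]

assert np.array_equal(second, arrays[1])
```

This is the baseline the Arrow approach is being compared against: the offsets array is the piece that a ListArray would carry internally.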
