arrow-user mailing list archives

From "Sam Shleifer" <sshlei...@gmail.com>
Subject [Python] Saving ChunkedArray to disk and reading with flight
Date Thu, 18 Feb 2021 03:11:31 GMT
*My goal*

I have a list of numpy arrays of uneven length. From the docs, I guess the right format for
this is ChunkedArray.

I want to save my list to disk in one process, and then start many new processes (a pytorch
dataloader) that are able to read chunks from the file with low memory overhead.

The current solution is to flatten the array, keep a list of the lengths/offsets, store the
flattened array in `np.memmap`, then have each process slice into the memmap at the right
index.
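
To make the current setup concrete, here is a minimal sketch of the flatten + offsets + `np.memmap` approach described above. The example data, the file name `flat.bin`, and the helper `get_item` are made up for illustration; the real code differs in the details.

```python
import os
import tempfile

import numpy as np

# Hypothetical example data: three arrays of uneven length.
arrays = [
    np.arange(3, dtype=np.int64),
    np.arange(5, dtype=np.int64),
    np.arange(2, dtype=np.int64),
]

# Flatten, and record where each original array starts and ends.
lengths = np.array([len(a) for a in arrays], dtype=np.int64)
offsets = np.concatenate([[0], np.cumsum(lengths)])
flat = np.concatenate(arrays)

# Writer process: dump the flattened values into a memmap-backed file.
path = os.path.join(tempfile.mkdtemp(), "flat.bin")  # illustrative path
writer = np.memmap(path, dtype=flat.dtype, mode="w+", shape=flat.shape)
writer[:] = flat
writer.flush()
del writer

# Reader process (e.g. a dataloader worker): map the file read-only.
reader = np.memmap(path, dtype=np.int64, mode="r", shape=(int(offsets[-1]),))


def get_item(i):
    """Return the i-th original array as a slice of the memmap."""
    return reader[offsets[i]:offsets[i + 1]]
```

The `offsets` array is the extra bookkeeping that has to be saved and shipped to every worker alongside the file.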

It seems that with arrow, we can at least delete the list of lengths/offsets.

*What I have tried:*

Padding each entry in the list to a fixed length, and saving a pa.Table to a pa.NativeFile. Each
process reads its own pa.Table. This is slower and less memory efficient than `memmap` by
about 15%.

*My questions:*

1) Are there any examples online that do this sort of operation? After a few searches, I can't
find how to save a chunked array to disk, or a Python Flight example.

2) Is it unreasonable to think this will use less memory than np.memmap?

Thanks in advance!

Sam