Hi Sam, 
Could you elaborate on what advantages you were hoping to get from Arrow?  The process you describe sounds close to optimal (I have limited knowledge of np.memmap), and there may be alternative suggestions depending on the exact shape of your data and how you want to process it.  I added some more comments inline below.

The current solution is to flatten the array, keep a list of the lengths/offsets, store the flattened array in  `np.memmap`, then have each process slice into the memmap at the right index.
It seems that with arrow, we can at least delete the list of lengths/offsets.
In Arrow the natural fit here seems to be a ListArray wrapping the numpy arrays. This would add back the indices/offsets, but Arrow maintains them internally as an offsets buffer rather than as a separate list you have to manage.

padding each entry in the list to a fixed length, and saving pa.Table to pa.NativeFile. Each process reads its own pa.Table. This is slower and less memory efficient than `memmap` by about 15%.
How are you reading back the file?  Are you using MemoryMappedFile [1]?

1) Are there any examples online that do this sort of operation? I can't find how to save chunked array to disk, or a python Flight example after a few googles.
ChunkedArrays aren't a first-class citizen in the Arrow file format specification.  The supported route is to go through Tables, which get converted to RecordBatches when saving.


2) Is it unreasonable to think this will use less memory than np.memmap?
I'm not familiar with np.memmap, so I can't really say.


[1] https://arrow.apache.org/docs/python/generated/pyarrow


On Wed, Feb 17, 2021 at 7:11 PM Sam Shleifer <sshleifer@gmail.com> wrote:
My goal
I have a list of numpy arrays of uneven length. From the docs, I guess the right format for this is ChunkedArray
I want to save my list to disk in one process, and then start many new processes (a pytorch dataloader) that are able to read chunks from the file with low memory overhead.
The current solution is to flatten the array, keep a list of the lengths/offsets, store the flattened array in  `np.memmap`, then have each process slice into the memmap at the right index.
It seems that with arrow, we can at least delete the list of lengths/offsets.

What I have tried:
padding each entry in the list to a fixed length, and saving pa.Table to pa.NativeFile. Each process reads its own pa.Table. This is slower and less memory efficient than `memmap` by about 15%.

My questions:
1) Are there any examples online that do this sort of operation? I can't find how to save chunked array to disk, or a python Flight example after a few googles.
2) Is it unreasonable to think this will use less memory than np.memmap?

Thanks in advance!
Sam