I have a list of NumPy arrays of uneven length. From the docs, I guess the right format for this is `ChunkedArray`.
I want to save my list to disk in one process, and then start many new processes (PyTorch DataLoader workers) that can read chunks from the file with low memory overhead.
The current solution is to flatten the array, keep a list of the lengths/offsets, store the flattened array in `np.memmap`, then have each process slice into the memmap at the right index.
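For concreteness, here is a minimal sketch of that memmap setup. The helper names (`save_ragged`, `load_item`), the file names, and the `float32` dtype are made up for illustration; the real code differs in details.

```python
import os
import tempfile

import numpy as np


def save_ragged(arrays, data_path, offsets_path):
    """Writer process: flatten the arrays and record where each one starts."""
    lengths = np.array([len(a) for a in arrays], dtype=np.int64)
    # offsets[i]:offsets[i + 1] is the slice of the flat buffer holding arrays[i]
    offsets = np.concatenate(([0], np.cumsum(lengths)))
    flat = np.concatenate(arrays)
    mm = np.memmap(data_path, dtype=flat.dtype, mode="w+", shape=flat.shape)
    mm[:] = flat
    mm.flush()
    del mm  # release the writable mapping
    np.save(offsets_path, offsets)
    return flat.dtype


def load_item(data_path, offsets_path, dtype, i):
    """Reader process: map the file read-only and slice out entry i."""
    offsets = np.load(offsets_path)
    mm = np.memmap(data_path, dtype=dtype, mode="r")
    return mm[offsets[i]:offsets[i + 1]]


# Hypothetical example data: three arrays of different lengths.
arrays = [
    np.arange(3, dtype=np.float32),
    np.arange(5, dtype=np.float32),
    np.arange(2, dtype=np.float32),
]
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "flat.bin")
offsets_path = os.path.join(tmp_dir, "offsets.npy")

dtype = save_ragged(arrays, data_path, offsets_path)
second = load_item(data_path, offsets_path, dtype, 1)
```

Each worker only touches the pages backing the slices it reads, which is where the low memory overhead comes from.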
It seems that with Arrow, we could at least drop the separate list of lengths/offsets.
What I have tried:
Padding each entry in the list to a fixed length, and saving a `pa.Table` to a `pa.NativeFile`. Each process reads its own `pa.Table`. This is slower and less memory efficient than `memmap` by about 15%.
1) Are there any examples online that do this sort of operation? I can't find how to save a chunked array to disk, or a Python Flight example, after a few Google searches.
2) Is it unreasonable to think an Arrow-based approach will use less memory than `np.memmap`?
Thanks in advance!