From Chris Nuernberger <ch...@techascent.com>
Subject memory mapped record batches in Java
Date Sat, 25 Jul 2020 12:45:44 GMT
Hey, I am the author of a Clojure dataframe library, tech.ml.dataset
<https://github.com/techascent/tech.ml.dataset>, and we are looking to
upgrade our ability to handle larger-than-memory datasets.

I was hoping to use Arrow for this purpose: specifically, a conversion
mechanism where I could stream data into a single Arrow file as multiple
record batches, and then load that file and mmap each record batch.
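
For concreteness, here is a minimal sketch of the writing side I have in
mind, using the stock ArrowFileWriter (the file name, schema, batch
sizes, and class name are just placeholders):

    import java.io.FileOutputStream;
    import java.util.Arrays;

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowFileWriter;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.Schema;

    public class MultiBatchWrite {
      public static void main(String[] args) throws Exception {
        Schema schema = new Schema(Arrays.asList(
            Field.nullable("x", new ArrowType.Int(32, true))));
        try (BufferAllocator allocator = new RootAllocator();
             VectorSchemaRoot root =
                 VectorSchemaRoot.create(schema, allocator);
             FileOutputStream out = new FileOutputStream("batches.arrow");
             ArrowFileWriter writer =
                 new ArrowFileWriter(root, null, out.getChannel())) {
          writer.start();
          IntVector x = (IntVector) root.getVector("x");
          for (int batch = 0; batch < 4; batch++) {
            // Refill the same root for each batch, then flush it to the
            // file as an independent record batch.
            x.allocateNew(1000);
            for (int i = 0; i < 1000; i++) {
              x.setSafe(i, batch * 1000 + i);
            }
            root.setRowCount(1000);
            writer.writeBatch();
          }
          writer.end();
        }
      }
    }

Each writeBatch() call appends a self-contained record batch, and the
file footer records where each batch lives, which is what makes the
per-batch mmap idea below seem feasible.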

The current loading mechanism appears ill-suited to this use case: it
assumes batch-at-a-time loading, mutating member variables of the schema
root and the file reader as it goes, and it copies each batch into
process memory.
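
Roughly, the pattern I mean is (class name mine):

    import java.io.FileInputStream;

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowFileReader;

    public class BatchAtATimeRead {
      public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator();
             FileInputStream in = new FileInputStream("batches.arrow");
             ArrowFileReader reader =
                 new ArrowFileReader(in.getChannel(), allocator)) {
          VectorSchemaRoot root = reader.getVectorSchemaRoot();
          while (reader.loadNextBatch()) {
            // Each call overwrites the vectors in `root` with the next
            // batch's data, copied into allocator-owned memory.
            System.out.println("rows: " + root.getRowCount());
          }
        }
      }
    }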

It seems to me that, assuming each batch is less than 2 GB,
FileChannel.map could be used to map each record batch. That would allow
the data in those batches to be accessed in random-access order, as
opposed to a single in-order traversal, and it may allow
larger-than-memory files to be operated on.
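
Something along these lines is what I am imagining; this helper is
hypothetical, and the per-batch offset and length would come from the
record batch blocks listed in the file footer:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class BatchMapper {
      // Map one record batch's bytes without copying them into process
      // memory. Offset and length are the caller's, taken from the
      // footer metadata.
      public static MappedByteBuffer mapBatch(Path file, long offset,
                                              long length)
          throws IOException {
        try (FileChannel ch =
                 FileChannel.open(file, StandardOpenOption.READ)) {
          // A single mapping is capped at Integer.MAX_VALUE bytes
          // (~2 GB), hence the per-batch size assumption above. The
          // mapping remains valid after the channel is closed.
          return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
        }
      }
    }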

Is there any interest in this pathway? It seems like Arrow is quite close
to realizing this possibility, and it already appears possible from
nearly all of the other language implementations, but the current Java
design, unless I am misreading the code, precludes it.

Thanks for any thoughts or feedback,

Chris
