arrow-user mailing list archives

From Chris Nuernberger <ch...@techascent.com>
Subject Re: memory mapped record batches in Java
Date Sun, 26 Jul 2020 12:52:27 GMT
Hmm, sounds reasonable enough. I may be mistaken, but it appears to me
that because the current code relies on mutably updating the
VectorSchemaRoot, it precludes concurrent or parallelized access to
multiple record batches. A map-batch method that returns a new
VectorSchemaRoot each time could potentially work.
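
As a rough sketch of that idea (names hypothetical; this uses only
java.nio rather than the Arrow API, and assumes each batch region fits
under FileChannel.map's 2 GB limit), mapping each batch region
independently would let callers read batches in any order without
sharing mutable state:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: map each record-batch region of a file
// independently so batches can be read in random-access order,
// without copying them into process memory. In a real implementation
// the offsets and lengths would come from the Arrow file footer.
public class MmapBatchSketch {

    // Each call returns an independent read-only buffer, so nothing
    // shared is mutated between batches (unlike a single mutable root).
    static MappedByteBuffer mapBatch(FileChannel ch, long offset, long length)
            throws IOException {
        // FileChannel.map is limited to int-sized regions, hence the
        // assumption that each batch is under 2 GB.
        return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an Arrow file: two fixed-size "batches" back to back.
        Path tmp = Files.createTempFile("batches", ".bin");
        Files.write(tmp, "batch-0!batch-1!".getBytes(StandardCharsets.US_ASCII));
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            // Map the second batch before the first: random-access order.
            MappedByteBuffer b1 = mapBatch(ch, 8, 8);
            MappedByteBuffer b0 = mapBatch(ch, 0, 8);
            byte[] out = new byte[8];
            b1.get(out);
            System.out.println(new String(out, StandardCharsets.US_ASCII));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

A map-batch method in this shape would hand back a fresh view per
batch, which is what would make concurrent readers safe.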

Thanks for the links!

On Sat, Jul 25, 2020 at 5:25 PM Jacques Nadeau <jacques@apache.org> wrote:

> The current code doesn't preclude this path, it just doesn't have it
> implemented. In many cases, a more intelligent algorithm can page data into
> or out of main memory more efficiently (albeit with more work). This should
> be fairly straightforward to do. The easiest way to get started would
> probably be to implement a new allocation manager that uses MMap memory as
> backing instead of the current ones (Netty [1] and Unsafe [2]). From there,
> you could then enhance the reading to use that allocator to map the right
> offsets into the existing vectors.
>
> 1:
> https://github.com/apache/arrow/blob/master/java/memory/memory-netty/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java
> 2:
> https://github.com/apache/arrow/blob/master/java/memory/memory-unsafe/src/main/java/org/apache/arrow/memory/UnsafeAllocationManager.java
>
> On Sat, Jul 25, 2020 at 5:46 AM Chris Nuernberger <chris@techascent.com>
> wrote:
>
>> Hey, I am the author of a Clojure dataframe library, tech.ml.dataset
>> <https://github.com/techascent/tech.ml.dataset> and we are looking to
>> upgrade our ability to handle out-of-memory datasets.
>>
>> I was hoping to use Arrow for this purpose specifically to have a
>> conversion mechanism where I could stream data into a single Arrow file
>> with multiple record batches and then load that file and mmap each record
>> batch.
>>
>> The current loading mechanism appears quite poor for this use case: it
>> assumes batch-at-a-time loading, mutates member variables of the root
>> schema and the file-loading mechanism, and copies each batch into
>> process memory.
>>
>> It seems to me that, assuming each batch is less than 2 GB,
>> FileChannel.map could be used for each record batch. This would allow
>> random access to data in those batches, as opposed to a single in-order
>> traversal, and it may allow larger-than-memory files to be operated on.
>>
>> Is there any interest in this pathway? Arrow seems quite close to
>> realizing this possibility, and it appears to be already possible from
>> nearly all the other languages, but the current Java design, unless I
>> am misreading the code, precludes it.
>>
>> Thanks for any thoughts, feedback,
>>
>> Chris
>>
>
