arrow-user mailing list archives

From Jacques Nadeau <jacq...@apache.org>
Subject Re: memory mapped record batches in Java
Date Sun, 26 Jul 2020 18:23:28 GMT
> If I memory map a 10G file and randomly address within that file, the OS
> takes care of mapping pages into and out of the process. This memory, while
> it does have some metrics counted against the process, doesn't affect the
> malloc or new operators, and depending on how it is mapped I can share
> those pages with other processes.
>

When I was talking about heap churn I was talking about the overhead of the
objects referencing the mapped memory, not the memory containing the data
itself. ArrowBuf, for example, uses somewhere between 100 and 200 bytes of
heap memory when backed by the Netty buffer, independent of the memory it
points at. This heap memory is used for things like reference counting,
hierarchical limit tracking, etc. Arrow Java always uses off-heap memory for
the data, so no heap churn happens due to the data itself.

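As a rough illustration of that split, here is a minimal sketch, assuming the
Arrow Java 1.0 layout (ArrowBuf in org.apache.arrow.memory, with
arrow-memory-netty or arrow-memory-unsafe on the classpath); the sizes here
are just placeholders:

import org.apache.arrow.memory.ArrowBuf;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class ArrowBufSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
      // The 1 MB of data lives off-heap, owned by the allocator. The ArrowBuf
      // itself is the small on-heap handle that carries the reference count
      // and participates in the allocator's hierarchical limit tracking.
      try (ArrowBuf buf = allocator.buffer(1024 * 1024)) {
        buf.setLong(0, 42L);
        System.out.println("off-heap bytes held: " + allocator.getAllocatedMemory());
        System.out.println("refcount: " + buf.getReferenceManager().getRefCount());
      }
      // Closing the ArrowBuf releases its reference; once the count hits zero
      // the off-heap memory is returned to the allocator.
    }
  }
}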

> So no, in my experience manually managed memory is not faster and it
> usually creates a larger memory footprint overall dependent upon various OS
> settings and general load.
>

:) There's no requirement to agree with my perspective here. I can identify
use cases and algorithms where each approach is better; we just happen to see
drastically more of one type than the other. If your use case is persisting
an Arrow dataset long-term and then doing known-position needle-in-a-haystack
or streaming reads from fast storage within it, you'll benefit greatly from
this pattern.

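For the known-position case, the pattern with plain NIO looks roughly like the
sketch below (independent of Arrow's allocator; the path and offset are made
up):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapPointRead {
  public static void main(String[] args) throws IOException {
    try (FileChannel channel = FileChannel.open(Paths.get("/data/big.arrow"),
                                                StandardOpenOption.READ)) {
      // Map read-only; the OS faults pages in on demand and can share them
      // with other processes mapping the same file. A single MappedByteBuffer
      // is capped at 2 GB, so a 10 GB file needs several mappings.
      long size = Math.min(channel.size(), Integer.MAX_VALUE);
      MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
      // Known-position read: jump straight to an offset (e.g. one taken from
      // the Arrow footer metadata) without streaming the bytes before it.
      int knownOffset = 4096; // placeholder offset
      long value = map.getLong(knownOffset);
      System.out.println("8 bytes at " + knownOffset + ": " + value);
    }
  }
}
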
> Arrow's binary format is what allows in-place loading, and my question
> really was whether anyone else working with Arrow via Java (like the Flink
> team) is interested in developing this pathway.
>

I'm not aware of anyone actively working on that for Java. We'd welcome the
work.

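For context, today's Java read path looks roughly like the sketch below
(Arrow 1.0 API, placeholder path): each batch is read through the channel
into allocator-owned off-heap buffers rather than mapped in place, which is
the part an mmap-based pathway would change.

import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;

public class ReadArrowFileSketch {
  public static void main(String[] args) throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         FileChannel channel = FileChannel.open(Paths.get("/data/big.arrow"),
                                                StandardOpenOption.READ);
         ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
      VectorSchemaRoot root = reader.getVectorSchemaRoot();
      while (reader.loadNextBatch()) {
        // Each batch's buffers are copied into off-heap memory owned by the
        // allocator; nothing here maps the file's pages directly.
        System.out.println("rows in batch: " + root.getRowCount());
      }
    }
  }
}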