arrow-user mailing list archives

From Chris Nuernberger <ch...@techascent.com>
Subject Re: memory mapped record batches in Java
Date Sun, 26 Jul 2020 18:56:58 GMT
The distinction between heap and off-heap is confusing to someone who works
in both Java and C++, but I understand what you are saying; there is some
minimal overhead there. What I keep trying to say is that when you use
malloc (or create a new object in the JVM) you are allocating memory that
can't be paged out of the process; it is forced to be 'on heap', and whether
that is the native heap or the Java heap is independent of that. When you
use a memory-mapped file, you can access more 'heap', in the sense that I
can let the OS page data in and out of physical memory, which isn't
something you can do otherwise, unless perhaps Netty always uses a file
backing store or something like that.
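A minimal sketch of the mechanism I mean, using plain NIO (the file name is
hypothetical): the mapped region is file-backed, so the OS can evict its
pages and re-read them from the file on demand, whereas a byte[] or a
malloc'd block is anonymous memory that can only go to swap.

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
  public static void main(String[] args) throws Exception {
    try (FileChannel channel = FileChannel.open(
            Paths.get("data.arrow"), StandardOpenOption.READ)) {
      // A single MappedByteBuffer is capped at 2GB, so a 10G file would
      // need several mappings; one is enough to show the idea.
      long length = Math.min(channel.size(), Integer.MAX_VALUE);
      MappedByteBuffer buf =
          channel.map(FileChannel.MapMode.READ_ONLY, 0, length);
      // No copy into the Java heap happens here. Touching an offset
      // faults its page in; the OS is free to evict it again later.
      byte first = buf.get(0);
    }
  }
}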

It isn't as if native or 'off-heap' memory is magic memory that disappears
and can be ignored; to the OS there is not much distinction between the JVM
heap and the native heap, but mmap'd memory *is* actually in a different
category than both of those.

I can identify use cases and algorithms where each is better.

I bet that in general you are completely wrong and that a solid in-place
loading mechanism would win in every case where it is available, but I want
to investigate with some concrete cases before going further; the OS is
pretty good at caching pages. What algorithms do you have in mind when you
think an in-place loading mechanism would be slower at all, and when you
say that you experience drastically more of one type than the other?

I can think of two tests I would run, just as a perf comparison:

   1. Load a file and copy the data into the JVM heap.
   2. Load a file and repeatedly reference random elements within the
   dataset

Timing 1 and 2 with different-sized files against the input-stream
pathway, the file pathway, and a not-yet-written memory-mapped pathway
would, I think, clarify things; a rough harness is sketched below.
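Something along these lines, written against the current Arrow Java reader
API (file names and the timing scheme are made up); both existing pathways
copy batch data into allocator-managed off-heap memory today, and the
in-place, mmap-backed pathway would be the third contender:

import java.io.FileInputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

public class ReadBench {
  public static void main(String[] args) throws Exception {
    try (RootAllocator allocator = new RootAllocator()) {
      // Input-stream pathway: reads the streaming format front to back.
      long t0 = System.nanoTime();
      try (ArrowStreamReader reader = new ArrowStreamReader(
              new FileInputStream("data.arrows"), allocator)) {
        while (reader.loadNextBatch()) {
          reader.getVectorSchemaRoot().getRowCount();
        }
      }
      long t1 = System.nanoTime();
      // File pathway: seekable channel over the random-access file format.
      try (FileChannel channel = FileChannel.open(
              Paths.get("data.arrow"), StandardOpenOption.READ);
           ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
        while (reader.loadNextBatch()) {
          reader.getVectorSchemaRoot().getRowCount();
        }
      }
      long t2 = System.nanoTime();
      System.out.printf("stream: %d ms, file: %d ms%n",
          (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
  }
}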

My assertion is that in-place loading would win in all cases against the
current design, but of course I have been surprised before. You are correct
that the use case I am thinking of is a large dataset against which, as a
first step, I then do needle-in-a-haystack lookups or some more-or-less
aggressive filter operations.

That is where my second question about per-batch string tables comes in; I
would like to stream data and save it as batches, but to access them
randomly each batch needs to be somewhat self-contained. Perhaps I should
just save each batch in its own file or something, but I thought, from my
reading of the data spec, that the Arrow binary definition supported
per-batch string tables. Honestly that part of the spec is a bit unclear to
me: how partial or per-batch string tables are supposed to be handled and
stored.
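For what it's worth, my current (possibly wrong) reading is that the
'string table' in Arrow terms is a dictionary: a string column is
dictionary-encoded into integer indices, the dictionary travels as its own
message ahead of the record batches that use it, and the streaming format
lets a later dictionary batch replace an earlier one or extend it via the
isDelta flag. A minimal encode-side sketch with Arrow Java, values made up:

import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ValueVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.arrow.vector.dictionary.DictionaryEncoder;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;

public class DictSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         VarCharVector dictVector = new VarCharVector("dict", allocator);
         VarCharVector dataVector = new VarCharVector("data", allocator)) {
      // The 'string table': the distinct values, referenced by index.
      dictVector.allocateNew();
      dictVector.setSafe(0, "foo".getBytes(StandardCharsets.UTF_8));
      dictVector.setSafe(1, "bar".getBytes(StandardCharsets.UTF_8));
      dictVector.setValueCount(2);
      Dictionary dictionary =
          new Dictionary(dictVector, new DictionaryEncoding(1L, false, null));

      dataVector.allocateNew();
      dataVector.setSafe(0, "bar".getBytes(StandardCharsets.UTF_8));
      dataVector.setSafe(1, "foo".getBytes(StandardCharsets.UTF_8));
      dataVector.setValueCount(2);
      // Swap the strings for integer indices into the dictionary; a
      // writer given a DictionaryProvider then emits the dictionary
      // batch before the record batches that reference it.
      try (ValueVector encoded =
               DictionaryEncoder.encode(dataVector, dictionary)) {
        System.out.println(encoded);
      }
    }
  }
}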

On Sun, Jul 26, 2020 at 12:23 PM Jacques Nadeau <jacques@apache.org> wrote:

>
>> If I memory map a 10G file and randomly address within that file, the OS
>> takes care of mapping pages into and out of the process.  This memory,
>> while it does show up in some metrics against the process, doesn't affect
>> the malloc or new operators, and depending on how it is mapped I can share
>> those pages with other processes.
>>
>
> When I was talking about heap churn I was talking about the overhead of
> objects referencing the mapped memory, not the memory containing the data
> itself. ArrowBuf, for example, uses somewhere between 100 and 200 bytes of
> heap memory when using the Netty buffer, independent of the memory it is
> pointing at. This heap memory is used for things like reference counting,
> hierarchical limit tracking, etc. Arrow Java always uses off-heap memory
> for the data so no heap churn happens due to the data.
>
>
>> So no, in my experience manually managed memory is not faster and it
>> usually creates a larger memory footprint overall dependent upon various OS
>> settings and general load.
>>
>
> :) There are no requirements here to agree with my perspective. I can
> identify use cases and algorithms where each is better. We happen to
> experience drastically more of one type than the other. If your use case is
> persisting an Arrow dataset long-term and then doing known-position
> needle-in-a-haystack or streaming reads from fast storage within it, you'll
> benefit greatly from this pattern.
>
>> Arrow's binary format is what allows in-place loading, and my question
>> really was: is anyone else working with Arrow via Java (like the Flink
>> team) interested in developing this pathway?
>>
>
> I'm not aware of anyone actively working on that for Java. We'd welcome
> the work.
>
