The distinction between heap and off-heap is confusing to someone who works in both Java and C++, but I understand what you are saying; there is some minimal overhead there. What I keep trying to say is that when you use malloc (or create a new object in the JVM) you are allocating memory that can’t be paged out of the process; it is forced to be ‘on heap’, and whether that is the native heap or the Java heap is independent of that. When you use a memory-mapped file, you can access more ‘heap’, in the sense that I can let the OS page data in and out of memory, which isn’t something you can do otherwise, unless perhaps Netty always uses a file backing store or something like that.

It isn’t like native or ‘off-heap’ memory is just magic memory that disappears and can be ignored; to the OS there is not much distinction between the JVM heap and the native heap, but mmap memory is actually in a different category than both of those.
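
To be concrete about the category I mean, here is a minimal JDK-only sketch of the access pattern I am describing (the path is made up, nothing here is Arrow-specific, and I'm assuming the file fits in a single mapping just to keep the sketch short):

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedScan {
        public static void main(String[] args) throws Exception {
            // Hypothetical data file; assume it fits in a single (< 2 GB) mapping
            // to keep the sketch simple.
            try (FileChannel channel = FileChannel.open(
                    Paths.get("/data/dataset.arrow"), StandardOpenOption.READ)) {
                MappedByteBuffer mapped =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                // Unlike a malloc'd/new'd buffer, these pages are backed by the file:
                // the OS faults them in on first touch and can evict them again under
                // memory pressure without involving the process heap or swap.
                long sum = 0;
                for (int i = 0; i < mapped.limit(); i += 4096) {
                    sum += mapped.get(i);
                }
                System.out.println(sum);
            }
        }
    }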

I can identify use cases and algorithms where each is better.

I bet that in general you are completely wrong and a solid in-place loading mechanism would win in every case where it is available, but I want to investigate some concrete cases before going further; the OS is pretty good at caching pages. What algorithms do you have in mind when you think an in-place loading mechanism would be at all slower, and when you say that you experience drastically more of one type than the other?

I can think of two I would test just as a perf comparison:

  1. Load a file and copy the data into the JVM heap.
  2. Load a file and repeatedly reference random elements within the dataset.

Trying 1 and 2 with different-sized files, timed against the input-stream pathway, the file pathway, and a never-before-seen memory-mapped pathway, would I think clarify things.
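
Roughly, I am picturing a comparison along these lines; this is only a JDK-level sketch with a made-up path and access count and no warm-up, and the real test would go through the Arrow input-stream, file, and memory-mapped pathways instead of raw buffers:

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.Random;

    public class RandomAccessCompare {
        public static void main(String[] args) throws Exception {
            Path path = Paths.get("/data/test-dataset.bin"); // hypothetical test file
            int accesses = 10_000_000;

            // Case 1 pathway: copy everything into process memory up front,
            // then (case 2) reference random elements out of the copy.
            // Assumes the file fits in memory / a single mapping for the sketch.
            byte[] copied = Files.readAllBytes(path);
            Random rng = new Random(42);
            long t0 = System.nanoTime();
            long sumCopy = 0;
            for (int i = 0; i < accesses; i++) {
                sumCopy += copied[rng.nextInt(copied.length)];
            }
            long copyNanos = System.nanoTime() - t0;

            // Memory-mapped pathway: map the file and let the OS page data in
            // on demand while we do the same random references in place.
            rng = new Random(42);
            try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
                MappedByteBuffer mapped =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                long t1 = System.nanoTime();
                long sumMap = 0;
                for (int i = 0; i < accesses; i++) {
                    sumMap += mapped.get(rng.nextInt(mapped.limit()));
                }
                long mapNanos = System.nanoTime() - t1;
                System.out.printf("copy: %d ms, mmap: %d ms (checksums %d / %d)%n",
                        copyNanos / 1_000_000, mapNanos / 1_000_000, sumCopy, sumMap);
            }
        }
    }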

My assertion is that in-place loading would win in all cases against the current design, but of course I have been surprised before. You are correct that the use case I am thinking of is a large dataset against which I then do needle-in-a-haystack lookups or some more-or-less aggressive filter operations as a first step.

That is where my second question about per-batch string tables comes in: I would like to stream data and save it as batches, but to access it randomly each batch needs to be somewhat self-contained. Perhaps I should just save each batch in its own file or something, but I thought, from my reading of the data spec, that the Arrow binary definition supported per-batch string tables. Honestly that part of the spec is a bit unclear to me: how partial or per-batch string tables are supposed to be handled and stored.
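
For what it's worth, the "each batch in its own file" fallback I have in mind looks roughly like this; a minimal sketch against the Arrow Java API as I understand it (file names and data are made up), where each batch gets its own file so it stays self-contained:

    import java.io.FileOutputStream;
    import java.nio.charset.StandardCharsets;

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VarCharVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowFileWriter;

    public class BatchPerFile {
        public static void main(String[] args) throws Exception {
            try (BufferAllocator allocator = new RootAllocator()) {
                VarCharVector names = new VarCharVector("name", allocator);
                try (VectorSchemaRoot root = VectorSchemaRoot.of(names)) {
                    for (int batch = 0; batch < 3; batch++) {
                        names.allocateNew();
                        names.setSafe(0, ("row-from-batch-" + batch).getBytes(StandardCharsets.UTF_8));
                        names.setValueCount(1);
                        root.setRowCount(1);
                        // One file per batch: each file carries its own schema (and, if the
                        // column were dictionary-encoded, its own dictionary batches), so a
                        // reader can open any single batch without the rest of the stream.
                        try (FileOutputStream out = new FileOutputStream("batch-" + batch + ".arrow");
                             ArrowFileWriter writer = new ArrowFileWriter(root, null, out.getChannel())) {
                            writer.start();
                            writer.writeBatch();
                            writer.end();
                        }
                    }
                }
            }
        }
    }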


On Sun, Jul 26, 2020 at 12:23 PM Jacques Nadeau <jacques@apache.org> wrote:

If I memory map a 10G file and randomly address within that file, the OS takes care of mapping pages into and out of the process. This memory, while it does show up in some metrics against the process, doesn't affect the malloc or new operators, and depending on how it is mapped I can share those pages with other processes.

When I was talking about heap churn I was talking about the overhead of objects referencing the mapped memory, not the memory containing the data itself. ArrowBuf, for example, uses somewhere between 100 and 200 bytes of heap memory when using the Netty buffer, independent of the memory it is pointing at. This heap memory is used for things like reference counting, hierarchical limit tracking, etc. Arrow Java always uses off-heap memory for the data, so no heap churn happens due to the data.

So no, in my experience manually managed memory is not faster, and it usually creates a larger memory footprint overall, dependent upon various OS settings and general load.

:) There are no requirements here to agree with my perspective. I can identify use cases and algorithms where each is better. We happen to experience drastically more of one type than the other. If your use case is persisting an Arrow dataset long-term and then doing known-position needle-in-a-haystack or streaming reads from fast storage within it, you'll benefit greatly from this pattern.

Arrow's binary format is what allows in-place loading, and my question really was whether anyone else working with Arrow via Java (like the Flink team) is interested in developing this pathway.

I'm not aware of anyone actively working on that for Java. We'd welcome the work.