I think there might be some cross-talk happening.  Below I've tried to provide pointers to existing code to hopefully frame this conversation more meaningfully.

The core loading into VectorSchemaRoot [1] does not require memory copies.  All of the code should only be doing ref-count accounting on already allocated buffers.  The one exception to this is, if I read the code correctly, there is possibly an unconditional allocation for bitmaps of all null/non-null values [2].  So what Jacques is getting at is a new AllocationManager [3] that can reference Mmap'd regions is a prerequisite for zero copy data (so it can participate in the accounting process).

It is also the case today that today the IPC code does appear to force copies ultimately through MessageSerializer calls (e.g. readMessageBoday [4]).  So some refactoring here might be required for zero-copy through mmap.

In terms of multithreaded reads, I think in addition to the strategy that was alluded for generating a VectorSchemaRoot  per record-batch in file, you could also look into managing one VectorSchemaRoot per unit of parallelism.  There isn't a facility for this within the core arrow projects.  If you do perform benchmarking in this regard, I'd be interested in seeing the results, I think a lot of the performance tuning of the Java implementation wasn't recorded/reported in a single place, and I at least have not seen any numbers on more recent JVMs.

Just a side note, I think all of these questions are good, but should potentially be discussed on dev@ (I'm not sure how many java maintainers look at user@).  I also agree that a contribution along these lines would be welcome.  If it looks like it will be very large changes it might pay to write up a 1-pager discussing any new APIs before too much investment is made in this area.


[1] https://github.com/apache/arrow/blob/63308c5ef53f95621d89a8fb649daf0dc728a1f1/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java#L54
[2] https://github.com/apache/arrow/blob/3f9f566c2278ac4be3e4df25e1c27661be8a9085/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L343
[3] https://github.com/apache/arrow/blob/2092e18752a9c0494799493b12eb1830052217a2/java/memory/memory-core/src/main/java/org/apache/arrow/memory/AllocationManager.java
[4] https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageSerializer.java#L719

On Sun, Jul 26, 2020 at 1:49 PM Jacques Nadeau <jacques@apache.org> wrote:

On Sun, Jul 26, 2020 at 11:57 AM Chris Nuernberger <chris@techascent.com> wrote:

The distinction between heap and off-heap is confusing to someone who works in both java and c++ but I understand what you are saying; there is some minimal overhead there.

In the JVM there is a very clear distinction and this is precisely what I was referring to. Heap memory in context of the JVM is garbage collected and there is the cost to the churn of objects within this garbage collected space. The vector schema root pipelining pattern was built to minimize this heap churn.

What I keep trying to say is that when you use malloc (or create a new object in the JVM) you are allocating memory that can’t be paged out of process;

Sigh. Per my original response: create an allocation manager which works with one or many mmaped Arrow-IPC formatted files.

I bet in general you are completely wrong


What algorithms are you thinking...

large joins and aggregations of a pipelined input.