From: Jacques Nadeau
Date: Sat, 25 Jul 2020 16:25:41 -0700
Subject: Re: memory mapped record batches in Java
To: user@arrow.apache.org

The current code doesn't preclude this path; it just doesn't have it
implemented. In many cases, a more intelligent algorithm can page data into
or out of main memory more efficiently than relying on mmap alone (albeit
with more work). This should be fairly straightforward to do. The easiest
way to get started would probably be to implement a new allocation manager
that uses mmap'd memory as its backing instead of the current ones (Netty
[1] and Unsafe [2]). From there, you could enhance the reading code to use
that allocator to map the right offsets into the existing vectors.

1: https://github.com/apache/arrow/blob/master/java/memory/memory-netty/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java
2: https://github.com/apache/arrow/blob/master/java/memory/memory-unsafe/src/main/java/org/apache/arrow/memory/UnsafeAllocationManager.java
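For illustration only, a rough sketch of what such an mmap-backed allocation
manager might look like. The AllocationManager method names/signatures and
the MemoryUtil helper are assumptions modeled on the Unsafe implementation
linked in [2]; verify them against the current code before building on this.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import org.apache.arrow.memory.AllocationManager;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.util.MemoryUtil;

// Sketch only: the overridden methods (getSize, memoryAddress, release0) are
// assumed from the Unsafe allocation manager in [2].
public final class MmapAllocationManager extends AllocationManager {

  private final MappedByteBuffer mapped; // keeps the mapping reachable
  private final long address;            // raw address of the mapped region
  private final long size;

  MmapAllocationManager(BufferAllocator allocator, FileChannel channel,
                        long offset, long size) throws IOException {
    super(allocator);
    // Map the requested slice of the file; each mapping must stay under 2 GB.
    this.mapped = channel.map(FileChannel.MapMode.READ_ONLY, offset, size);
    this.size = size;
    // MemoryUtil.getByteBufferAddress is an assumption; any way of obtaining
    // the direct buffer's address would do here.
    this.address = MemoryUtil.getByteBufferAddress(mapped);
  }

  @Override
  public long getSize() {
    return size;
  }

  @Override
  protected long memoryAddress() {
    return address;
  }

  @Override
  protected void release0() {
    // Unmapping is left to the GC in this sketch; a real implementation
    // would unmap the region explicitly on release.
  }
}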
On Sat, Jul 25, 2020 at 5:46 AM Chris Nuernberger <chris@techascent.com> wrote:

> Hey, I am the author of a Clojure dataframe library, tech.ml.dataset, and
> we are looking to upgrade our ability to handle out-of-memory datasets.
>
> I was hoping to use Arrow for this purpose, specifically to have a
> conversion mechanism where I could stream data into a single Arrow file
> with multiple record batches and then load that file and mmap each record
> batch.
>
> The current loading mechanism appears quite poor for this use case; it
> assumes batch-at-a-time loading by mutating member variables of the root
> schema and file loading mechanism, and it copies each batch into process
> memory.
>
> It seems to me that, assuming each batch is less than 2 GB,
> FileChannel.map could be used for each record batch. This would allow one
> to access data in those batches in random-access order rather than a
> single in-order traversal, and it might allow larger-than-memory files to
> be operated on.
>
> Is there any interest in this pathway? It seems like Arrow is quite close
> to realizing this possibility, or that it is already possible from nearly
> all the other languages, but the current Java design, unless I am
> misreading the code, precludes this pathway.
>
> Thanks for any thoughts and feedback,
>
> Chris
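For concreteness, the per-batch mapping described in the quoted message
boils down to plain FileChannel.map calls. A minimal sketch follows; the
file path and the batch offset/length are hypothetical and would in practice
come from the record batch blocks listed in the Arrow file footer.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MapRecordBatchSketch {
  public static void main(String[] args) throws IOException {
    long batchOffset = 0L;       // hypothetical: start of one record batch
    long batchLength = 1L << 20; // hypothetical: its length, must be < 2 GB

    try (FileChannel channel = FileChannel.open(
        Paths.get("/tmp/data.arrow"), StandardOpenOption.READ)) {
      // Map just this batch; nothing is copied into process memory and the
      // mapped region can be read in any order.
      MappedByteBuffer batch =
          channel.map(FileChannel.MapMode.READ_ONLY, batchOffset, batchLength);
      System.out.println("first byte of mapped batch: " + batch.get(0));
    }
  }
}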