From: Andrew Melo <andrew.melo@gmail.com>
Date: Fri, 24 Jan 2020 11:28:54 +0100
Subject: Re: (java) Producing an in-memory Arrow buffer from a file
To: user@arrow.apache.org, emkornfield@gmail.com

Hi Micah,

On Fri, Jan 24, 2020 at 6:17 AM Micah Kornfield <emkornfield@gmail.com> wrote:

> Hi Andrew,
> It might help to provide a little more detail on where you are starting
> from and what you want to do once you have the data in Arrow format.

Of course! As I mentioned, particle physics data is processed in ROOT,
which is a whole-stack solution -- from file I/O all the way up to plotting
routines. There are a few different groups working on adopting non-physics
tools like Spark or the scientific Python ecosystem to process these data
(so, still reading ROOT files, but doing the higher-level interaction with
different applications). I want to analyze these data with Spark, so I've
implemented a (Java-based) Spark DataSource which reads ROOT files. Some of
my colleagues are experimenting with Kafka and were wondering if the same
code could be reused for them (they would like to put ROOT data into Kafka
topics, as I understand it).

Currently, I parse the ROOT metadata to find where the value/offset buffers
are within the file, then decompress the buffers and store them in an
object hierarchy which I then use to implement the Spark API. I'd like to
replace the intermediate object hierarchy with Arrow because:

1) I could reuse the existing Spark code [1] to do the drudgework of
extracting values from the buffers. That code is ~25% of my codebase.
2) Adapting this code for different Java-based applications becomes quite a
bit easier. For example, Kafka supports Arrow-based sources, so adding
Kafka support would be relatively straightforward.

> If you have the data already available in some sort of off-heap
> datastructure you can potentially avoid copies by wrapping it with the
> existing ArrowBuf machinery [1]. If you have an iterator over the data
> you can also directly build a ListVector [2].

I have the data stored in a hierarchy that is roughly table -> columns ->
row ranges -> ByteBuffer, so I presume ArrowBuf is the right direction.
Since each column's row range is stored and compressed separately, I could
decompress each one directly into an ArrowBuf (?) and then skip having to
iterate over the values; I've tried to sketch what I mean below.

> Depending on your end goal, you might want to stream the values through a
> VectorSchemaRoot instead.

It appears (?) that this option also involves iterating over all the
values.

> There was some documentation written that will be published with the next
> release that gives an overview of the Java libraries [3] that might be
> helpful.

I'll take a look at that, thanks!

Looking at your examples and thinking about it conceptually, is there much
of a difference between constructing a large ByteBuffer (or ArrowBuf) with
the various messages inside it and handing that to Arrow to parse, versus
building the java-object representation myself?
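To make the ArrowBuf direction concrete, here is roughly what I have in
mind for one variable-width column. This is only a sketch: the helper names
like decompressOffsets() are placeholders for my ROOT code, and I'm
assuming the 0.15.1 API, where ArrowBuf still lives in io.netty.buffer.

import io.netty.buffer.ArrowBuf;             // ArrowBuf package as of 0.15.x
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.ipc.message.ArrowFieldNode;

import java.nio.ByteBuffer;
import java.util.Arrays;

BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);

// Decompressed bytes for one column's row range (placeholder helpers)
ByteBuffer offsets = decompressOffsets();    // (rowCount + 1) int32 offsets
ByteBuffer values  = decompressValues();     // the variable-width payload
int rowCount = offsets.remaining() / 4 - 1;

// Copy each decompressed buffer into Arrow-managed memory
ArrowBuf offsetBuf = allocator.buffer(offsets.remaining());
offsetBuf.setBytes(0, offsets);
ArrowBuf valueBuf = allocator.buffer(values.remaining());
valueBuf.setBytes(0, values);

// Our data has no nulls, so build an all-valid bitmap (1 bit per row)
int validityBytes = (rowCount + 7) / 8;
ArrowBuf validityBuf = allocator.buffer(validityBytes);
for (int i = 0; i < validityBytes; i++) {
    validityBuf.setByte(i, 0xFF);
}

// Hand the buffers to the vector without touching individual values;
// variable-width vectors expect [validity, offsets, data]
VarCharVector vector = new VarCharVector("myBranch", allocator);
vector.loadFieldBuffers(new ArrowFieldNode(rowCount, 0),
        Arrays.asList(validityBuf, offsetBuf, valueBuf));

If that's roughly right, nothing ever iterates over the individual values,
which is what I'm hoping for.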
Thanks for your patience,
Andrew

[1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java

> Cheers,
> Micah
>
> [1] https://javadoc.io/static/org.apache.arrow/arrow-memory/0.15.1/io/netty/buffer/ArrowBuf.html
> [2] https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
> [3] https://github.com/apache/arrow/tree/master/docs/source/java
>
> On Thu, Jan 23, 2020 at 5:02 AM Andrew Melo <andrew.melo@gmail.com> wrote:
>
>> Hello all,
>>
>> I work in particle physics, which has standardized on the ROOT
>> (http://root.cern) file format to store/process our data. The format
>> itself is quite complicated, but the relevant part here is that after
>> parsing/decompression, we end up with value and offset buffers holding
>> our data.
>>
>> What I'd like to do is represent these data in-memory in the Arrow
>> format. I've written a very rough POC where I manually put an Arrow
>> stream into a ByteBuffer, then replaced the values and offset buffers
>> with the bytes from my files, and I'm wondering what the "proper" way to
>> do this is. From my reading of the code, it appears (?) that what I want
>> to do is produce an org.apache.arrow.vector.types.pojo.Schema object and
>> N ArrowRecordBatch objects, then use MessageSerializer to stick them
>> into a ByteBuffer one after another.
>>
>> Is this correct? Or is there another API I'm missing?
>>
>> Thanks!
>> Andrew
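P.S. In case it clarifies my question about iterating: my understanding of
the VectorSchemaRoot route is something like the sketch below (again
assuming the 0.15.1 API; the column name and values are made up, and
exception handling is elided). The inner set() loop is the per-value
iteration I'd like to avoid:

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;

try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
     BigIntVector energy = new BigIntVector("energy", allocator)) {

    VectorSchemaRoot root = VectorSchemaRoot.of(energy);
    ByteArrayOutputStream out = new ByteArrayOutputStream();

    try (ArrowStreamWriter writer =
             new ArrowStreamWriter(root, null, Channels.newChannel(out))) {
        writer.start();                            // writes the Schema message

        for (int batch = 0; batch < 2; batch++) {  // one batch per row range
            int n = 4;
            energy.allocateNew(n);
            for (int i = 0; i < n; i++) {
                energy.set(i, 100L * batch + i);   // the per-value iteration
            }
            energy.setValueCount(n);
            root.setRowCount(n);
            writer.writeBatch();                   // writes one ArrowRecordBatch
        }

        writer.end();                              // writes the end-of-stream marker
    }

    // out now holds Schema + batches, readable with ArrowStreamReader
    byte[] arrowStream = out.toByteArray();
}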
Hi Micah,

On Fri, Jan 24, 2020 at 6:17 AM = Micah Kornfield <emkornfield@gm= ail.com> wrote:
Hi Andrew,
It might help to provide a little mo= re detail on where you are starting from and what you want to do once you h= ave the data in arrow format.

O= f course! Like I mentioned, particle physics data is processed in ROOT, whi= ch is a whole-stack solution -- from file I/O all the way up to plotting ro= utines. There are a few different groups working on adopting non-physics to= ols like Spark or the scientific python ecosystem to process these data (so= , still reading ROOT files, but doing the higher level interaction with dif= ferent applications). I want to analyze these data with Spark, so I've = implemented a (java-based) Spark DataSource which reads ROOT files. Some of= my colleagues are experimenting with Kafka and were wondering if the same = code could be re-used for them (they would like to put ROOT data into kafka= topics, as I understand it).

Currently, I parse t= he ROOT metadata to find where the value/offset buffers are within the file= , then decompress the buffers and store them in an object hierarchy which I= then use to implement the Spark API. I'd like to replace the intermedi= ate object hierarchy with Arrow because

1) I could= re-use the existing Spark code[1] to do the trudgework of extracting value= s from the buffers. That code is ~25% of my codebase
2) Adapting = this code for different java-based applications becomes quite a bit easier.= For example, Kafka supports Arrow-based sources, so adding kafka support w= ould be relatively straightforward.
=C2=A0

=C2= =A0If you have the data already available in some sort of off-heap datastru= cture you can potentially avoid copies wrap with the existing ArrowBuf mach= inery [1].=C2=A0 If you have an iterator over the data you can also directl= y build a ListVector [2].

I hav= e the data stored in a heirarchy that is roughly table->columns->row = ranges->ByteBuffer, so I presume ArrowBuf is the right direction. Since = each column's row range is stored and compressed separately, I could de= compress them directly into an ArrowBuf (?) and then skip having to iterate= over the values.
=C2=A0

Depending on your e= nd goal, you might want to stream the values through a VectorSchemaRoot ins= tead.=C2=A0

It appears (?) that= this option also involves iterating over all the values
=C2=A0

There was some documentation written that will be published= with the next release that gives an overview of the Java libraries [3] tha= t might be helpful.


<= div>I'll take a look at that, thanks!

Looking = at your examples and thinking about it conceptually, is there much of a dif= ference between constructing a large ByteBuffer (or ArrowBuf) with the vari= ous messages inside it, and handing that to Arrow to parse or building the = java-object-representation myself?

Thanks for your= patience,
Andrew

=C2=A0
Cheers,
Micah


On Thu, Jan 23, 20= 20 at 5:02 AM Andrew Melo <andrew.melo@gmail.com> wrote:
Hello all,

I work in particle physics, which has standardized on the ROOT (http://root.cern) file forma= t to store/process our data. The format itself is quite complicated, but th= e relevant part here is that after parsing/decompression, we end up with va= lue and offset buffers holding our data.

What I= 9;d like to do is represent these data in-memory in the Arrow format. I'= ;ve written a very rough POC where I manually put an Arrow stream into a By= teBuffer, then replaced the values and offset buffers with the bytes from m= y files., and I'm wondering what's the "proper" way to do= this is. From my reading of the code, it appears (?) that what I want to d= o is produce a org.apache.arrow.vector.types.pojo.Schema object, and N Arro= wRecordBatch objects, then use MessageSerializer to stick them into a ByteB= uffer one after each other.

Is this correct? Or, i= s there another API I'm missing?

Thanks!
=
Andrew
--0000000000002595dd059ce03a55--