arrow-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrew Melo <andrew.m...@gmail.com>
Subject Re: (java) Producing an in-memory Arrow buffer from a file
Date Fri, 24 Jan 2020 10:28:54 GMT
Hi Micah,

On Fri, Jan 24, 2020 at 6:17 AM Micah Kornfield <emkornfield@gmail.com>
wrote:

> Hi Andrew,
> It might help to provide a little more detail on where you are starting
> from and what you want to do once you have the data in arrow format.
>

Of course! Like I mentioned, particle physics data is processed in ROOT,
which is a whole-stack solution -- from file I/O all the way up to plotting
routines. There are a few different groups working on adopting non-physics
tools like Spark or the scientific python ecosystem to process these data
(so, still reading ROOT files, but doing the higher level interaction with
different applications). I want to analyze these data with Spark, so I've
implemented a (java-based) Spark DataSource which reads ROOT files. Some of
my colleagues are experimenting with Kafka and were wondering if the same
code could be re-used for them (they would like to put ROOT data into kafka
topics, as I understand it).

Currently, I parse the ROOT metadata to find where the value/offset buffers
are within the file, then decompress the buffers and store them in an
object hierarchy which I then use to implement the Spark API. I'd like to
replace the intermediate object hierarchy with Arrow because

1) I could re-use the existing Spark code[1] to do the trudgework of
extracting values from the buffers. That code is ~25% of my codebase
2) Adapting this code for different java-based applications becomes quite a
bit easier. For example, Kafka supports Arrow-based sources, so adding
kafka support would be relatively straightforward.


>
>  If you have the data already available in some sort of off-heap
> datastructure you can potentially avoid copies wrap with the existing
> ArrowBuf machinery [1].  If you have an iterator over the data you can also
> directly build a ListVector [2].
>

I have the data stored in a heirarchy that is roughly table->columns->row
ranges->ByteBuffer, so I presume ArrowBuf is the right direction. Since
each column's row range is stored and compressed separately, I could
decompress them directly into an ArrowBuf (?) and then skip having to
iterate over the values.


>
> Depending on your end goal, you might want to stream the values through a
> VectorSchemaRoot instead.
>

It appears (?) that this option also involves iterating over all the values


>
> There was some documentation written that will be published with the next
> release that gives an overview of the Java libraries [3] that might be
> helpful.
>
>
I'll take a look at that, thanks!

Looking at your examples and thinking about it conceptually, is there much
of a difference between constructing a large ByteBuffer (or ArrowBuf) with
the various messages inside it, and handing that to Arrow to parse or
building the java-object-representation myself?

Thanks for your patience,
Andrew

[1]
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java


> Cheers,
> Micah
>
> [1]
> https://javadoc.io/static/org.apache.arrow/arrow-memory/0.15.1/io/netty/buffer/ArrowBuf.html
> [2]
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
> [3] https://github.com/apache/arrow/tree/master/docs/source/java
>
> On Thu, Jan 23, 2020 at 5:02 AM Andrew Melo <andrew.melo@gmail.com> wrote:
>
>> Hello all,
>>
>> I work in particle physics, which has standardized on the ROOT (
>> http://root.cern) file format to store/process our data. The format
>> itself is quite complicated, but the relevant part here is that after
>> parsing/decompression, we end up with value and offset buffers holding our
>> data.
>>
>> What I'd like to do is represent these data in-memory in the Arrow
>> format. I've written a very rough POC where I manually put an Arrow stream
>> into a ByteBuffer, then replaced the values and offset buffers with the
>> bytes from my files., and I'm wondering what's the "proper" way to do this
>> is. From my reading of the code, it appears (?) that what I want to do is
>> produce a org.apache.arrow.vector.types.pojo.Schema object, and N
>> ArrowRecordBatch objects, then use MessageSerializer to stick them into a
>> ByteBuffer one after each other.
>>
>> Is this correct? Or, is there another API I'm missing?
>>
>> Thanks!
>> Andrew
>>
>

Mime
View raw message