arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: (java) Producing an in-memory Arrow buffer from a file
Date Sun, 02 Feb 2020 01:18:09 GMT
Hi Andrew,
Sorry for the late reply.


> I have the data stored in a hierarchy that is roughly table->columns->row
> ranges->ByteBuffer, so I presume ArrowBuf is the right direction. Since
> each column's row range is stored and compressed separately, I could
> decompress them directly into an ArrowBuf (?) and then skip having to
> iterate over the values.

Yes, based on the description, this sounds like the right approach.
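
To make that concrete, here is a rough sketch of that path (illustrative
code of my own, not anything from your project; the column name "branch",
the no-nulls assumption, and the assumption that your offsets are already
4-byte little-endian as Arrow expects are all made up): bulk-copy each
decompressed offsets/values ByteBuffer into Arrow-owned memory and attach
the buffers to a vector, with no per-value loop.

import io.netty.buffer.ArrowBuf; // org.apache.arrow.memory.ArrowBuf in releases after 0.15.x
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.ipc.message.ArrowFieldNode;
import java.nio.ByteBuffer;
import java.util.Arrays;

public final class RootToArrow {

  /** Builds a varchar column from already-decompressed offset/value bytes. */
  public static VarCharVector wrapColumn(BufferAllocator allocator,
                                         ByteBuffer offsetBytes,
                                         ByteBuffer valueBytes,
                                         int valueCount) {
    // Bulk-copy the decompressed bytes into Arrow-owned buffers; there is
    // no iteration over individual values.
    ArrowBuf offsets = allocator.buffer(offsetBytes.remaining());
    offsets.setBytes(0, offsetBytes);
    ArrowBuf data = allocator.buffer(valueBytes.remaining());
    data.setBytes(0, valueBytes);

    // Validity bitmap: assuming no nulls, mark every value as set.
    int validityBytes = (valueCount + 7) / 8;
    ArrowBuf validity = allocator.buffer(validityBytes);
    for (int i = 0; i < validityBytes; i++) {
      validity.setByte(i, 0xFF);
    }

    VarCharVector vector = new VarCharVector("branch", allocator);
    // Buffer order for a variable-width vector is validity, offsets, data.
    // (In a real program, release the local buffer references once the
    // vector has retained them.)
    vector.loadFieldBuffers(new ArrowFieldNode(valueCount, 0),
        Arrays.asList(validity, offsets, data));
    return vector;
  }
}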

>> Depending on your end goal, you might want to stream the values through a
>> VectorSchemaRoot instead.
>
> It appears (?) that this option also involves iterating over all the
> values

Yes.
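
For comparison, a minimal sketch of that alternative (the column name "pt"
and the values below are made up): you fill a VectorSchemaRoot one row range
at a time, and that is where the per-value loop lives.

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.Float8Vector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;
import java.util.Collections;

public final class StreamThroughRoot {
  public static void main(String[] args) {
    // One double column; the name and values are placeholders.
    Schema schema = new Schema(Collections.singletonList(
        Field.nullable("pt", new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE))));

    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
      Float8Vector pt = (Float8Vector) root.getVector("pt");

      double[] rowRange = {10.5, 21.3, 7.9}; // stand-in for one decoded row range
      pt.allocateNew(rowRange.length);
      for (int i = 0; i < rowRange.length; i++) { // the per-value loop
        pt.setSafe(i, rowRange[i]);
      }
      root.setRowCount(rowRange.length);
      // ... hand `root` to a consumer (e.g. an ArrowStreamWriter), then
      // refill it with the next row range ...
    }
  }
}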

>
> Looking at your examples and thinking about it conceptually, is there much
> of a difference between constructing a large ByteBuffer (or ArrowBuf) with
> the various messages inside it and handing that to Arrow to parse, or
> building the java-object-representation myself?


IMO (I'm not an expert in the Java library): if you already have separate
ByteBuffers, then constructing the object representation yourself probably
makes sense.
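
If the reason for the big ByteBuffer is that you eventually need a
serialized stream (e.g. to hand to another process), the usual route is to
write the populated VectorSchemaRoot through an ArrowStreamWriter rather
than concatenating messages yourself; otherwise the vectors you build are
already the in-memory representation that things like Spark's
ArrowColumnVector consume. A minimal sketch of the writer path (the helper
method and output channel are illustrative only):

import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;

public final class WriteStream {
  /** Serializes the current contents of `root` as an Arrow IPC stream. */
  static byte[] toIpcStream(VectorSchemaRoot root) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (ArrowStreamWriter writer =
             new ArrowStreamWriter(root, /*dictionaryProvider=*/ null, Channels.newChannel(out))) {
      writer.start();      // writes the schema message
      writer.writeBatch(); // one record batch per call; call again after refilling root
      writer.end();        // writes the end-of-stream marker
    }
    return out.toByteArray();
  }
}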



On Fri, Jan 24, 2020 at 2:29 AM Andrew Melo <andrew.melo@gmail.com> wrote:

> Hi Micah,
>
> On Fri, Jan 24, 2020 at 6:17 AM Micah Kornfield <emkornfield@gmail.com>
> wrote:
>
>> Hi Andrew,
>> It might help to provide a little more detail on where you are starting
>> from and what you want to do once you have the data in arrow format.
>>
>
> Of course! Like I mentioned, particle physics data is processed in ROOT,
> which is a whole-stack solution -- from file I/O all the way up to plotting
> routines. There are a few different groups working on adopting non-physics
> tools like Spark or the scientific python ecosystem to process these data
> (so, still reading ROOT files, but doing the higher level interaction with
> different applications). I want to analyze these data with Spark, so I've
> implemented a (java-based) Spark DataSource which reads ROOT files. Some of
> my colleagues are experimenting with Kafka and were wondering if the same
> code could be re-used for them (they would like to put ROOT data into Kafka
> topics, as I understand it).
>
> Currently, I parse the ROOT metadata to find where the value/offset
> buffers are within the file, then decompress the buffers and store them in
> an object hierarchy which I then use to implement the Spark API. I'd like
> to replace the intermediate object hierarchy with Arrow because
>
> 1) I could re-use the existing Spark code[1] to do the drudge work of
> extracting values from the buffers. That code is ~25% of my codebase.
> 2) Adapting this code for different java-based applications becomes quite
> a bit easier. For example, Kafka supports Arrow-based sources, so adding
> Kafka support would be relatively straightforward.
>
>
>>
>>  If you have the data already available in some sort of off-heap data
>> structure, you can potentially avoid copies by wrapping it with the existing
>> ArrowBuf machinery [1].  If you have an iterator over the data you can also
>> directly build a ListVector [2].
>>
>
> I have the data stored in a hierarchy that is roughly table->columns->row
> ranges->ByteBuffer, so I presume ArrowBuf is the right direction. Since
> each column's row range is stored and compressed separately, I could
> decompress them directly into an ArrowBuf (?) and then skip having to
> iterate over the values.
>
>
>>
>> Depending on your end goal, you might want to stream the values through a
>> VectorSchemaRoot instead.
>>
>
> It appears (?) that this option also involves iterating over all the values
>
>
>>
>> There was some documentation written that will be published with the next
>> release that gives an overview of the Java libraries [3] that might be
>> helpful.
>>
>>
> I'll take a look at that, thanks!
>
> Looking at your examples and thinking about it conceptually, is there much
> of a difference between constructing a large ByteBuffer (or ArrowBuf) with
> the various messages inside it and handing that to Arrow to parse, or
> building the java-object-representation myself?
>
> Thanks for your patience,
> Andrew
>
> [1]
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java
>
>
>> Cheers,
>> Micah
>>
>> [1]
>> https://javadoc.io/static/org.apache.arrow/arrow-memory/0.15.1/io/netty/buffer/ArrowBuf.html
>> [2]
>> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
>> [3] https://github.com/apache/arrow/tree/master/docs/source/java
>>
>> On Thu, Jan 23, 2020 at 5:02 AM Andrew Melo <andrew.melo@gmail.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> I work in particle physics, which has standardized on the ROOT (
>>> http://root.cern) file format to store/process our data. The format
>>> itself is quite complicated, but the relevant part here is that after
>>> parsing/decompression, we end up with value and offset buffers holding our
>>> data.
>>>
>>> What I'd like to do is represent these data in-memory in the Arrow
>>> format. I've written a very rough POC where I manually put an Arrow stream
>>> into a ByteBuffer, then replaced the values and offset buffers with the
>>> bytes from my files, and I'm wondering what the "proper" way to do this
>>> is. From my reading of the code, it appears (?) that what I want to do is
>>> produce an org.apache.arrow.vector.types.pojo.Schema object and N
>>> ArrowRecordBatch objects, then use MessageSerializer to stick them into a
>>> ByteBuffer one after another.
>>>
>>> Is this correct? Or, is there another API I'm missing?
>>>
>>> Thanks!
>>> Andrew
>>>
>>
