From: Andrew Melo <andrew.melo@gmail.com>
Date: Fri, 24 Jan 2020 11:28:54 +0100
Subject: Re: (java) Producing an in-memory Arrow buffer from a file
To: user@arrow.apache.org, emkornfield@gmail.com

Hi Micah,

On Fri, Jan 24, 2020 at 6:17 AM Micah Kornfield <emkornfield@gmail.com> wrote:

> Hi Andrew,
> It might help to provide a little more detail on where you are starting
> from and what you want to do once you have the data in Arrow format.

Of course! As I mentioned, particle physics data is processed in ROOT,
which is a whole-stack solution -- from file I/O all the way up to plotting
routines. There are a few different groups working on adopting non-physics
tools like Spark or the scientific Python ecosystem to process these data
(so, still reading ROOT files, but doing the higher-level interaction with
different applications). I want to analyze these data with Spark, so I've
implemented a (Java-based) Spark DataSource which reads ROOT files. Some of
my colleagues are experimenting with Kafka and were wondering if the same
code could be reused for them (they would like to put ROOT data into Kafka
topics, as I understand it).

Currently, I parse the ROOT metadata to find where the value/offset buffers
are within the file, then decompress the buffers and store them in an
object hierarchy which I then use to implement the Spark API. I'd like to
replace the intermediate object hierarchy with Arrow because:

1) I could reuse the existing Spark code [1] to do the drudgework of
extracting values from the buffers. That code is ~25% of my codebase.
2) Adapting this code for different Java-based applications becomes quite a
bit easier. For example, Kafka supports Arrow-based sources, so adding
Kafka support would be relatively straightforward.

> If you have the data already available in some sort of off-heap
> datastructure you can potentially avoid copies by wrapping it with the
> existing ArrowBuf machinery [1]. If you have an iterator over the data
> you can also directly build a ListVector [2].

I have the data stored in a hierarchy that is roughly table -> columns ->
row ranges -> ByteBuffer, so I presume ArrowBuf is the right direction.
Since each column's row range is stored and compressed separately, I could
decompress each one directly into an ArrowBuf (?) and then skip having to
iterate over the values; I've tried to sketch what I mean below.

> Depending on your end goal, you might want to stream the values through a
> VectorSchemaRoot instead.

It appears (?) that this option also involves iterating over all the
values.

> There was some documentation written that will be published with the next
> release that gives an overview of the Java libraries [3] that might be
> helpful.

I'll take a look at that, thanks!

Looking at your examples and thinking about it conceptually, is there much
of a difference between constructing a large ByteBuffer (or ArrowBuf) with
the various messages inside it and handing that to Arrow to parse, versus
building the java-object representation myself?
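To make the ArrowBuf direction concrete, here is roughly what I have in
mind for one variable-width column. This is only a sketch: the helper names
like decompressOffsets() are placeholders for my ROOT code, and I'm
assuming the 0.15.1 API, where ArrowBuf still lives in io.netty.buffer.

import io.netty.buffer.ArrowBuf;             // ArrowBuf package as of 0.15.x
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.ipc.message.ArrowFieldNode;

import java.nio.ByteBuffer;
import java.util.Arrays;

BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);

// Decompressed bytes for one column's row range (placeholder helpers)
ByteBuffer offsets = decompressOffsets();    // (rowCount + 1) int32 offsets
ByteBuffer values  = decompressValues();     // the variable-width payload
int rowCount = offsets.remaining() / 4 - 1;

// Copy each decompressed buffer into Arrow-managed memory
ArrowBuf offsetBuf = allocator.buffer(offsets.remaining());
offsetBuf.setBytes(0, offsets);
ArrowBuf valueBuf = allocator.buffer(values.remaining());
valueBuf.setBytes(0, values);

// Our data has no nulls, so build an all-valid bitmap (1 bit per row)
int validityBytes = (rowCount + 7) / 8;
ArrowBuf validityBuf = allocator.buffer(validityBytes);
for (int i = 0; i < validityBytes; i++) {
    validityBuf.setByte(i, 0xFF);
}

// Hand the buffers to the vector without touching individual values;
// variable-width vectors expect [validity, offsets, data]
VarCharVector vector = new VarCharVector("myBranch", allocator);
vector.loadFieldBuffers(new ArrowFieldNode(rowCount, 0),
        Arrays.asList(validityBuf, offsetBuf, valueBuf));

If that's roughly right, nothing ever iterates over the individual values,
which is what I'm hoping for.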
Thanks for your patience,
Andrew

[1] https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java

> Cheers,
> Micah
>
> [1] https://javadoc.io/static/org.apache.arrow/arrow-memory/0.15.1/io/netty/buffer/ArrowBuf.html
> [2] https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
> [3] https://github.com/apache/arrow/tree/master/docs/source/java
>
> On Thu, Jan 23, 2020 at 5:02 AM Andrew Melo <andrew.melo@gmail.com> wrote:
>
>> Hello all,
>>
>> I work in particle physics, which has standardized on the ROOT
>> (http://root.cern) file format to store/process our data. The format
>> itself is quite complicated, but the relevant part here is that after
>> parsing/decompression, we end up with value and offset buffers holding
>> our data.
>>
>> What I'd like to do is represent these data in-memory in the Arrow
>> format. I've written a very rough POC where I manually put an Arrow
>> stream into a ByteBuffer, then replaced the values and offset buffers
>> with the bytes from my files, and I'm wondering what the "proper" way to
>> do this is. From my reading of the code, it appears (?) that what I want
>> to do is produce an org.apache.arrow.vector.types.pojo.Schema object and
>> N ArrowRecordBatch objects, then use MessageSerializer to stick them
>> into a ByteBuffer one after another.
>>
>> Is this correct? Or is there another API I'm missing?
>>
>> Thanks!
>> Andrew
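P.S. In case it clarifies my question about iterating: my understanding of
the VectorSchemaRoot route is something like the sketch below (again
assuming the 0.15.1 API; the column name and values are made up, and
exception handling is elided). The inner set() loop is the per-value
iteration I'd like to avoid:

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;

import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;

try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
     BigIntVector energy = new BigIntVector("energy", allocator)) {

    VectorSchemaRoot root = VectorSchemaRoot.of(energy);
    ByteArrayOutputStream out = new ByteArrayOutputStream();

    try (ArrowStreamWriter writer =
             new ArrowStreamWriter(root, null, Channels.newChannel(out))) {
        writer.start();                            // writes the Schema message

        for (int batch = 0; batch < 2; batch++) {  // one batch per row range
            int n = 4;
            energy.allocateNew(n);
            for (int i = 0; i < n; i++) {
                energy.set(i, 100L * batch + i);   // the per-value iteration
            }
            energy.setValueCount(n);
            root.setRowCount(n);
            writer.writeBatch();                   // writes one ArrowRecordBatch
        }

        writer.end();                              // writes the end-of-stream marker
    }

    // out now holds Schema + batches, readable with ArrowStreamReader
    byte[] arrowStream = out.toByteArray();
}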
Hi Micah,

On Fri, Jan 24, 2020 at 6:17 AM = Micah Kornfield <emkornfield@gm= ail.com> wrote:
Hi Andrew,
It might help to provide a little mo= re detail on where you are starting from and what you want to do once you h= ave the data in arrow format.

O= f course! Like I mentioned, particle physics data is processed in ROOT, whi= ch is a whole-stack solution -- from file I/O all the way up to plotting ro= utines. There are a few different groups working on adopting non-physics to= ols like Spark or the scientific python ecosystem to process these data (so= , still reading ROOT files, but doing the higher level interaction with dif= ferent applications). I want to analyze these data with Spark, so I've = implemented a (java-based) Spark DataSource which reads ROOT files. Some of= my colleagues are experimenting with Kafka and were wondering if the same = code could be re-used for them (they would like to put ROOT data into kafka= topics, as I understand it).

Currently, I parse t= he ROOT metadata to find where the value/offset buffers are within the file= , then decompress the buffers and store them in an object hierarchy which I= then use to implement the Spark API. I'd like to replace the intermedi= ate object hierarchy with Arrow because

1) I could= re-use the existing Spark code[1] to do the trudgework of extracting value= s from the buffers. That code is ~25% of my codebase
2) Adapting = this code for different java-based applications becomes quite a bit easier.= For example, Kafka supports Arrow-based sources, so adding kafka support w= ould be relatively straightforward.
=C2=A0

=C2= =A0If you have the data already available in some sort of off-heap datastru= cture you can potentially avoid copies wrap with the existing ArrowBuf mach= inery [1].=C2=A0 If you have an iterator over the data you can also directl= y build a ListVector [2].

I hav= e the data stored in a heirarchy that is roughly table->columns->row = ranges->ByteBuffer, so I presume ArrowBuf is the right direction. Since = each column's row range is stored and compressed separately, I could de= compress them directly into an ArrowBuf (?) and then skip having to iterate= over the values.
=C2=A0

Depending on your e= nd goal, you might want to stream the values through a VectorSchemaRoot ins= tead.=C2=A0

It appears (?) that= this option also involves iterating over all the values
=C2=A0

There was some documentation written that will be published= with the next release that gives an overview of the Java libraries [3] tha= t might be helpful.


<= div>I'll take a look at that, thanks!

Looking = at your examples and thinking about it conceptually, is there much of a dif= ference between constructing a large ByteBuffer (or ArrowBuf) with the vari= ous messages inside it, and handing that to Arrow to parse or building the = java-object-representation myself?

Thanks for your= patience,
Andrew

=C2=A0
Cheers,
Micah


On Thu, Jan 23, 20= 20 at 5:02 AM Andrew Melo <andrew.melo@gmail.com> wrote:
Hello all,

I work in particle physics, which has standardized on the ROOT (http://root.cern) file forma= t to store/process our data. The format itself is quite complicated, but th= e relevant part here is that after parsing/decompression, we end up with va= lue and offset buffers holding our data.

What I= 9;d like to do is represent these data in-memory in the Arrow format. I'= ;ve written a very rough POC where I manually put an Arrow stream into a By= teBuffer, then replaced the values and offset buffers with the bytes from m= y files., and I'm wondering what's the "proper" way to do= this is. From my reading of the code, it appears (?) that what I want to d= o is produce a org.apache.arrow.vector.types.pojo.Schema object, and N Arro= wRecordBatch objects, then use MessageSerializer to stick them into a ByteB= uffer one after each other.

Is this correct? Or, i= s there another API I'm missing?

Thanks!
=
Andrew
--0000000000002595dd059ce03a55--