orc-dev mailing list archives

From "周宇睿(闻拙)" <yurui....@alibaba-inc.com>
Subject Re: Re: Arrow Support of Orc
Date Mon, 16 Jul 2018 08:23:53 GMT
Hi All:

Currently, Arrow provides a naive implementation for converting ORC’s ColumnVectorBatch to
Arrow’s RecordBatch, which incurs significant overhead from memory copying and transcoding.
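
To make the overhead concrete, here is a minimal sketch of what such a conversion does for a
single int64 column. The two structs are hypothetical, simplified stand-ins (not the real ORC
or Arrow classes), and convertByCopy is an illustrative name. The sketch assumes the ORC side
keeps one byte per row as a not-null flag and the Arrow side uses a bit-packed validity
bitmap, so the values are copied once and the null flags are re-encoded row by row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an ORC long column batch:
// one int64 per row plus one byte per row marking non-null values.
struct OrcLikeLongBatch {
  std::vector<int64_t> data;
  std::vector<char> notNull;  // 1 = value present, 0 = null
};

// Hypothetical stand-in for an Arrow int64 array:
// a values buffer plus a validity bitmap, one bit per row, least significant bit first.
struct ArrowLikeInt64Array {
  std::vector<int64_t> values;
  std::vector<uint8_t> validity;
  int64_t nullCount = 0;
};

// Converts by copying: this is the kind of work a direct reader could avoid.
ArrowLikeInt64Array convertByCopy(const OrcLikeLongBatch& src) {
  assert(src.data.size() == src.notNull.size());
  const std::size_t n = src.data.size();

  ArrowLikeInt64Array out;

  // Memory copy: every value is duplicated into a second buffer.
  out.values.resize(n);
  if (n > 0) {
    std::memcpy(out.values.data(), src.data.data(), n * sizeof(int64_t));
  }

  // Transcoding: byte-per-row flags are repacked into a bit-per-row bitmap.
  out.validity.assign((n + 7) / 8, 0);
  for (std::size_t i = 0; i < n; ++i) {
    if (src.notNull[i]) {
      out.validity[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    } else {
      ++out.nullCount;
    }
  }
  return out;
}
```

Both steps are linear in the number of rows and are repeated for every column of every batch,
which is why decoding straight into Arrow-shaped buffers looks attractive.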

We would like to add a native API set that lets users read data from an ORC file directly
into Arrow’s RecordBatch. The new API set will be separate from the current ColumnVectorBatch
API, so it won’t raise any backward-compatibility issues.

Creating a new API set is not an elegant solution, and it requires more maintenance effort.
But given Arrow’s current momentum and its benefits for sharing columnar data across
various platforms and data formats, we believe it is worth enabling Arrow support in ORC.

Any advice would be appreciated.

 ------------------Original Mail ------------------
Sender:Xiening Dai <xndai.git@live.com>
Send Date:Fri Jul 6 01:25:34 2018
Recipients:dev@orc.apache.org <dev@orc.apache.org>
Subject:Re: Arrow Support of Orc
I haven’t done profiling. The major overhead I can see is the conversion from ColumnVectorBatch
to Arrow’s RecordBatch, which involves memory copying and some transcoding. Also, the current
adapter only supports reading an entire stripe as a batch, which in a lot of cases is not ideal.
I agree that we should maintain backward compatibility. I am wondering whether we could expose
another set of interfaces for Arrow, built on top of the same ColumnReader/ColumnWriter classes.

> On Jul 5, 2018, at 8:01 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> I think improved Arrow C++ integration would be great. I haven't looked at
> the current state of the work to see what could be better. I'd be against
> making Arrow the default C++ API, but changes to the API to make things
> faster for Arrow make sense. (Although as always, we need to worry about
> backwards compatibility.)
> Have you tried benchmarking and profiling the current adapters to see where
> the bottlenecks are?
> .. Owen
> On Wed, Jul 4, 2018 at 1:41 AM, Xiening Dai <xndai.git@live.com> wrote:
>> Hi all,
>> Not sure if this has been brought up before - do we have a plan to support
>> Apache Arrow? Given its recent popularity and momentum, we might consider
>> supporting the Arrow format in the Orc reader and writer. There’s an adapter
>> for the Orc C++ reader -
>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc
>> but the implementation is inefficient. If we want to integrate better with
>> Arrow, we should avoid conversions between ColumnVectorBatch and the Arrow
>> format.
