orc-dev mailing list archives

From Xiening Dai <xndai....@live.com>
Subject Re: Arrow Support of Orc
Date Thu, 05 Jul 2018 17:25:20 GMT
I haven’t done any profiling. The major overhead I can see is the conversion from ColumnVectorBatch
to Arrow’s RecordBatch, which involves a memory copy and some transcoding. Also, the current
adapter only supports reading an entire stripe as a single batch, which in many cases is not ideal.
I agree that we should maintain backward compatibility. I am wondering whether we could expose another
set of interfaces for Arrow, built on top of the same ColumnReader/ColumnWriter classes.
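
To make the copy and transcoding cost concrete, here is a minimal sketch. It assumes simplified stand-in types (OrcStyleLongBatch, ArrowStyleInt64Column and convert are hypothetical names, not the real ORC or Arrow classes). It relies on one layout difference: ORC's C++ batches carry a byte-per-row notNull array, while Arrow uses a bit-packed validity bitmap, so each batch pays for a value copy plus a pass that repacks the null flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an ORC-style batch of 64-bit integers:
// one byte per row records whether the value is present.
struct OrcStyleLongBatch {
  std::vector<int64_t> data;
  std::vector<char> notNull;  // 1 = present, 0 = null
};

// Hypothetical stand-in for an Arrow-style int64 column:
// validity is bit-packed, least-significant bit first.
struct ArrowStyleInt64Column {
  std::vector<int64_t> values;
  std::vector<uint8_t> validity;
  std::size_t nullCount = 0;
};

// Converts one batch. Both steps touch every row, which is the
// per-batch overhead a direct Arrow-facing reader could avoid.
ArrowStyleInt64Column convert(const OrcStyleLongBatch& batch) {
  assert(batch.data.size() == batch.notNull.size());
  const std::size_t n = batch.data.size();

  ArrowStyleInt64Column out;
  out.values = batch.data;              // memory copy of the values
  out.validity.assign((n + 7) / 8, 0);  // one bit per row, rounded up to bytes

  // Transcoding: repack byte-per-row flags into a bitmap.
  for (std::size_t i = 0; i < n; ++i) {
    if (batch.notNull[i]) {
      out.validity[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    } else {
      ++out.nullCount;
    }
  }
  return out;
}
```

For a batch holding {10, null, 30}, convert() returns three copied values, a one-byte validity bitmap with bits 0 and 2 set, and a null count of 1.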



> On Jul 5, 2018, at 8:01 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> 
> I think improved Arrow C++ integration would be great. I haven't looked at
> the current state of the work to see what could be better. I'd be against
> making Arrow the default C++ API, but changes to the API to make things
> faster for Arrow make sense. (Although as always, we need to worry about
> backwards compatibility.)
> 
> Have you tried benchmarking and profiling the current adapters to see where
> the bottlenecks are?
> 
> .. Owen
> 
> On Wed, Jul 4, 2018 at 1:41 AM, Xiening Dai <xndai.git@live.com> wrote:
> 
>> Hi all,
>> 
>> Not sure if this has been brought up before - do we have a plan to support
>> Apache Arrow? Given its recent popularity and momentum, we might consider
>> supporting the Arrow format in the Orc reader and writer. There’s an adapter for
>> the Orc C++ reader -
>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc
>> - but the implementation is inefficient. If we want to
>> integrate better with Arrow, we should avoid conversions between
>> ColumnVectorBatch and the Arrow format.
>> 
