orc-dev mailing list archives

From Deepak Majeti <majeti.dee...@gmail.com>
Subject Re: Re: Arrow Support of Orc
Date Mon, 16 Jul 2018 19:45:35 GMT
Following up on Owen's question: do you have an estimate of the performance
gains from implementing the native support?

Creating a new API for supporting Arrow is a good starting point. Can you
come up with a design document first?

On Mon, Jul 16, 2018 at 4:24 AM 周宇睿(闻拙) <yurui.zyr@alibaba-inc.com> wrote:

> Hi All:
>
> Currently Arrow provides a naive implementation for converting
> ColumnVectorBatch to Arrow’s RecordBatch, which involves a lot of overhead
> from memory copying and transcoding.
>
> We would like to add a native API set that lets users read data directly
> from an ORC file into Arrow’s RecordBatch. The new API set will be separate
> from the current ColumnVectorBatch API, so that we won’t raise any backward
> compatibility issues.
>
> Creating a new API set is not an elegant solution, and it requires more
> maintenance effort. But given Arrow’s current momentum and its benefits
> for sharing columnar data across various platforms and data formats, we
> believe it is worth enabling Arrow support in ORC.
>
> Any advice would be appreciated.
> Thanks
> Yurui
>
> from Alimail macOS
>  ------------------Original Mail ------------------
> Sender:Xiening Dai <xndai.git@live.com>
> Send Date:Fri Jul 6 01:25:34 2018
> Recipients:dev@orc.apache.org <dev@orc.apache.org>
> Subject:Re: Arrow Support of Orc
> I haven’t done profiling. The major overhead I can see is the conversion
> from ColumnVectorBatch to Arrow’s RecordBatch, which involves memory copy
> and some transcoding. Also, the current adapter only supports reading an
> entire stripe as a batch, which in a lot of cases is not ideal. I agree that
> we should maintain backward compatibility. I am wondering whether we could
> expose another set of interfaces for Arrow that is built on top of the same
> ColumnReader/ColumnWriter classes.
>
>
>
> > On Jul 5, 2018, at 8:01 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> >
> > I think improved Arrow C++ integration would be great. I haven't looked at
> > the current state of the work to see what could be better. I'd be against
> > making Arrow the default C++ API, but changes to the API to make things
> > faster for Arrow make sense. (Although as always, we need to worry about
> > backwards compatibility.)
> >
> > Have you tried benchmarking and profiling the current adapters to see where
> > the bottlenecks are?
> >
> > .. Owen
> >
> > On Wed, Jul 4, 2018 at 1:41 AM, Xiening Dai <xndai.git@live.com> wrote:
> >
> >> Hi all,
> >>
> >> Not sure if this has been brought up before - do we have a plan to support
> >> Apache Arrow? Given its popularity and recent momentum, we might consider
> >> supporting the Arrow format for the Orc reader and writer. There’s an
> >> adapter for the Orc C++ reader -
> >> https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc
> >> but the implementation is inefficient. If we want to better integrate with
> >> Arrow, we should avoid conversions between ColumnVectorBatch and the Arrow
> >> format.
> >>
>
>

-- 
regards,
Deepak Majeti
