orc-dev mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: Re: [DISCUSS][C++] Add Support For INT/BYTE vector batch
Date Tue, 02 Apr 2019 18:37:39 GMT
If it makes the integration between ORC C++ and Arrow easier, that is a
good thing. Please file an ORC jira and create a pull request when the work
is ready.

Thank you,
   Owen

On Tue, Apr 2, 2019 at 7:29 AM Yurui Zhou <yurui.zyr@alibaba-inc.com> wrote:

> Hi Owen,
>
> Thank you for the response. Yes, you are right: in general, going from
> int64 to int16 doesn't save much memory. But for vectorized computation,
> such a change can make a big difference to CPU L1 cache usage.
>
> Another motivation for this change is that I am currently working on a
> copy-free Arrow adapter for Apache Arrow to speed up reading ORC files
> into Arrow RecordBatches. An Arrow RecordBatch has a strict mapping
> between type and data size. In the current C++ ORC reader, the data type
> does not match the underlying data size, so we need a memory copy to
> finish the conversion, which adds unnecessary overhead.
>
> Regarding your concern about backward compatibility, we can certainly add
> a flag so that current users do not suffer any API breakage.
>
> Thanks
> Yurui
>
> ------------------------------------------------------------------
> From: Owen O'Malley <owen.omalley@gmail.com>
> Date: 2019-04-02 01:02:02
> To: <dev@orc.apache.org>
> Subject: Re: [DISCUSS][C++] Add Support For INT/BYTE vector batch
>
> From the ORC library side, it isn't hard to support the additional vector
> types, although you'll need to make it API compatible for users that don't
> want it. For applications, I don't see a lot of advantages. For 1024 rows,
> the savings in memory between int64, int32, int16, and byte aren't that much
> (8k for int64 is still pretty small). However, for the application, having to have
> different code paths for each of the four integer types is a big hassle.
> Certainly, Hive does not want the other vector types and therefore, I don't
> think we should make the change on the Java side. If an application has a
> compelling use case on the C++ side, we can do it. Another concern is that
> the C++ side doesn't do automatic schema evolution and therefore reading a
> file with int64 when you were expecting an int32 would currently work, but
> won't once the new types are added.
>
> .. Owen
>
> On Mon, Apr 1, 2019 at 6:30 PM Yurui Zhou <yurui.zyr@alibaba-inc.com> wrote:
>
> > Hi guys:
> >
> > Currently ORC has LongVectorBatch as the only representation for
> > primitive integer types like boolean, byte, int and long. This is not very
> > beneficial for memory usage and computation efficiency. I would like to
> > introduce INT and BYTE vector batches in the ORC C++ version for types like
> > boolean, byte and int to improve memory efficiency. This change could
> > also benefit data consumers that use SIMD computation.
> > Let me know if you have any thoughts/suggestions.
> >
> > Thanks
> > Yurui
>
>
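
The two costs discussed in this thread can be made concrete with a small C++ sketch. It is an illustration only, not code from the ORC or Arrow libraries: `batchBytes` and `narrowCopy` are hypothetical helper names, and plain `std::vector` stands in for the real batch buffers. The first helper reproduces the per-batch arithmetic (1024 rows of int64 is 8192 bytes); the second shows the kind of element-by-element narrowing copy a consumer needs today when every integer column arrives as 64-bit values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Bytes needed for the value buffer of one batch (values only, no null mask).
template <typename T>
constexpr std::size_t batchBytes(std::size_t rows) {
  return rows * sizeof(T);
}

// Hypothetical helper, not part of the ORC C++ API: copy 64-bit values into
// a narrower integer buffer, rejecting any value that does not fit.
template <typename Narrow>
std::vector<Narrow> narrowCopy(const std::vector<int64_t>& wide) {
  std::vector<Narrow> out;
  out.reserve(wide.size());
  for (int64_t v : wide) {
    if (v < std::numeric_limits<Narrow>::min() ||
        v > std::numeric_limits<Narrow>::max()) {
      throw std::out_of_range("value does not fit in the narrow type");
    }
    out.push_back(static_cast<Narrow>(v));
  }
  assert(out.size() == wide.size());
  return out;
}
```

For 1024 rows, `batchBytes` gives 8192, 4096, 2048 and 1024 bytes for int64, int32, int16 and int8 respectively. A reader that filled a correctly sized buffer directly would make the `narrowCopy` pass unnecessary, which is the overhead the Arrow adapter is trying to avoid.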
