orc-dev mailing list archives

From Gang Wu <gan...@apache.org>
Subject Re: Re: [DISCUSS][C++] Add Support For INT/BYTE vector batch
Date Tue, 02 Apr 2019 23:21:47 GMT
I am in favor of this change in the C++ codebase, not only for its small
saving of runtime memory but also for getting rid of a lot of conversions.
Regarding Owen's concern, I think we can do the following:
1. Add a writer/reader option to use the new int/byte batch, and by default
still use the old approach to provide backward compatibility.
2. Restrict the flexibility of using the new batch types. For example, we can
only use an int32 batch for a column typed int32 but cannot use it for a
column typed int64, etc.
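
The two points above could look roughly like the following. This is only a
minimal sketch under stated assumptions: every name in it (ColumnKind,
BatchKind, ReaderOptionsSketch, useTightNumericBatch, selectBatchKind,
isBatchAllowed) is hypothetical and is not part of the existing ORC C++ API.

```cpp
#include <stdexcept>

// Hypothetical names throughout; none of this is the actual ORC C++ API.
enum class ColumnKind { Boolean, Byte, Int, Long };
enum class BatchKind { Byte, Int, Long };

struct ReaderOptionsSketch {
  // Off by default, so existing callers keep getting the 64-bit batch.
  bool useTightNumericBatch = false;
};

// The narrow batch that matches a column type exactly.
inline BatchKind tightBatchFor(ColumnKind column) {
  switch (column) {
    case ColumnKind::Boolean:
    case ColumnKind::Byte:
      return BatchKind::Byte;
    case ColumnKind::Int:
      return BatchKind::Int;
    case ColumnKind::Long:
      return BatchKind::Long;
  }
  throw std::logic_error("unknown column kind");
}

// Point 1: the option decides which batch the reader hands out.
inline BatchKind selectBatchKind(ColumnKind column,
                                 const ReaderOptionsSketch& options) {
  return options.useTightNumericBatch ? tightBatchFor(column)
                                      : BatchKind::Long;
}

// Point 2: a narrow batch is accepted only for the exactly matching column
// type; the 64-bit batch stays valid everywhere (the old behaviour).
inline bool isBatchAllowed(ColumnKind column, BatchKind requested) {
  return requested == BatchKind::Long || requested == tightBatchFor(column);
}
```

With the option left at its default, selectBatchKind returns the 64-bit batch
for every column, so existing code paths would not change.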

Thanks,
Gang

On Tue, Apr 2, 2019 at 11:37 AM Owen O'Malley <owen.omalley@gmail.com>
wrote:

> If it makes the integration between ORC C++ and Arrow easier, that is a
> good thing. Please file an ORC jira and create a pull request when the work
> is ready.
>
> Thank you,
>    Owen
>
> On Tue, Apr 2, 2019 at 7:29 AM Yurui Zhou <yurui.zyr@alibaba-inc.com>
> wrote:
>
> > Hi Owen,
> >
> > Thank you for the response. Yes, you are right, generally it doesn't save
> > much memory going from int64 to int16. But when it comes to vectorized
> > computation, such a change may make a big difference to the CPU L1 cache.
> >
> > Another motivation for me to drive this change is that I am currently
> > working on a copy-free Arrow adapter implementation for Apache Arrow to
> > boost the performance of reading ORC files into an Arrow RecordBatch. The
> > Arrow RecordBatch has a strict mapping between type and data size.
> > Currently in the C++ ORC reader, because the data type does not actually
> > align with the underlying data size, we need to perform a memory copy to
> > finish the conversion, which involves unnecessary overhead.
> >
> > Regarding your concern about backward compatibility, we can certainly add
> > a flag to make sure current users do not suffer from any API breakage.
> >
> > Thanks
> > Yurui
> >
> > from Alimail macOS <https://mail.alibaba-inc.com>
> >
> > ------------------------------------------------------------------
> > From: Owen O'Malley <owen.omalley@gmail.com>
> > Date: 2019-04-02 01:02:02
> > To: <dev@orc.apache.org>
> > Subject: Re: [DISCUSS][C++] Add Support For INT/BYTE vector batch
> >
> > From the ORC library side, it isn't hard to support the additional vector
> > types, although you'll need to make it API compatible for users that
> > don't want it. For applications, I don't see a lot of advantages. For 1024
> > rows, the savings in memory between int64, int32, int16, and byte isn't
> > that much (8k is still pretty small). However, for the application, having
> > to have different code paths for each of the four integer types is a big
> > hassle. Certainly, Hive does not want the other vector types and
> > therefore, I don't think we should make the change on the Java side. If an
> > application has a compelling use case on the C++ side, we can do it.
> > Another concern is that the C++ side doesn't do automatic schema evolution
> > and therefore reading a file with int64 when you were expecting an int32
> > would currently work, but won't if you make the new types.
> >
> > .. Owen
> >
> > On Mon, Apr 1, 2019 at 6:30 PM Yurui Zhou <yurui.zyr@alibaba-inc.com>
> > wrote:
> >
> > > Hi guys:
> > >
> > > Currently ORC has LongVectorBatch as the only representation for
> > > primitive integer types like boolean, byte, int and long. This is not
> > > very beneficial for memory usage and computation efficiency. I would like
> > > to introduce INT and BYTE vector batches in the ORC C++ version for types
> > > like boolean, byte and int to improve memory efficiency. This change
> > > would also potentially benefit data consumers in the case of SIMD
> > > computation.
> > > Let me know if you have any thoughts/suggestions.
> > >
> > > Thanks
> > > Yurui
> > >
> > > from Alimail macOS
> >
> >
>
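
For reference, the buffer sizes and the conversion copy discussed in the
quoted thread can be sketched as below. This is an illustration under stated
assumptions, not code from ORC or Arrow: batchBytes and narrowToInt32 are
hypothetical helper names, and plain std::vector stands in for the real batch
buffers.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Bytes needed for the value buffer of one batch holding `rows` values of T.
// Hypothetical helper, used only to show the arithmetic from the thread.
template <typename T>
constexpr std::size_t batchBytes(std::size_t rows) {
  return rows * sizeof(T);
}

// The extra copy a consumer with a strict 32-bit layout has to make when the
// reader only produces 64-bit values. Hypothetical helper; a batch typed as
// int32 from the start would make this pass unnecessary.
inline std::vector<std::int32_t> narrowToInt32(
    const std::vector<std::int64_t>& wide) {
  std::vector<std::int32_t> narrow;
  narrow.reserve(wide.size());
  for (std::int64_t value : wide) {
    if (value < std::numeric_limits<std::int32_t>::min() ||
        value > std::numeric_limits<std::int32_t>::max()) {
      throw std::range_error("value does not fit in int32");
    }
    narrow.push_back(static_cast<std::int32_t>(value));
  }
  return narrow;
}
```

For 1024 rows, batchBytes gives 8192 bytes for int64, 4096 for int32 and 1024
for a one-byte type, which matches the "8k" figure mentioned in the thread.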
