orc-dev mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: [DISCUSS][C++] Add Support For INT/BYTE vector batch
Date Mon, 01 Apr 2019 17:02:02 GMT
From the ORC library side, it isn't hard to support the additional vector
types, although you'll need to keep the API compatible for users that don't
want them. For applications, I don't see a lot of advantages. For 1024 rows,
the memory savings between int64, int32, int16, and byte aren't that large
(8k is still pretty small). However, for the application, having to maintain
different code paths for each of the four integer types is a big hassle.
Certainly, Hive does not want the other vector types, and therefore I don't
think we should make the change on the Java side. If an application has a
compelling use case on the C++ side, we can do it. Another concern is that
the C++ side doesn't do automatic schema evolution, so reading a file with
int64 when you were expecting an int32 currently works, but won't once you
add the new types.

.. Owen
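
A rough sketch of the two application-side concerns in the reply above
(separate code paths per integer width, and a reader that expects int32 but
gets int64). It is not written against the real ORC headers: `IntBatch<T>`,
`AnyIntBatch`, `sumLong`, `sumAny`, and `readsAsInt32` are all hypothetical
names invented for illustration, and the code assumes C++17 for
`std::variant`.

```cpp
#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical stand-in for a width-specific integer batch; NOT the ORC API.
template <typename T>
struct IntBatch {
  std::vector<T> data;
};

// What a consumer would have to accept if a reader could hand back any of
// the four widths (hypothetical alias).
using AnyIntBatch =
    std::variant<IntBatch<std::int8_t>, IntBatch<std::int16_t>,
                 IntBatch<std::int32_t>, IntBatch<std::int64_t>>;

// Single-representation case: one code path, whatever width the file declares.
std::int64_t sumLong(const IntBatch<std::int64_t>& batch) {
  std::int64_t total = 0;
  for (std::int64_t v : batch.data) total += v;
  return total;
}

// Width-specific case: every consumer needs a dispatch step like this one.
std::int64_t sumAny(const AnyIntBatch& batch) {
  return std::visit(
      [](const auto& b) {
        std::int64_t total = 0;
        for (auto v : b.data) total += v;
        return total;
      },
      batch);
}

// A consumer hard-wired to int32 cannot use an int64 batch without an
// explicit conversion: this returns false when the batch holds another width.
bool readsAsInt32(const AnyIntBatch& batch) {
  return std::get_if<IntBatch<std::int32_t>>(&batch) != nullptr;
}
```

With a single 64-bit representation, `sumLong` works regardless of the
declared column width. With width-specific batches, the caller either
dispatches as in `sumAny` or has to handle the mismatch that `readsAsInt32`
detects.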

On Mon, Apr 1, 2019 at 6:30 PM Yurui Zhou <yurui.zyr@alibaba-inc.com> wrote:

> Hi guys:
>
> Currently ORC has LongVectorBatch as the only representation for
> primitive integer types like boolean, byte, int and long. This is not very
> beneficial for memory usage or computation efficiency. I would like to
> introduce INT and BYTE vector batches in the ORC C++ version for types like
> boolean, byte and int to improve memory efficiency. This change would
> also potentially benefit data consumers in the case of SIMD computation.
> Let me know if you have any thoughts/suggestions.
>
> Thanks
> Yurui
>
> from Alimail macOS
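
To put the memory argument in numbers, here is a small sketch of the value
buffer size for one batch at each integer width, using the 1024-row batch
size mentioned in the reply. The helper names `valueBufferBytes` and
`savingsVsInt64` are hypothetical, and the figures cover only the value
buffer, not the null mask or other per-batch bookkeeping.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed for the value buffer of one batch holding `rows` values of
// type T. Ignores the null mask and any other per-batch overhead.
template <typename T>
constexpr std::size_t valueBufferBytes(std::size_t rows) {
  return rows * sizeof(T);
}

// Bytes saved per batch by storing values as Narrow instead of int64_t.
template <typename Narrow>
constexpr std::size_t savingsVsInt64(std::size_t rows) {
  return valueBufferBytes<std::int64_t>(rows) - valueBufferBytes<Narrow>(rows);
}

// For 1024 rows: int64 needs 8192 bytes, int32 4096, int16 2048, int8 1024.
static_assert(valueBufferBytes<std::int64_t>(1024) == 8192, "8 KiB per batch");
static_assert(valueBufferBytes<std::int8_t>(1024) == 1024, "1 KiB per batch");
```

So the largest possible saving per column per batch is
`savingsVsInt64<std::int8_t>(1024)`, or 7168 bytes. Whether that matters
depends on how many columns and batches an application keeps in memory at
once.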
