orc-dev mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: Type length, scale, and precision?
Date Tue, 02 Apr 2019 22:47:31 GMT
Sorry, I managed to miss this message.

On Tue, Mar 19, 2019 at 9:31 PM Dain Sundstrom <dain@iq80.com> wrote:

> For the types in the ORC footer, we have the following:
>  // the maximum length of the type for varchar or char in UTF-8 characters
>  optional uint32 maximumLength = 4;
>  // the precision and scale for decimal
>  optional uint32 precision = 5;
>  optional uint32 scale = 6;
> If the maximumLength is set to N, can I be confident that no value for
> that column in the file will contain more than N UTF-8 characters?  Is this
> still true for concatenated ORC files?

Yes. The merger should insist that the schemas are the same for all merged
files. We could consider loosening that restriction, but in all cases the
length of the values must be no more than the declared length in the footer.

Until recently we had a bug that was truncating to N bytes instead of N
UTF-8 characters. That was a mistake.
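To make the distinction concrete, here is a minimal Python sketch contrasting character-based truncation with the buggy byte-based truncation; the function names are illustrative, not from the ORC codebase:

```python
def truncate_chars(s: str, n: int) -> str:
    """Correct behavior: keep at most n UTF-8 characters (code points)."""
    return s[:n]

def truncate_bytes(s: str, n: int) -> str:
    """Buggy behavior: keep at most n UTF-8 *bytes*, dropping any
    partial character left at the cut point."""
    return s.encode("utf-8")[:n].decode("utf-8", errors="ignore")

value = "héllo"  # 'é' occupies 2 bytes in UTF-8, so 5 chars = 6 bytes
print(truncate_chars(value, 4))  # héll
print(truncate_bytes(value, 4))  # hél (4 bytes cover only 3 characters)
```

The two agree for pure-ASCII data, which is why the bug could go unnoticed; any multi-byte character shifts the byte-based cut point earlier.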

> I have a similar question about DECIMAL.  Decimal encoding currently uses
> the SECONDARY stream to encode the "scale".  Is this scale guaranteed to be
> the same scale as the type scale in the footer?

In Hive 0.11 the decimal values didn't have a declared scale. That is why
the scale is encoded per value. For short decimals (p <= 18) in recent
Hive/ORC versions, you'll have that guarantee. Otherwise, it still uses the
HiveDecimalWritable code, which removes trailing zeros, so the scale for a
value may be less than the declared scale.
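So a reader that wants every value at the declared scale must rescale. A hedged Python sketch of that normalization, assuming the unscaled integer and its per-value scale have already been decoded from the streams (the function name is illustrative):

```python
def rescale(unscaled: int, value_scale: int, declared_scale: int) -> int:
    """Bring a decimal's unscaled integer from its per-value scale
    (as stored in the SECONDARY stream) up to the declared footer
    scale by appending the stripped trailing zeros."""
    if value_scale > declared_scale:
        raise ValueError("per-value scale exceeds declared scale")
    return unscaled * 10 ** (declared_scale - value_scale)

# 1.20 with its trailing zero stripped is stored as unscaled=12, scale=1;
# a column declared decimal(10, 2) expects scale 2:
print(rescale(12, 1, 2))  # 120, i.e. 1.20 at scale 2
```

The multiplication is exact because Python integers are arbitrary precision; a fixed-width implementation would also need an overflow check against the declared precision.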

> Thanks,
> -dain
> ----
> Dain Sundstrom
> Co-founder @ Presto Software Foundation, Co-creator of Presto (
> https://prestosql.io)
