orc-dev mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: String stats requirements?
Date Tue, 06 Jun 2017 23:11:35 GMT
Yes, HIVE-7144 went in before HIVE-8732 so any file with WriterVersion >= 1
should be UTF-8 in the statistics.

.. Owen
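
For reader implementers, the rule above could be sketched roughly as follows. This is only an illustration, not code from Hive or ORC: the function and constant names are hypothetical, and treating a missing writer version as "not fixed" is an assumption.

```python
# Hypothetical constant: the thread describes writer version 1 as "HIVE-8732 fixed".
HIVE_8732_WRITER_VERSION = 1


def string_stats_are_utf8(writer_version):
    """Return True if the file's string min/max statistics should be UTF-8.

    Follows the rule stated in the thread: any file with
    WriterVersion >= 1 should have UTF-8 string statistics.
    """
    # Assumption: a file with no recorded writer version predates the fix.
    if writer_version is None:
        return False
    return writer_version >= HIVE_8732_WRITER_VERSION
```

A reader could fall back to ignoring string min/max (no predicate pruning on strings) whenever this returns False.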

On Tue, Jun 6, 2017 at 4:05 PM, Dain Sundstrom <dain@iq80.com> wrote:

> Ah I see.  I can’t believe I missed this fix :)
>
> Our reader was originally written in the 0.13 days, which used Strings
> for stats.  This is the commit that changed everything to Text, and I
> believe it went out with Hive 0.14:
>
>   https://github.com/apache/hive/commit/6072e3aed88d9246e1130abadf3c15a88e975b4e#diff-340d190f994d92658b24aae1edf610b3
>
> Is writer version “1 = HIVE-8732 fixed” after 0.14?  If so I can update my
> reader to detect this.
>
> -dain
>
> > On Jun 6, 2017, at 3:36 PM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> >
> > On Tue, Jun 6, 2017 at 3:02 PM, Dain Sundstrom <dain@iq80.com> wrote:
> >
> >> Is it required that the StringStatistics min and max be the actual min and
> >> max values for the column?  I ask for two reasons.  First, I’d like to be
> >> able to “trim” values if the min or max is very large.  Second, as a
> >> workaround for the UTF-16BE sorting problem (bug?), I’d like to trim values
> >> at the first surrogate pair, so the value is slightly smaller than the min
> >> or larger than the max, and still a valid UTF-8 sequence.
> >>
> >
> > I agree that we want to be able to trim the values. I've seen cases where
> > the String is huge (~100k) and makes the StringStatistics huge. I'd propose
> > that we do something like:
> >
> > message StringStatistics {
> >  optional string minimum = 1;
> >  optional string maximum = 2;
> >  // sum will store the total length of all strings in a stripe
> >  optional sint64 sum = 3;
> >  // if set, the minimum will not be set and the lowerBound <= all values
> >  optional string lowerBound = 4;
> >  // if set, the maximum will not be set and the upperBound >= all values
> >  optional string upperBound = 5;
> > }
> >
> > We shouldn't have any UTF-16 in ORC. Is there a case where we compare
> > strings that way? In particular, the StringStatistics uses Text, which uses
> > UTF-8 as its encoding.
> >
> > .. Owen
> >
> >
> >> Thoughts?
> >>
> >> -dain
> >>
> >>
>
>

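The two points discussed in the quoted messages (UTF-16 versus UTF-8 ordering, and trimmed lowerBound/upperBound values) can be illustrated with a small sketch. Nothing here comes from the ORC code base: the function names are hypothetical, truncating by code point count rather than byte length is an assumption, and the thread does not specify how a writer would compute the bounds.

```python
def utf8_key(s):
    # Sort key matching byte-wise comparison of UTF-8 (as Text compares).
    return s.encode("utf-8")


def utf16_key(s):
    # Sort key matching comparison of UTF-16 code units (as Java Strings compare).
    return s.encode("utf-16-be")


def lower_bound(value, max_chars):
    # A prefix always sorts <= the full string, so plain truncation is safe.
    return value[:max_chars]


def upper_bound(value, max_chars):
    # Return a short string that sorts >= value, or None if none exists.
    if len(value) <= max_chars:
        return value
    prefix = value[:max_chars]
    # Increment the last code point that can be incremented.
    for i in range(len(prefix) - 1, -1, -1):
        cp = ord(prefix[i]) + 1
        if 0xD800 <= cp <= 0xDFFF:
            cp = 0xE000  # skip the surrogate range, which is not valid UTF-8
        if cp <= 0x10FFFF:
            return prefix[:i] + chr(cp)
    return None  # caller would leave upperBound unset


bmp = "\uFF5E"          # U+FF5E, a single UTF-16 code unit
astral = "\U0001F600"   # U+1F600, a surrogate pair in UTF-16

# The two encodings disagree on the order of these values.
print(utf8_key(bmp) < utf8_key(astral))    # UTF-8 order
print(utf16_key(bmp) < utf16_key(astral))  # UTF-16 order

print(lower_bound("apple pie", 3), upper_bound("apple pie", 3))
```

Because Python compares strings by code point, which matches UTF-8 byte order, the returned bounds can be checked directly with `<=` and `>=` against the original values.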