orc-dev mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: String stats requirements?
Date Tue, 06 Jun 2017 22:36:27 GMT
On Tue, Jun 6, 2017 at 3:02 PM, Dain Sundstrom <dain@iq80.com> wrote:

> Is it required that the StringStatistics min and max be the actual min and
> max value for the column?  I ask for two reasons. First, I’d like to be
> able to “trim” values if the min or max is very large.  Second, as a
> workaround for the UTF-16BE sorting problem (bug?), I’d like to trim values at the
> first surrogate pair, so the value is slightly smaller than the min or
> larger than the max, and still a valid UTF-8 sequence.

I agree that we want to be able to trim the values. I've seen cases where
the String is huge (~100k) and makes the StringStatistics huge. I'd propose
that we do something like:

message StringStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  // sum will store the total length of all strings in a stripe
  optional sint64 sum = 3;
  // if set, the minimum will not be set and the lowerBound <= all values
  optional string lowerBound = 4;
  // if set, the maximum will not be set and the upperBound >= all values
  optional string upperBound = 5;
}
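
To make the lowerBound/upperBound semantics concrete, here is a minimal sketch of how a writer might derive trimmed bounds. It is an illustration only, not ORC's implementation: the function names and the code-point limit are hypothetical, and a real writer would more likely limit by UTF-8 byte length. It relies on the fact that UTF-8 byte order matches code point order, so a prefix is always <= the full string, and a prefix whose last code point has been bumped is always >= it.

```python
MAX_CODE_POINT = 0x10FFFF


def lower_bound(value: str, limit: int) -> str:
    """Return a string <= value with at most `limit` code points."""
    # A prefix always sorts at or before the full string.
    return value[:limit]


def upper_bound(value: str, limit: int):
    """Return a string >= value with at most `limit` code points.

    Returns None when no such bound exists (hypothetical convention);
    the caller would then keep the full value instead.
    """
    if len(value) <= limit:
        return value
    prefix = list(value[:limit])
    # Bump the last code point that can still be incremented,
    # dropping trailing code points that are already at the maximum.
    while prefix:
        cp = ord(prefix[-1])
        if cp < MAX_CODE_POINT:
            cp += 1
            if 0xD800 <= cp <= 0xDFFF:
                # Surrogates cannot be encoded in UTF-8; skip past them.
                cp = 0xE000
            prefix[-1] = chr(cp)
            return "".join(prefix)
        prefix.pop()
    return None


if __name__ == "__main__":
    huge = "abcdef" * 1000
    print(lower_bound(huge, 3), upper_bound(huge, 3))
```

Cutting on code points rather than raw bytes keeps both bounds valid UTF-8, which is the property asked for in the original question.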

We shouldn't have any UTF-16 in ORC. Is there a case where we compare
strings that way? In particular, the StringStatistics uses Text, which uses
UTF-8 as its encoding.
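
For reference, the ordering difference in question can be reproduced with a small sketch. The two sample characters are arbitrary choices for illustration: one sits high in the Basic Multilingual Plane, the other needs a surrogate pair in UTF-16. Comparing UTF-8 bytes follows code point order, while comparing UTF-16 code units (which is what Java's String.compareTo does) puts the surrogate pair first.

```python
bmp_char = "\uFFFD"         # U+FFFD: a single UTF-16 code unit (0xFFFD)
astral_char = "\U0001F600"  # U+1F600: a surrogate pair in UTF-16 (0xD83D 0xDE00)

# Sort by the raw encoded bytes; big-endian UTF-16 bytes compare the
# same way as the underlying 16-bit code units.
utf8_order = sorted([bmp_char, astral_char], key=lambda s: s.encode("utf-8"))
utf16_order = sorted([bmp_char, astral_char], key=lambda s: s.encode("utf-16-be"))

if __name__ == "__main__":
    print([hex(ord(c)) for c in utf8_order])
    print([hex(ord(c)) for c in utf16_order])
```

A reader and writer that disagree on which of these orders defines min and max can disagree on whether a value falls inside the recorded range.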

.. Owen

> Thoughts?
> -dain
