orc-dev mailing list archives

From omalley <...@git.apache.org>
Subject [GitHub] orc issue #169: [WIP] ORC-203 Modify the StringStatistics to trim the minimu...
Date Tue, 26 Sep 2017 20:27:20 GMT
Github user omalley commented on the issue:

    Ok, starting with the representation. I'd suggest it look like:
        message StringStatistics {
          optional string minimum = 1;
          optional string maximum = 2;
          // sum will store the total length of all strings in a stripe
          optional sint64 sum = 3;
          // If the minimum or maximum value was longer than 1024 bytes, store a lower or
          // upper bound instead of the minimum or maximum values above.
          optional string lowerBound = 4;
          optional string upperBound = 5;
        }
    Now obviously the lowerBound can just be the string truncated (at a UTF-8 character boundary!)
to at most 1024 bytes. The upperBound is the same truncated string with its last code point
increased by one.
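
    A minimal sketch of that truncation, assuming nothing about the eventual ORC code: the
class name BoundSketch and both method names are hypothetical. It also makes two choices the
comment above does not spell out: when the last code point is already U+10FFFF it is dropped
and the one before it is incremented, and the surrogate range is skipped. Note that the
incremented code point can need more UTF-8 bytes than the original, so the upper bound may
exceed the limit by a few bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ORC: illustrates the truncation rules only.
final class BoundSketch {
  // Longest prefix of value that fits in maxBytes of UTF-8 without splitting a character.
  static String lowerBound(String value, int maxBytes) {
    byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
    if (utf8.length <= maxBytes) {
      return value; // short enough: the exact value is its own bound
    }
    int end = maxBytes;
    // Continuation bytes look like 10xxxxxx; back up until the cut sits before a lead byte.
    while (end > 0 && (utf8[end] & 0xC0) == 0x80) {
      end--;
    }
    return new String(utf8, 0, end, StandardCharsets.UTF_8);
  }

  // Truncated prefix with its last code point increased by one.
  static String upperBound(String value, int maxBytes) {
    String prefix = lowerBound(value, maxBytes);
    if (prefix.length() == value.length()) {
      return value; // nothing was truncated
    }
    int end = prefix.length();
    while (end > 0) {
      int last = prefix.codePointBefore(end);
      int start = end - Character.charCount(last);
      if (last < Character.MAX_CODE_POINT) {
        // Assumption: skip the surrogate range, which holds no valid code points.
        int next = (last + 1 == 0xD800) ? 0xE000 : last + 1;
        return new StringBuilder(prefix.substring(0, start)).appendCodePoint(next).toString();
      }
      end = start; // last code point is U+10FFFF: drop it and carry to the previous one
    }
    throw new IllegalArgumentException("no upper bound exists for this prefix");
  }
}
```

    For example, with a 3-byte limit "abcdef" gives the lower bound "abc" and the upper bound
"abd", and every string starting with "abc" sorts between the two in code point order.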
    In the StringStatisticsImpl, I'd keep two boolean flags, one recording whether the minimum
is a real value or an approximation and one doing the same for the maximum. The value comparison
is the same: unless the current value is less than the lower bound, it won't change the lower
bound, and the same is true for the upper bound. If the new minimum/maximum is not truncated,
the corresponding wasTruncated flag should be cleared. When merging, the flag follows the value.
In the corner case of two identical values where one was truncated, the non-truncated one is
the result.
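
    The flag handling could look roughly like the sketch below. It is not the actual
StringStatisticsImpl: the class MinMaxSketch and the offerMin/offerMax/merge names are
hypothetical, callers are assumed to pass in values that have already been truncated, and
String.compareTo stands in for whatever comparison ORC really uses.

```java
// Hypothetical sketch of min/max tracking where each value carries a "truncated" flag.
final class MinMaxSketch {
  String min;            // exact minimum, or a lower bound when minTruncated is true
  boolean minTruncated;
  String max;            // exact maximum, or an upper bound when maxTruncated is true
  boolean maxTruncated;

  // Offer a candidate minimum; truncated says whether text is only a lower bound.
  void offerMin(String text, boolean truncated) {
    int cmp = (min == null) ? -1 : text.compareTo(min);
    // Replace on a strictly smaller value, or on a tie where the exact value beats the bound.
    if (cmp < 0 || (cmp == 0 && minTruncated && !truncated)) {
      min = text;
      minTruncated = truncated; // the flag follows the value
    }
  }

  // Offer a candidate maximum; truncated says whether text is only an upper bound.
  void offerMax(String text, boolean truncated) {
    int cmp = (max == null) ? 1 : text.compareTo(max);
    if (cmp > 0 || (cmp == 0 && maxTruncated && !truncated)) {
      max = text;
      maxTruncated = truncated;
    }
  }

  // Merging is just offering the other side's values together with their flags.
  void merge(MinMaxSketch other) {
    if (other.min != null) {
      offerMin(other.min, other.minTruncated);
    }
    if (other.max != null) {
      offerMax(other.max, other.maxTruncated);
    }
  }
}
```

    The tie rule is safe because a truncated lower bound "abc" stands for some longer string
starting with "abc", which sorts after the exact value "abc"; the mirror argument holds for
the upper bound.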
    We should end up with four accessor methods:
    * String getMinimum();
    * String getLowerBound();
    * String getMaximum();
    * String getUpperBound();
    If we only have a lower bound, getMinimum should return null, and likewise getMaximum
should return null if we only have an upper bound. If no truncation was done, getLowerBound
and getUpperBound should match getMinimum and getMaximum.
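
    As a rough illustration of those accessor semantics (the class StringStatsView and its
constructor are hypothetical; only the four getter names come from the list above):

```java
// Hypothetical read-only view showing how the four getters relate to the flags.
final class StringStatsView {
  private final String min;           // exact minimum or lower bound
  private final boolean minTruncated;
  private final String max;           // exact maximum or upper bound
  private final boolean maxTruncated;

  StringStatsView(String min, boolean minTruncated, String max, boolean maxTruncated) {
    this.min = min;
    this.minTruncated = minTruncated;
    this.max = max;
    this.maxTruncated = maxTruncated;
  }

  // The exact minimum, or null when only a lower bound is known.
  String getMinimum() {
    return minTruncated ? null : min;
  }

  // Always available; equals getMinimum() when nothing was truncated.
  String getLowerBound() {
    return min;
  }

  // The exact maximum, or null when only an upper bound is known.
  String getMaximum() {
    return maxTruncated ? null : max;
  }

  // Always available; equals getMaximum() when nothing was truncated.
  String getUpperBound() {
    return max;
  }
}
```

    A reader that needs exact values would check getMinimum/getMaximum for null, while
predicate pushdown could rely on getLowerBound/getUpperBound alone.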

