Github user omalley commented on the issue:
https://github.com/apache/orc/pull/169
Ok, starting with the representation. I'd suggest it look like:
message StringStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  // sum will store the total length of all strings in a stripe
  optional sint64 sum = 3;
  // If the minimum or maximum value was longer than 1024 bytes, store a lower or
  // upper bound instead of the minimum or maximum values above.
  optional string lowerBound = 4;
  optional string upperBound = 5;
}
Now obviously the lowerBound can just be the string truncated (at a utf8 character boundary!)
to at most 1024 bytes. The upperBound is the same with the last code point increased by one.
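A minimal sketch of the two bound computations, assuming well-formed input strings and code point (equivalently UTF-8 byte) ordering. The class and method names (BoundSketch, safeCut, lowerBound, upperBound) are made up for illustration and are not existing ORC APIs. It also handles two cases the description above skips: a last code point that cannot be increased, and the surrogate range.

```java
import java.nio.charset.StandardCharsets;

final class BoundSketch {
    /** Largest cut <= limit that does not split a UTF-8 encoded character. */
    static int safeCut(byte[] utf8, int limit) {
        if (utf8.length <= limit) {
            return utf8.length;
        }
        int cut = limit;
        // Continuation bytes look like 10xxxxxx; back up to the start of the character.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    /** The value itself if it fits, otherwise its longest prefix of at most maxBytes. */
    static String lowerBound(String value, int maxBytes) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= maxBytes) {
            return value;
        }
        return new String(utf8, 0, safeCut(utf8, maxBytes), StandardCharsets.UTF_8);
    }

    /**
     * The value itself if it fits, otherwise the truncated prefix with its last
     * code point increased by one. Returns null if no such bound exists.
     * Note: the increment can make the encoding longer (e.g. U+007F -> U+0080),
     * so the result may slightly exceed maxBytes.
     */
    static String upperBound(String value, int maxBytes) {
        String prefix = lowerBound(value, maxBytes);
        if (prefix.length() == value.length()) {
            return value; // nothing was cut off
        }
        int[] cps = prefix.codePoints().toArray();
        int n = cps.length;
        // A trailing U+10FFFF cannot be increased; drop it and bump the one before.
        while (n > 0 && cps[n - 1] == Character.MAX_CODE_POINT) {
            n--;
        }
        if (n == 0) {
            return null;
        }
        int next = cps[n - 1] + 1;
        if (next == 0xD800) {
            next = 0xE000; // skip the surrogate range, which is not valid in UTF-8
        }
        cps[n - 1] = next;
        return new String(cps, 0, n);
    }
}
```

With a limit of 2 bytes, for example, "aé" (3 bytes) would get the lower bound "a" and the upper bound "b", since cutting at byte 2 would split the "é".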
In the StringStatisticsImpl, I'd keep two boolean flags recording whether the stored minimum
and maximum are real values or approximations. The value comparison is the same: unless
the current value is less than the lower bound, it won't change the lower bound, and the same
is true for the upper bound. If the new minimum/maximum is not truncated, the corresponding
wasTruncated flag should be cleared. When merging, the flag follows the value. In the corner
case of two identical values where one was truncated, the nontruncated one is the result.
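Roughly, the minimum side could look like the sketch below; the maximum side would be symmetric. MinimumTracker and its members are hypothetical names, not the actual StringStatisticsImpl code. The byte limit is a constructor parameter only to keep examples small, and String.compareTo stands in for whatever comparison the real implementation uses.

```java
import java.nio.charset.StandardCharsets;

final class MinimumTracker {
    private final int maxBytes;
    private String lower;       // exact minimum, or a truncated lower bound
    private boolean truncated;  // true when `lower` is only a bound

    MinimumTracker(int maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Cut the value to at most maxBytes of UTF-8 without splitting a character. */
    private String truncate(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= maxBytes) {
            return value;
        }
        int cut = maxBytes;
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return new String(utf8, 0, cut, StandardCharsets.UTF_8);
    }

    void update(String value) {
        if (lower == null || value.compareTo(lower) < 0) {
            // A smaller value replaces the bound; the flag records whether it was cut.
            lower = truncate(value);
            truncated = lower.length() < value.length();
        } else if (value.equals(lower)) {
            // The value equals the bound exactly, so the bound is a real minimum.
            truncated = false;
        }
    }

    void merge(MinimumTracker other) {
        if (other.lower == null) {
            return;
        }
        int cmp = (lower == null) ? 1 : lower.compareTo(other.lower);
        if (cmp > 0) {
            // The flag follows the value.
            lower = other.lower;
            truncated = other.truncated;
        } else if (cmp == 0) {
            // Identical values: the nontruncated one wins.
            truncated = truncated && other.truncated;
        }
    }

    String getLowerBound() {
        return lower;
    }

    boolean isTruncated() {
        return truncated;
    }
}
```

With a 3-byte limit, updating with "zebra" and then "ant" would leave "ant" as an exact minimum, because the second value is smaller and fits within the limit.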
We should end up with four methods:
* String getMinimum();
* String getLowerBound();
* String getMaximum();
* String getUpperBound();
If we only have a lower bound, getMinimum should return null, and likewise getMaximum
if we only have an upper bound. getLowerBound and getUpperBound should match getMinimum
and getMaximum if no truncation was done.
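To make the getter contract concrete, here is a small sketch. StringStatsView and its fields are hypothetical and only illustrate the null-versus-bound behavior described above; the real class would also carry the sum and the update logic.

```java
final class StringStatsView {
    private final String lower;
    private final boolean lowerTruncated;
    private final String upper;
    private final boolean upperTruncated;

    StringStatsView(String lower, boolean lowerTruncated,
                    String upper, boolean upperTruncated) {
        this.lower = lower;
        this.lowerTruncated = lowerTruncated;
        this.upper = upper;
        this.upperTruncated = upperTruncated;
    }

    /** The exact minimum, or null if only a lower bound is known. */
    String getMinimum() {
        return lowerTruncated ? null : lower;
    }

    /** Always available; equals the minimum when nothing was truncated. */
    String getLowerBound() {
        return lower;
    }

    /** The exact maximum, or null if only an upper bound is known. */
    String getMaximum() {
        return upperTruncated ? null : upper;
    }

    /** Always available; equals the maximum when nothing was truncated. */
    String getUpperBound() {
        return upper;
    }
}
```

A reader doing predicate pushdown could then rely on getLowerBound and getUpperBound unconditionally, and use getMinimum and getMaximum only when it needs the exact values.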

