orc-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From omalley <...@git.apache.org>
Subject [GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Date Fri, 13 Apr 2018 22:51:09 GMT
Github user omalley commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181456234
  
    --- Diff: site/_docs/file-tail.md ---
    @@ -249,12 +249,25 @@ For booleans, the statistics include the count of false and true
values.
     }
     ```
     
    -For decimals, the minimum, maximum, and sum are stored.
    +For decimals, the minimum, maximum, and sum are stored. In ORC 2.0,
    +string representation is deprecated and DecimalStatistics uses integers
    +which have better performance.
     
     ```message DecimalStatistics {
      optional string minimum = 1;
      optional string maximum = 2;
      optional string sum = 3;
    +  message Int128 {
    --- End diff --
    
    Let's pull the Int128 out of DecimalStatistics. We will likely use it other places.
    
    One concern with this representation is that -1 is pretty painful. You'll get highBits
= -1, lowBits = -1, which will only take 1 byte for highBits, but 9 bytes for lowBits (+ the
4 bytes of field identifiers & message length) = 14 bytes total. Another alternative is
to use the zigzag encoding for the combined 128 bit value:
    
      optional uint64 minLow = 4;
      optional uint64 minHigh = 5;
    
    p <= 18:
      minLow = zigzag(min)
      minHigh = 0
    
    p > 18:
      minLow = low bits of zigzag(min)
      minHigh = high bits of zigzag(min)
    
    That would have a representation of 1 byte each for minLow and minHigh + 2 bytes for field
identifier = 4. If we leave the Int128 level that would add an additional + 2 bytes. 


---

Mime
View raw message