orc-dev mailing list archives

From dain <...@git.apache.org>
Subject [GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...
Date Thu, 12 Apr 2018 17:38:56 GMT
Github user dain commented on a diff in the pull request:

    https://github.com/apache/orc/pull/245#discussion_r181164570
  
    --- Diff: site/_docs/encodings.md ---
    @@ -109,10 +109,20 @@ DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
     Decimal was introduced in Hive 0.11 with infinite precision (the total
     number of digits). In Hive 0.13, the definition was change to limit
     the precision to a maximum of 38 digits, which conveniently uses 127
    -bits plus a sign bit. The current encoding of decimal columns stores
    -the integer representation of the value as an unbounded length zigzag
    -encoded base 128 varint. The scale is stored in the SECONDARY stream
    -as an signed integer.
    +bits plus a sign bit.
    +
    +DIRECT and DIRECT_V2 encodings of decimal columns stores the integer
    +representation of the value as an unbounded length zigzag encoded base
    +128 varint. The scale is stored in the SECONDARY stream as an signed
    +integer.
    +
    +In ORC 2.0, DECIMAL encoding is introduced and totally remove scale
    +stream as all decimal values use the same scale. When precision is
    +no greater than 18, decimal values can be fully represented by DATA
    +stream which stores 64-bit signed integers. When precision is greater
    +than 18, we use a 128-bit signed integer to store the decimal value.
    +DATA stream stores the higher 64 bits and SECONDARY stream holds the
    +lower 64 bits. Both streams use signed integer RLE v2.
    --- End diff --
    
    Why split the data across two streams? This means two IOs (or one large coalesced IO) to read the values (assuming no nulls). Instead, can't we put all 128 bits in one stream?
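
For readers following the thread, here is a minimal sketch of the split the quoted diff describes: a 128-bit signed value divided into a higher and a lower 64-bit half, one per stream. It is not taken from the ORC code base. The helper names (`split128`, `join128`, `to_signed64`) are hypothetical, and treating the lower half as a two's-complement signed 64-bit integer is an assumption based only on the diff's statement that both streams use signed integer RLE v2.

```python
MASK64 = (1 << 64) - 1  # low 64 bits set


def to_signed64(bits):
    """Reinterpret an unsigned 64-bit pattern as a two's-complement signed int."""
    return bits - (1 << 64) if bits >= (1 << 63) else bits


def split128(value):
    """Split a signed 128-bit integer into (high, low) signed 64-bit halves.

    Assumption: 'high' would go to the DATA stream and 'low' to the
    SECONDARY stream, as described in the quoted diff.
    """
    if not -(1 << 127) <= value < (1 << 127):
        raise ValueError("value does not fit in 128 signed bits")
    high = value >> 64                  # arithmetic shift keeps the sign
    low = to_signed64(value & MASK64)   # raw low bits, viewed as signed
    return high, low


def join128(high, low):
    """Rebuild the 128-bit value from the two halves."""
    return (high << 64) | (low & MASK64)


if __name__ == "__main__":
    unscaled = 10**38 - 1  # largest unscaled value at precision 38
    high, low = split128(unscaled)
    assert join128(high, low) == unscaled
```

Under this layout a reader needs both halves of every value before it can reconstruct a single decimal, which is the cost the comment above is pointing at.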

