orc-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Douglas Drinka (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-161) Create a new column type that run-length-encodes decimals
Date Thu, 16 Mar 2017 15:49:41 GMT
Douglas Drinka created ORC-161:

             Summary: Create a new column type that run-length-encodes decimals
                 Key: ORC-161
                 URL: https://issues.apache.org/jira/browse/ORC-161
             Project: ORC
          Issue Type: Wish
          Components: encoding
            Reporter: Douglas Drinka

I'm storing prices in ORC format, and have made the following observations about the current
decimal implementation:
- The encoding is inefficient: my prices are a walking-random set, plus or minus a few pennies
per data point. This would encode beautifully with a patched base encoding.  Instead I'm averaging
4 bytes per data point, after Zlib.
- Everyone acknowledges that it's nice to be able to store huge numbers in decimal columns,
but that you probably won't.  Presto, for instance, has a fast-path which engages for precision
of 18 or less, and decodes to 64-bit longs, and then a slow path which uses BigInt.  I anticipate
the majority of implementations fit the decimal(18,6) use case.
- The whole concept of precision/scale, along with a dedicated scale per data point is messy.
 Sometimes it's checked on data ingest, other times its an error on reading, or else it's
cast (and rounded?)

I don't propose eliminating the current column type.  It's nice to know there's a way to store
really big numbers (or really accurate numbers) if I need that in the future.

But I'd like to see a new column that uses the existing Run Length Encoding functionality,
and is limited to 63+1 bit numbers, with a fixed precision and scale for ingest and query.

I think one could call this FixedPoint.  Every number is stored as a long, and scaled by a
column constant.  Ingest from decimal would scale and throw or round, configurably.  Precision
would be fixed at 18, or made configurable and verified at ingest.  Stats would use longs
(scaled with the column) rather than strings.

Anyone can opt in to faster, smaller data sets, if they're ok with 63+1 bits of precision.
 Or they can keep using decimal if they need 128 bits.  Win/win?

This message was sent by Atlassian JIRA

View raw message