orc-dev mailing list archives

From "Matt McCline (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-209) Improve Decimal Serialization/Deserialization
Date Wed, 12 Jul 2017 22:41:00 GMT
Matt McCline created ORC-209:

             Summary: Improve Decimal Serialization/Deserialization
                 Key: ORC-209
                 URL: https://issues.apache.org/jira/browse/ORC-209
             Project: ORC
          Issue Type: Bug
            Reporter: Matt McCline
            Assignee: Matt McCline
            Priority: Critical

Currently, ORC serializes HiveDecimal in a special binary byte format as the "value"
stream, with a secondary stream that holds the scale of each decimal.  Trailing zeroes are
removed from each decimal, so the scale can vary from value to value.  This format is
inefficient in both CPU and storage space (i.e. compression).

The decimal type has a fixed precision and scale.  Gopal/Prasanth/Owen have suggested storing
the decimals with their trailing zeroes (so the scale is a constant value for the file, taken
from the metadata) and storing them as an integer stream that can benefit from run-length
encoding, compression, etc.
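
To make the difference concrete, here is a minimal sketch of the two representations. It is
not ORC's writer code and does not reproduce the actual byte layout; the function names
to_variable_scale and to_fixed_scale are hypothetical, and Python's decimal module stands in
for HiveDecimal.

```python
from decimal import Decimal


def _unscaled(d):
    """Return the integer formed by the digits of d, with its sign applied."""
    sign, digits, _ = d.as_tuple()
    magnitude = int("".join(str(x) for x in digits))
    return -magnitude if sign else magnitude


def to_variable_scale(value):
    """Roughly the current idea: strip trailing zeroes, keep a per-value scale."""
    n = value.normalize()
    if n.as_tuple().exponent > 0:
        # normalize() turns 100 into 1E+2; keep it as an integer with scale 0.
        n = n.quantize(Decimal(1))
    return _unscaled(n), -n.as_tuple().exponent


def to_fixed_scale(value, scale):
    """Roughly the proposed idea: pad to the column scale, keep one integer."""
    # Assumption: value already fits the column scale; extra digits would be rounded.
    q = value.quantize(Decimal(1).scaleb(-scale))
    return _unscaled(q)


values = [Decimal("1.50"), Decimal("2.00"), Decimal("3.25")]

# Two streams are needed: the unscaled values and a scale for each one.
print([to_variable_scale(v) for v in values])  # [(15, 1), (2, 0), (325, 2)]

# One integer stream; the scale (2) is stored once in the metadata.
print([to_fixed_scale(v, 2) for v in values])  # [150, 200, 325]
```

With a fixed scale the per-value scale stream disappears, and the remaining integers have the
same magnitude relationship as the original decimals, which is what lets an integer
run-length encoder work on them.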
