orc-dev mailing list archives

From "Tadahito Kobayashi (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-468) Fix incorrect documentation for nanoseconds stream encoding
Date Wed, 13 Feb 2019 05:47:00 GMT
Tadahito Kobayashi created ORC-468:
--------------------------------------

             Summary: Fix incorrect documentation for nanoseconds stream encoding
                 Key: ORC-468
                 URL: https://issues.apache.org/jira/browse/ORC-468
             Project: ORC
          Issue Type: Bug
          Components: documentation
            Reporter: Tadahito Kobayashi


According to the ORC specification, "1000 nanoseconds would be serialized as 0x0b and 100000
would be serialized as 0x0d."
However, the actual encoding results are formatNano(1000) = 0x0a and formatNano(100000) =
0x0c.

How about changing the documentation as follows?

"Because the number of nanoseconds often has a large number of trailing zeros, the number
has trailing decimal zero digits removed and the last three bits are used to record how many
zeros were removed {color:#FF0000}if the trailing zeros are more than 2{color}. Thus 1000
nanoseconds would be serialized as {color:#FF0000}0x0a{color} and 100000 would be serialized
as {color:#FF0000}0x0c{color}."
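To make the off-by-one concrete, here is a minimal decoding sketch. It assumes the low three bits hold the number of removed zeros minus one, which is what the formatNano output in this report implies. The names decodeNano and selfCheck are hypothetical and are not taken from the ORC sources; this is only meant to illustrate the arithmetic, not to reproduce ORC's reader.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not ORC's reader code): inverse of the formatNano
// function quoted in this report.
// Assumption: a non-zero value k in the low three bits means that k + 1
// trailing decimal zeros were removed; 0 means nothing was removed.
int64_t decodeNano(int64_t serialized) {
  int64_t zeros = serialized & 7;   // stored count (removed zeros - 1)
  int64_t nanos = serialized >> 3;  // value with trailing zeros stripped
  if (zeros != 0) {
    // Restore k + 1 zeros: the loop runs for i = 0 .. zeros inclusive.
    for (int64_t i = 0; i <= zeros; ++i) {
      nanos *= 10;
    }
  }
  return nanos;
}

// Checks the values discussed in this report.
void selfCheck() {
  assert(decodeNano(0x0a) == 1000);    // proposed documentation value
  assert(decodeNano(0x0c) == 100000);  // proposed documentation value
  assert(decodeNano(0x0b) == 10000);   // the currently documented 0x0b is 10000, not 1000
  assert(decodeNano(0x50) == 10);      // a single trailing zero is not stripped
}
```

Under this reading, the currently documented 0x0b decodes to 10000 rather than 1000, which matches the mismatch described above.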


Below are my test and its output, which confirm the nanoseconds encoding.

 
{code:java}
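// This is ORC's serialization code in ColumnWriter.cc; ORC encodes nanoseconds with this function.
// https://github.com/apache/orc/blob/master/c%2B%2B/src/ColumnWriter.cc#L1669
#include <cinttypes>
#include <cstdint>
#include <cstdio>

static int64_t formatNano(int64_t nanos) {
  if (nanos == 0) {
    return 0;
  }
  else if (nanos % 100 != 0) {
    // Fewer than two trailing zeros: store the value as-is, low three bits stay 0.
    return (nanos) << 3;
  }
  else {
    // Two or more trailing zeros: strip them and record (zeros removed - 1) in the low three bits.
    nanos /= 100;
    int64_t trailingZeros = 1;
    while (nanos % 10 == 0 && trailingZeros < 7) {
      nanos /= 10;
      trailingZeros += 1;
    }
    return (nanos) << 3 | trailingZeros;
  }
}

int main()
{
  // Print the encoding of each power of ten from 1 to 1000000.
  for (int64_t nano = 1; nano <= 1000000; nano *= 10) {
    printf("formatNano(%" PRId64 ") = 0x%02" PRIx64 "\n", nano, formatNano(nano));
  }
  return 0;
}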
{code}
 

The result:
{code:java}
formatNano(1) = 0x08
formatNano(10) = 0x50
formatNano(100) = 0x09
formatNano(1000) = 0x0a
formatNano(10000) = 0x0b
formatNano(100000) = 0x0c
formatNano(1000000) = 0x0d
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
