orc-dev mailing list archives

From "Douglas Drinka (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ORC-144) PATCHED BASE Documentation Issues
Date Wed, 08 Feb 2017 22:53:41 GMT
Douglas Drinka created ORC-144:

             Summary: PATCHED BASE Documentation Issues
                 Key: ORC-144
                 URL: https://issues.apache.org/jira/browse/ORC-144
             Project: Orc
          Issue Type: Bug
          Components: documentation
            Reporter: Douglas Drinka
            Priority: Minor

The documentation for Patched Base encoding has two issues.

First is a repeat of "Data values (W * L bits padded to the byte)..." in the data field description.

Second is the example given.  The sample data for each of the other encoding formats actually
triggers its encoder based on the logic in the java code.  However, this example sequence
is too short to trigger either the 90% cutoff for non-rebased data ((1.0-.9)*10 = 0.99999999999999978,
which floors to 0) or the 95% cutoff for rebased data.  At least 20 values are needed for
a single patch to occur.
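
The cutoff arithmetic can be reproduced with a short sketch. It assumes the encoder
truncates length * (1.0 - p) to an integer to decide how many values may sit above the
percentile width. The function names are hypothetical and this is not the ORC code itself.

```python
def allowed_outliers(length, p):
    """Values allowed above the p-th percentile width (assumed: truncating cast)."""
    return int(length * (1.0 - p))


def min_length_for_one_outlier(p):
    """Smallest run length at which at least one value may exceed the percentile."""
    length = 1
    while allowed_outliers(length, p) < 1:
        length += 1
    return length


# (1.0 - 0.9) * 10 is slightly below 1.0 in double precision, so it truncates to 0.
ten_value_budget = allowed_outliers(10, 0.9)

# The 95% cutoff is the stricter one and sets the minimum run length for a patch.
min_len_90 = min_length_for_one_outlier(0.9)
min_len_95 = min_length_for_one_outlier(0.95)
```

Under that assumption, a 10-value run leaves no room for an outlier at either cutoff, and
the 95% cutoff first allows one at a run length of 20.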

I propose the following sequence:
[2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070, 2080, 2090, 2100, 2110, 2120, 2130, 2140,
2150, 2160, 2170, 2180, 2190]

Which encodes to [0x8e, 0x13, 0x2b, 0x21, 0x07, 0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32,
0x3c, 0x46, 0x50, 0x5a, 0x64, 0x6e, 0x78, 0x82, 0x8c, 0x96, 0xa0, 0xaa, 0xb4, 0xbe, 0xfc,
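
As a rough cross-check of the data bytes listed above, the sketch below subtracts the
minimum from each proposed value and keeps the low 8 bits. The 8-bit value width is an
assumption read off the header byte 0x8e, and the variable names are illustrative only.

```python
values = [2030, 2000, 2020, 1000000, 2040, 2050, 2060, 2070, 2080, 2090,
          2100, 2110, 2120, 2130, 2140, 2150, 2160, 2170, 2180, 2190]

WIDTH = 8  # assumed bit width of each stored value

base = min(values)                    # base that every value is reduced against
reduced = [v - base for v in values]  # non-negative offsets from the base

# Low WIDTH bits of each offset are what the data section stores.
low_bytes = [r & ((1 << WIDTH) - 1) for r in reduced]

# Offsets that do not fit in WIDTH bits need a patch: (index, remaining high bits).
patches = [(i, r >> WIDTH) for i, r in enumerate(reduced) if r >> WIDTH]
```

With these inputs the base is 2000 (0x07d0), the low bytes line up with the 20 bytes from
0x1e through 0xbe, and only the value at index 3 needs a patch.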

Then in the description the wording should be "a length of 20 (19)".
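
The "20 (19)" wording can be checked against the first bytes of the encoding with a small
header parser. The field layout used here (2-bit encoding, 5-bit width code, 9-bit
length stored minus one, 3-bit base width, 5-bit patch width code, 3-bit patch gap width,
5-bit patch list length) is my reading of the ORC specification, and the function name is
hypothetical. Treat it as a sketch rather than a reference decoder.

```python
def parse_patched_base_header(data):
    """Split the 4-byte header into its fields and read the base value.

    Width codes are returned raw; mapping them to bit widths is left out.
    Sign handling for the base is omitted because the example base is positive.
    """
    b0, b1, b2, b3 = data[0], data[1], data[2], data[3]
    header = {
        "encoding": b0 >> 6,                       # 2 means patched base
        "width_code": (b0 >> 1) & 0x1F,            # 5-bit encoded value width
        "length": (((b0 & 0x01) << 8) | b1) + 1,   # stored as length - 1
        "base_bytes": (b2 >> 5) + 1,               # stored as bytes - 1
        "patch_width_code": b2 & 0x1F,             # 5-bit encoded patch width
        "patch_gap_bits": (b3 >> 5) + 1,           # stored as bits - 1
        "patch_count": b3 & 0x1F,
    }
    start = 4
    end = start + header["base_bytes"]
    header["base"] = int.from_bytes(bytes(data[start:end]), "big")
    return header


# First six bytes of the proposed encoding: header plus the two base bytes.
header = parse_patched_base_header([0x8e, 0x13, 0x2b, 0x21, 0x07, 0xd0])
```

Here the second byte, 0x13, is the stored 19, which the parser reports as a length of 20.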

These samples were critical for me to verify my code, and I appreciated them being provided,
particularly since I didn't find any unit tests available in the java code to directly compare
byte outputs of the encoders.
