hive-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
Date Sun, 21 Jul 2013 20:04:49 GMT

    [ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714795#comment-13714795 ]

Owen O'Malley commented on HIVE-4123:
-------------------------------------

More comments:
* I don't see why bitpack reader/writer are more than static methods that read/write to the
underlying stream. So I would have expected a method like writeInts(long[] data, int offset,
int length, int numBits, OutputStream stream) and the corresponding one for reading.
* Utils.bytesToLongBE should take an input stream rather than a byte[].
* In IntegerCompressionReader:
** I'd write a method to translate the int into an opcode rather than use ordinal.
** It is probably worth remembering that you are in a repeat, so that you don't need to copy
the value N times in short repeat.
** It may be easier to loop through the base values and then run through the patches. You
might even do three loops: unpack the main values, unpack the patches, add the base to each
value.
** For patched base only the base is zigzag encoded. The rest of the values are always positive.
** For delta only the base and base delta are zigzag encoded. 
* In IntegerCompressionWriter:
** You should give more comments about the patched base encoding.
** Instead of sorting for the percentiles, you could keep a count of how many values use each
number of bits.
** Replace the commented-out printlns with LOG.debug calls guarded by LOG.isDebugEnabled()
** flush should use if/then/else to prevent writing the data twice
** the constructor should probably call clear rather than risk having the default values be
different
** in write, just copy the data with System.arraycopy instead of cloning the array
** write should track whether the values are monotonically increasing or decreasing so that
we know if delta applies
** there is a lot of duplication of effort in determine encoding
** if the sequence is both increasing and decreasing, it is constant and we should either
use short literal or delta depending on the length
** delta encoding should return before doing the percentile work
* How much unit test coverage do you have of the new code?
* Have you run the encoder/decoder round trip over the github data to test it?
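
For the first comment above, here is a minimal sketch of bit packing as a pair of static methods over the underlying stream. The writeInts signature is the one proposed in the comment. The class name BitPackSketch, the readInts signature, and the most-significant-bit-first layout are assumptions made for illustration, not what the patch does.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper class; nothing here is taken from the HIVE-4123 patch.
final class BitPackSketch {
  private BitPackSketch() {}

  /** Writes the low numBits bits of each value, most significant bit first. */
  static void writeInts(long[] data, int offset, int length,
                        int numBits, OutputStream stream) throws IOException {
    int current = 0;   // byte being assembled
    int bitsLeft = 8;  // free bits remaining in current
    for (int i = offset; i < offset + length; i++) {
      long value = data[i];
      int bitsToWrite = numBits;
      while (bitsToWrite > bitsLeft) {
        // Fill the rest of the current byte with the top pending bits.
        current |= (int) ((value >>> (bitsToWrite - bitsLeft)) & ((1 << bitsLeft) - 1));
        bitsToWrite -= bitsLeft;
        stream.write(current);
        current = 0;
        bitsLeft = 8;
      }
      bitsLeft -= bitsToWrite;
      current |= (int) ((value & ((1L << bitsToWrite) - 1)) << bitsLeft);
      if (bitsLeft == 0) {
        stream.write(current);
        current = 0;
        bitsLeft = 8;
      }
    }
    if (bitsLeft != 8) {
      stream.write(current);  // final partial byte, padded with zero bits
    }
  }

  /** Reads length values of numBits bits each into data, starting at offset. */
  static void readInts(long[] data, int offset, int length,
                       int numBits, InputStream stream) throws IOException {
    int current = 0;   // last byte read
    int bitsLeft = 0;  // unread bits remaining in current
    for (int i = offset; i < offset + length; i++) {
      long result = 0;
      int bitsToRead = numBits;
      while (bitsToRead > bitsLeft) {
        // Consume whatever is left of the current byte, then fetch the next.
        result <<= bitsLeft;
        result |= current & ((1 << bitsLeft) - 1);
        bitsToRead -= bitsLeft;
        current = stream.read();
        if (current < 0) {
          throw new EOFException("stream ended inside a bit-packed run");
        }
        bitsLeft = 8;
      }
      if (bitsToRead > 0) {
        result <<= bitsToRead;
        bitsLeft -= bitsToRead;
        result |= (current >>> bitsLeft) & ((1 << bitsToRead) - 1);
      }
      data[i] = result;
    }
  }
}
```

Because both methods keep their state in local variables, no reader or writer object is needed; the caller supplies the stream and the bit width for each run.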

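On keeping a count per bit width instead of sorting for the percentiles, a rough sketch follows. It assumes the values are already non-negative (for example after zigzag encoding or after subtracting the base). The names BitWidthHistogramSketch, bitsNeeded, and percentileBits are hypothetical, and the percentile definition used here is only one reasonable choice.

```java
// Hypothetical helper; illustrates the counting idea only.
final class BitWidthHistogramSketch {
  private BitWidthHistogramSketch() {}

  /** Bits needed for a non-negative value; 0 needs no bits, negatives report 64. */
  static int bitsNeeded(long value) {
    return 64 - Long.numberOfLeadingZeros(value);
  }

  /**
   * Returns the smallest bit width such that at most floor(length * (1 - p))
   * values need more bits than that. One pass over the data, no sorting.
   */
  static int percentileBits(long[] data, int offset, int length, double p) {
    int[] counts = new int[65];  // counts[w] = number of values needing w bits
    for (int i = offset; i < offset + length; i++) {
      counts[bitsNeeded(data[i])]++;
    }
    int budget = (int) (length * (1.0 - p));  // values allowed to exceed the width
    for (int bits = 64; bits > 0; bits--) {
      budget -= counts[bits];
      if (budget < 0) {
        return bits;
      }
    }
    return 0;
  }
}
```

The 65-entry array is filled once per run, so several percentiles (say the 90th and the 100th) can be answered from the same counts.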

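On tracking monotonicity in write, a small sketch of the bookkeeping. The class and method names are hypothetical. "Increasing" here means non-decreasing (and "decreasing" means non-increasing), which is what makes a constant run show up as both.

```java
// Hypothetical tracker; not taken from IntegerCompressionWriter.
final class MonotonicTrackerSketch {
  private boolean increasing = true;  // no value so far was smaller than its predecessor
  private boolean decreasing = true;  // no value so far was larger than its predecessor
  private long previous;
  private int count;

  /** Call once per value as it is buffered. */
  void add(long value) {
    if (count > 0) {
      if (value < previous) {
        increasing = false;
      }
      if (value > previous) {
        decreasing = false;
      }
    }
    previous = value;
    count++;
  }

  /** True when every value seen so far is identical. */
  boolean isConstant() {
    return increasing && decreasing;
  }

  /** True when the run is monotonic in either direction, so delta is a candidate. */
  boolean deltaApplies() {
    return increasing || decreasing;
  }

  /** Resets the tracker for the next run. */
  void clear() {
    increasing = true;
    decreasing = true;
    count = 0;
  }
}
```

With the flags maintained incrementally, the encoding decision can check isConstant() and deltaApplies() first and return before any percentile work is done.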
                
> The RLE encoding for ORC can be improved
> ----------------------------------------
>
>                 Key: HIVE-4123
>                 URL: https://issues.apache.org/jira/browse/HIVE-4123
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>            Reporter: Owen O'Malley
>            Assignee: Prasanth J
>         Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, ORC-Compression-Ratio-Comparison.xlsx
>
>
> The run length encoding of integers can be improved:
> * tighter bit packing
> * allow delta encoding
> * allow longer runs

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
