hive-dev mailing list archives

From "Prasanth J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
Date Tue, 16 Jul 2013 21:58:50 GMT

    [ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710342#comment-13710342 ]

Prasanth J commented on HIVE-4123:
----------------------------------

This patch improves on the existing run-length encoding for integers. As mentioned in the
description, it uses bit packing for tighter compression, improves run-length and delta
encoding, and supports longer runs.

This patch supports the following lightweight compression techniques:

*SHORT_REPEAT*
*DIRECT*
*PATCHED_BASE*
*DELTA*


The description and format of each type are given below:

*SHORT_REPEAT:* Used for short repeated integer sequences.
* 1 byte header
** 2 bits for encoding type
** 3 bits for bytes required for repeating value
** 3 bits for repeat count (MIN_REPEAT + run length)
* Blob - repeat value (fixed bytes)
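
As a rough illustration of the SHORT_REPEAT layout above, here is a minimal Python sketch. The comment does not give the bit order, the numeric encoding id, the value of MIN_REPEAT, or how the byte width is stored, so the sketch assumes: fields packed MSB-first in the order listed, encoding id 0, MIN_REPEAT = 3, width stored as (bytes - 1), and a big-endian value. The function names are hypothetical and not taken from the patch.

```python
from typing import List

# Assumptions, not stated in the comment: encoding id 0, MIN_REPEAT = 3,
# MSB-first field order, width stored as (bytes - 1), big-endian value.
SHORT_REPEAT = 0
MIN_REPEAT = 3


def encode_short_repeat(value: int, count: int) -> bytes:
    """Pack one short run of `count` copies of a non-negative `value`."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    if not MIN_REPEAT <= count <= MIN_REPEAT + 7:
        raise ValueError("repeat count does not fit in 3 bits")
    width = max(1, (value.bit_length() + 7) // 8)  # bytes needed for the value
    if width > 8:
        raise ValueError("value does not fit in 8 bytes")
    # 2 bits encoding type | 3 bits byte width | 3 bits repeat count
    header = (SHORT_REPEAT << 6) | ((width - 1) << 3) | (count - MIN_REPEAT)
    return bytes([header]) + value.to_bytes(width, "big")


def decode_short_repeat(buf: bytes) -> List[int]:
    """Inverse of encode_short_repeat under the same assumptions."""
    header = buf[0]
    width = ((header >> 3) & 0x7) + 1
    count = (header & 0x7) + MIN_REPEAT
    value = int.from_bytes(buf[1:1 + width], "big")
    return [value] * count
```

Under these assumptions, five repetitions of 10000 take three bytes in total: one header byte plus a two-byte value.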

*DIRECT:* Used for random integer sequences whose values all need roughly the same number
of bits.
* 2 bytes header
** 1st byte
*** 2 bits for encoding type
*** 5 bits for fixed bit width of values in blob
*** 1 bit for storing MSB of run length
** 2nd byte
*** 8 bits for lower run length bits
* Blob - fixed width * run length bits long
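
The two-byte header and the fixed-width blob can be sketched in Python as follows. This is only an illustration under assumptions: fields are packed MSB-first in the order listed, and the 5-bit width field and 9-bit run length are stored as raw codes (the comment does not say how the patch maps real bit widths or lengths onto those fields). `pack_header`, `unpack_header`, `pack_bits` and `blob_size_bytes` are hypothetical names.

```python
from typing import Iterable, Tuple


def pack_header(encoding: int, width_code: int, run_length: int) -> bytes:
    """Build the 2-byte header: 2 + 5 + 1 bits, then 8 low run-length bits."""
    if not 0 <= encoding < 4:
        raise ValueError("encoding type must fit in 2 bits")
    if not 0 <= width_code < 32:
        raise ValueError("width code must fit in 5 bits")
    if not 0 <= run_length < 512:
        raise ValueError("run length must fit in 9 bits")
    first = (encoding << 6) | (width_code << 1) | (run_length >> 8)
    second = run_length & 0xFF
    return bytes([first, second])


def unpack_header(buf: bytes) -> Tuple[int, int, int]:
    """Return (encoding, width_code, run_length) from the first two bytes."""
    first, second = buf[0], buf[1]
    return first >> 6, (first >> 1) & 0x1F, ((first & 0x1) << 8) | second


def pack_bits(values: Iterable[int], width: int) -> bytes:
    """Concatenate unsigned values at `width` bits each, zero-padded to a byte."""
    acc = 0
    n = 0
    for v in values:
        if v < 0 or v >> width:
            raise ValueError("value does not fit in the fixed width")
        acc = (acc << width) | v
        n += 1
    total_bits = width * n
    pad = (-total_bits) % 8  # padding bits needed to reach a byte boundary
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def blob_size_bytes(width: int, run_length: int) -> int:
    """Blob is (fixed width * run length) bits long; round up to whole bytes."""
    return (width * run_length + 7) // 8
```

For example, three 3-bit values occupy 9 bits, so `blob_size_bytes(3, 3)` is 2 and `pack_bits([1, 2, 3], 3)` returns two bytes.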

*PATCHED_BASE:* Used for random integer sequences in which the number of bits required per
value varies beyond a threshold.
* 4 bytes header
** 1st byte
*** 2 bits for encoding type
*** 5 bits for fixed bit width of values in blob
*** 1 bit for storing MSB of run length
** 2nd byte
*** 8 bits for lower run length bits
** 3rd byte
*** 3 bits for bytes required for base value
*** 5 bits for patch width
** 4th byte
*** 3 bits for patch gap width
*** 5 bits for patch length
* Base value - base width * 8 bits
* Data blob - fixed width * run length
* Patch blob - (patch width + patch gap width) * patch length
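
To make the four-byte header easier to follow, the sketch below splits it into the fields listed above and applies the three size formulas. It assumes MSB-first packing in the order listed and treats every field as a raw number; the patch may well map the 5-bit codes to actual bit widths differently, which the comment does not describe. `PatchedBaseHeader`, `parse_patched_base_header` and `section_sizes_bits` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class PatchedBaseHeader(NamedTuple):
    encoding: int          # 2 bits
    width_code: int        # 5 bits, fixed bit width of values in the data blob
    run_length: int        # 9 bits (1 MSB in byte 1, 8 bits in byte 2)
    base_bytes: int        # 3 bits, bytes required for the base value
    patch_width: int       # 5 bits
    patch_gap_width: int   # 3 bits
    patch_length: int      # 5 bits


def parse_patched_base_header(buf: bytes) -> PatchedBaseHeader:
    """Split the 4 header bytes into fields, assuming MSB-first packing."""
    if len(buf) < 4:
        raise ValueError("PATCHED_BASE header needs 4 bytes")
    b0, b1, b2, b3 = buf[0], buf[1], buf[2], buf[3]
    return PatchedBaseHeader(
        encoding=b0 >> 6,
        width_code=(b0 >> 1) & 0x1F,
        run_length=((b0 & 0x1) << 8) | b1,
        base_bytes=b2 >> 5,
        patch_width=b2 & 0x1F,
        patch_gap_width=b3 >> 5,
        patch_length=b3 & 0x1F,
    )


def section_sizes_bits(h: PatchedBaseHeader) -> Tuple[int, int, int]:
    """Sizes in bits of (base value, data blob, patch blob).

    Assumption: raw header fields are used directly as widths and lengths.
    """
    base = h.base_bytes * 8
    data = h.width_code * h.run_length
    patch = (h.patch_width + h.patch_gap_width) * h.patch_length
    return base, data, patch
```

A reader would parse the header first and then use the three sizes to locate the base value, the data blob and the patch blob that follow it.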

*DELTA:* Used for monotonically increasing or decreasing sequences, sequences with fixed delta
values or long repeated sequences.
* 2 bytes header
** 1st byte
*** 2 bits for encoding type
*** 5 bits for fixed bit width of values in blob
*** 1 bit for storing MSB of run length
** 2nd byte
*** 8 bits for lower run length bits
* Base value - encoded as varint
* Delta base (only long fixed delta runs) - zigzag encoded
* Delta blob (variable delta runs) - zigzag encoded
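
The varint and zigzag steps named above can be sketched as follows. The comment does not say which varint variant the patch uses, so this assumes the common little-endian base-128 form (7 payload bits per byte, high bit set on all but the last byte) and 64-bit signed inputs for zigzag. `base_and_deltas` is a hypothetical helper showing how a sequence splits into a base value and deltas.

```python
from typing import List, Sequence, Tuple


def zigzag_encode(n: int) -> int:
    """Map signed to unsigned: 0, -1, 1, -2, 2 ... become 0, 1, 2, 3, 4 ..."""
    return (n << 1) ^ (n >> 63)  # assumes n fits in a signed 64-bit integer


def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)


def varint_encode(u: int) -> bytes:
    """Base-128 varint, least significant group first (assumed variant)."""
    if u < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_decode(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position after the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def base_and_deltas(seq: Sequence[int]) -> Tuple[int, List[int], bool]:
    """Split a sequence into base value, deltas, and a fixed-delta flag."""
    deltas = [b - a for a, b in zip(seq, seq[1:])]
    fixed = len(set(deltas)) <= 1
    return seq[0], deltas, fixed
```

For the sequence 10, 8, 6, 4 the helper reports a base of 10 and a fixed delta of -2, which zigzag-encodes to 3; a sequence with varying deltas would instead need one zigzag-encoded delta per step.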

I have compared the new implementation against the current one on various real-world
datasets; the compression ratios for both are shown in the attached Excel sheet. As the
comparison sheet shows, the new implementation gives a significant improvement in
compression ratio over the existing implementation in most cases.

NOTE: This patch was generated against trunk after applying the HIVE-4724 patch.

[~owen.omalley] can you please review this patch and let me know your comments? Also
let me know if I need to upload this patch to Phabricator.


                
> The RLE encoding for ORC can be improved
> ----------------------------------------
>
>                 Key: HIVE-4123
>                 URL: https://issues.apache.org/jira/browse/HIVE-4123
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> The run length encoding of integers can be improved:
> * tighter bit packing
> * allow delta encoding
> * allow longer runs

