hive-dev mailing list archives

From "Krishna Kumar (JIRA)" <>
Subject [jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files
Date Wed, 06 Apr 2011 17:29:05 GMT


Krishna Kumar commented on HIVE-2097:

Comment hijacked from HIVE-2065:

He Yongqiang added a comment - 31/Mar/11 23:13

We examined column groups, sorting the data internally on one column within each column
group. (We did not try different compression codecs across column groups.) We tried this on
3-4 tables and saw ~20% storage savings on one table compared with the previous RCFile. The
main problem with this approach is that it is hard to find the correct, most efficient column
group definitions.
For example, if table tbl_1 has 20 columns, the user can define:


This puts col_1, col_2, col_11, and col_13 into one column group, reordered by sorting on
col_1 (0 is the index of the first column in this column group); puts col_3, col_4, col_15,
and col_16 into another column group, reordered by sorting on col_4; and finally puts all
other columns into the default column group in their original order.
It should also be easy to allow a different compression codec for each column group.
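
A minimal sketch of the grouping described above, in Python rather than Hive's own code. The
(columns, sort index) tuple is a hypothetical stand-in for the definition syntax, which is not
preserved in this archive, and split_into_groups is an illustrative name, not an RCFile API.
The sketch also leaves out how rows are reassembled across groups, which the message does
not describe.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]
# Hypothetical definition: (columns in the group, index of the sort column
# within that group, 0 being the first column of the group).
GroupDef = Tuple[Sequence[str], int]


def split_into_groups(
    rows: List[Row],
    group_defs: List[GroupDef],
    all_columns: Sequence[str],
) -> List[List[tuple]]:
    """Return one block of tuples per column group, then the default group."""
    grouped = set()
    blocks = []
    for columns, sort_idx in group_defs:
        grouped.update(columns)
        block = [tuple(row[c] for c in columns) for row in rows]
        # Reorder this group on its own, sorting on the chosen column.
        block.sort(key=lambda values: values[sort_idx])
        blocks.append(block)
    # Every column not named in a group goes to the default group,
    # kept in the original row order.
    rest = [c for c in all_columns if c not in grouped]
    blocks.append([tuple(row[c] for c in rest) for row in rows])
    return blocks
```

Sorting places similar values next to each other inside a group, which is what gives a
general-purpose codec more redundancy to exploit.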

The main blocking issue for this approach is the lack of a full set of utilities for finding
the best column group definition.
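
One very simple form such a utility could take is to compress a sample of rows under each
candidate definition and keep the smallest result. The sketch below assumes that approach; it
is not an existing Hive tool. The function names, the (columns, sort index) definition format,
and the use of zlib as the codec are all assumptions made for illustration.

```python
import zlib
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, object]
GroupDef = Tuple[Sequence[str], int]  # hypothetical: (columns, sort index)


def group_size(rows: List[Row], columns: Sequence[str],
               sort_idx: Optional[int] = None) -> int:
    """Compressed size in bytes of one column group, stored column by column."""
    block = [tuple(row[c] for c in columns) for row in rows]
    if sort_idx is not None:
        block.sort(key=lambda values: values[sort_idx])
    # zip(*block) turns the rows of the group into one sequence per column.
    payload = b"\0".join(
        "\n".join(str(v) for v in column).encode("utf-8")
        for column in zip(*block)
    )
    return len(zlib.compress(payload))


def layout_size(rows: List[Row], all_columns: Sequence[str],
                group_defs: List[GroupDef]) -> int:
    """Total compressed size of all defined groups plus the default group."""
    grouped = {c for columns, _ in group_defs for c in columns}
    rest = [c for c in all_columns if c not in grouped]
    size = sum(group_size(rows, cols, idx) for cols, idx in group_defs)
    if rest:
        size += group_size(rows, rest)  # default group keeps original order
    return size


def best_layout(rows: List[Row], all_columns: Sequence[str],
                candidates: List[List[GroupDef]]) -> List[GroupDef]:
    """Pick the candidate definition with the smallest compressed size."""
    return min(candidates, key=lambda defs: layout_size(rows, all_columns, defs))
```

This only ranks definitions that someone has already proposed. Generating good candidates
for a 20-column table is the hard part the comment refers to.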

> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>                 Key: HIVE-2097
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas
> 1. More efficient serialization/deserialization based on type-specific and storage-specific knowledge
>    For instance, storing sorted numeric values efficiently using some delta coding techniques
> 2. More efficient compression based on type-specific and storage-specific knowledge
>    Enable compression codecs to be specified based on types or individual columns
> 3. Reordering the on-disk storage for better compression efficiency.
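
Idea 1 in the quoted description mentions delta coding for sorted numeric values. A minimal
sketch of that technique, assuming integer values and using illustrative function names that
are not part of Hive or RCFile:

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each difference from the previous one."""
    deltas = []
    previous = 0
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum of the deltas."""
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


encoded = delta_encode([1000, 1003, 1004, 1010])
print(encoded)                # [1000, 3, 1, 6]
print(delta_decode(encoded))  # [1000, 1003, 1004, 1010]
```

When the input is sorted, the deltas are small non-negative numbers, so they fit in fewer
bytes under a variable-length integer encoding and compress better than the raw values.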

