hive-dev mailing list archives

From "Krishna Kumar (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files
Date Mon, 28 Nov 2011 14:56:40 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158484#comment-13158484 ]

Krishna Kumar commented on HIVE-2097:
-------------------------------------

Thanks Alex for the suggestions.

Just to be sure we are on the same page: I believe you are talking about approach #3 from
the description, which aligns with the ideas in the comment from He Yongqiang. I am
currently working on implementing #1 and #2.

Re #3: column grouping and row reordering are the general idea, but I do not understand
your point about column selectivity. Why should selectivity play a role here, when any
grouping/reordering is done for better compression? There are two effects we can exploit for
better compression within a column group: (a) the values in the two columns are similar, and
(b) the values are correlated, so that conditional probabilities can be used for better
compression. In either case, my hope was that we would be able to create type-specific
compressors for structs, maps, etc. that can exploit these features, i.e., a struct/map acts
as a column group for compression purposes.
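
To make the two effects concrete, here is a rough sketch in Python. It is not Hive or RCFile code: zlib stands in for whatever codec a column group would use, and the column names and data (col_a, col_b, country, currency) are made up for illustration. Effect (a) is shown by compressing two similar columns in one stream versus two separate streams; effect (b) is shown by comparing the empirical entropy of a column with its entropy conditioned on a correlated column.

```python
import math
import random
import zlib
from collections import Counter


def entropy(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def conditional_entropy(xs, ys):
    """Empirical H(Y | X) = H(X, Y) - H(X), in bits per row."""
    return entropy(list(zip(xs, ys))) - entropy(xs)


rng = random.Random(0)

# Effect (a): two columns with similar values (identical here, for simplicity).
# Hypothetical data; each column serializes to roughly 18 KB of text.
col_a = [str(rng.randrange(10**8)) for _ in range(2000)]
col_b = list(col_a)
blob_a = "\n".join(col_a).encode()
blob_b = "\n".join(col_b).encode()

# Compressing each column on its own cannot reuse what the other one contains.
separate = len(zlib.compress(blob_a)) + len(zlib.compress(blob_b))
# Compressing them as one group lets the codec refer back to the first column.
grouped = len(zlib.compress(blob_a + b"\n" + blob_b))

# Effect (b): one column is fully determined by another (an extreme correlation).
country = [rng.choice(["IN", "US", "DE", "JP"]) for _ in range(2000)]
currency_of = {"IN": "INR", "US": "USD", "DE": "EUR", "JP": "JPY"}
currency = [currency_of[c] for c in country]

h_y = entropy(currency)                               # cost of coding currency alone
h_y_given_x = conditional_entropy(country, currency)  # cost once country is known

print("separate:", separate, "grouped:", grouped)
print("H(currency):", h_y, "H(currency | country):", h_y_given_x)
```

In this toy setup the grouped stream should come out noticeably smaller than the two separate ones, and the conditional entropy drops to about zero because currency is a function of country. Real columns would sit somewhere between these extremes, which is why a type-specific compressor for a struct/map would need to model the dependency explicitly.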


                
> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
>                 Key: HIVE-2097
>                 URL: https://issues.apache.org/jira/browse/HIVE-2097
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas
>  
> 1. More efficient serialization/deserialization based on type-specific and storage-specific knowledge.
>    For instance, storing sorted numeric values efficiently using some delta coding techniques
> 2. More efficient compression based on type-specific and storage-specific knowledge
>    Enable compression codecs to be specified based on types or individual columns
> 3. Reordering the on-disk storage for better compression efficiency.
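
As an illustration of the delta coding mentioned under idea 1, here is a minimal sketch in Python. It is an assumption-laden toy, not the serialization used by Hive or RCFile: the helper names (delta_encode, delta_decode, pack), the fixed 8-byte little-endian packing, and the use of zlib as the downstream codec are all chosen only for the example.

```python
import struct
import zlib


def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack(values):
    """Serialize integers as fixed-width signed 64-bit little-endian (illustrative format)."""
    return struct.pack(f"<{len(values)}q", *values)


# Hypothetical sorted column: 5000 ids with a constant stride of 3.
sorted_ids = list(range(1_000_000, 1_000_000 + 3 * 5000, 3))
deltas = delta_encode(sorted_ids)

raw_size = len(zlib.compress(pack(sorted_ids)))
delta_size = len(zlib.compress(pack(deltas)))

print("compressed raw:", raw_size, "compressed deltas:", delta_size)
```

Because the deltas of a sorted column are small and repetitive, a general-purpose codec has far less to encode than with the raw values. A real implementation would more likely pair the deltas with a variable-length integer encoding instead of fixed 8-byte fields.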


        
