hadoop-hive-dev mailing list archives

From "he yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Tue, 17 Mar 2009 18:16:50 GMT

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682740#action_12682740 ]

he yongqiang commented on HIVE-352:

Thanks, Joydeep Sen Sarma. Your feedback is really important.

1. Storage schema: block-wise column store or one file per column.
Our current implementation stores each column in its own file. The most annoying part for
us, just as you said, is that HDFS does not, now or in the near future, support
colocating the file segments of different columns of the same table. So some operations need to
fetch data from a new file (like a map-side hash join, or a join with CompositeInputFormat) or
need to add a new map-reduce job to merge the data together. Some operations are pretty good for
I think the block-wise column store is a good point. I will try to implement it soon. With different
columns collocated in a single block, some operations do not need a reduce part (which is really
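
To make the two layouts concrete, here is a minimal in-memory sketch. It is not our implementation and not any Hive or HDFS API: the function names, the use of Python lists as stand-ins for files and blocks, and the rows_per_block parameter are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Row = Sequence[object]


def to_column_files(rows: List[Row], names: List[str]) -> Dict[str, List[object]]:
    """One 'file' (here just a list) per column, holding every row's value."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def to_columnar_blocks(
    rows: List[Row], names: List[str], rows_per_block: int
) -> List[Dict[str, List[object]]]:
    """Split rows into blocks; inside each block, values are stored column by column."""
    blocks = []
    for start in range(0, len(rows), rows_per_block):
        chunk = rows[start:start + rows_per_block]
        blocks.append({name: [row[i] for row in chunk] for i, name in enumerate(names)})
    return blocks


def rows_from_block(block: Dict[str, List[object]], wanted: List[str]) -> List[tuple]:
    """Rebuild projected rows from a single block, without touching any other block."""
    return list(zip(*(block[name] for name in wanted)))
```

In the one-file-per-column layout, rebuilding a row means reading from as many files as there are wanted columns. In the block-wise layout, all columns of a given row range sit in the same block, so rows_from_block only needs that one block.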

2. Compression
With different columns in different files, some lightweight compression schemes, such as RLE, dictionary
and bit-vector encoding, can be used. One benefit of these lightweight compression algorithms
is that some operations do not need to decompress the data.
If we implement the block-wise column storage, should we also let the user specify the lightweight
compression algorithm for each column, or should we choose one (like RLE) internally when the data
clusters well? Since dictionary and bit-vector encoding should also be supported, should the columns
with these compression algorithms also be placed in the block-wise columnar file? I
think placing these columns in separate files would be easier to handle, but I do not know
whether that fits into Hive. I am new to Hive.
Having a number of open codecs can hurt memory usage.
Currently I cannot think of a solution to avoid this for the column-per-file store.
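
As an illustration of an operation that does not need to decompress the data, here is a minimal run-length encoding sketch. The function names are hypothetical, and the (value, run length) pair representation is an assumption; it does not describe our on-disk format.

```python
from typing import List, Tuple

Run = Tuple[object, int]  # (value, number of consecutive repetitions)


def rle_encode(values: List[object]) -> List[Run]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Run] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def rle_decode(runs: List[Run]) -> List[object]:
    """Expand the runs back into the original sequence of values."""
    return [v for v, n in runs for _ in range(n)]


def count_equal(runs: List[Run], target: object) -> int:
    """Count rows equal to target by summing run lengths; no decoding needed."""
    return sum(n for v, n in runs if v == target)
```

count_equal touches one pair per run rather than one value per row, which is why RLE pays off when equal values are clustered together in the column.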

3. File format
Yes, I think we need to add new file formats and their corresponding InputFormats. Currently,
we have implemented VFile (Value File; we do not need to store a key part) and BitMapFile.
We have not implemented a DictionaryFile; instead we use a header file for VFile to store dictionary
entries. The header file for VFile is not needed for some columns, and sometimes it is a must.
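
The idea of keeping dictionary entries in a header next to the values can be sketched as follows. This is only an illustration of dictionary encoding in general: the header and codes lists and both function names are assumptions, not the actual VFile layout.

```python
from typing import Dict, List, Tuple


def dict_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (header, codes): header lists distinct values, codes index into it."""
    header: List[str] = []       # dictionary entries; position in the list is the code
    index: Dict[str, int] = {}   # value -> code, used only while encoding
    codes: List[int] = []
    for v in values:
        code = index.get(v)
        if code is None:
            code = len(header)
            index[v] = code
            header.append(v)
        codes.append(code)
    return header, codes


def dict_decode(header: List[str], codes: List[int]) -> List[str]:
    """Look each code up in the header to recover the original values."""
    return [header[c] for c in codes]
```

A column stored without dictionary encoding would have no use for such a header, while a dictionary-encoded column cannot be decoded without it.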

I think refactoring the file formats should be the starting point for this issue.

Thanks again.

> Make Hive support column based storage
> --------------------------------------
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: he yongqiang
> column based storage has been proven a better storage layout for OLAP. 
> Hive does a great job on raw row oriented storage. In this issue, we will enhance hive
to support column based storage. 
> Actually we have done some work on column based storage on top of HDFS; I think it will
need some review and refactoring to port it to Hive.
> Any thoughts?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
