hive-dev mailing list archives

From "He Yongqiang (JIRA)" <>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Wed, 25 Mar 2009 09:16:53 GMT


He Yongqiang commented on HIVE-352:

Thanks, Joydeep and Prasad.
First I would like to give an update on the recent work:
I had implemented an initial RCFile which was just a wrapper around SequenceFile, and it relied
on Hadoop-5553. Since it seems Hadoop-5553 will not be resolved, I have implemented another
RCFile, which copies much of the code from SequenceFile (especially the Writer code) and provides
the same on-disk data layout as SequenceFile.

Here is a draft description of the new RCFile:
1) Record compression only, or no compression at all.
    In B2.2 we store a bunch of raw rows in one record in a columnar way, so there is no
need for block compression; block compression would decompress all the data even when only a few columns are needed.
2) In-record compression.
    If the writer is created with the compress flag, then the value part of a record is compressed,
but in a column-wise style. The layout is as follows:

Record length
Key length
{begin of the key part}
column_1_ondisk_length(vint), column_1_row_1_value_plain_length, column_1_row_2_value_plain_length, ...
column_2_ondisk_length(vint), column_2_row_1_value_plain_length, column_2_row_2_value_plain_length, ...
{end of the key part}
{begin of the value part}
compressed or plain data of [column_1_row_1_value, column_1_row_2_value, ...]
compressed or plain data of [column_2_row_1_value, column_2_row_2_value, ...]
{end of the value part}

The key part: KeyBuffer
The value part: ValueBuffer
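To make the layout above concrete, here is a minimal, self-contained sketch of how one such record could be laid out in memory. This is not the actual RCFile writer: the class and method names are hypothetical, and for brevity the vint encoding is replaced with fixed 4-byte big-endian ints.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch of the in-record columnar layout described above.
// columns[c][r] holds the raw bytes of row r in column c. No compression
// is applied here; the "ondisk" length simply equals the plain length sum.
public class RecordLayoutSketch {
    static byte[] layout(byte[][][] columns) {
        try {
            ByteArrayOutputStream key = new ByteArrayOutputStream();
            DataOutputStream keyOut = new DataOutputStream(key);
            ByteArrayOutputStream value = new ByteArrayOutputStream();
            for (byte[][] col : columns) {
                int onDisk = 0;
                for (byte[] cell : col) onDisk += cell.length;
                keyOut.writeInt(onDisk);                          // column_N_ondisk_length
                for (byte[] cell : col) keyOut.writeInt(cell.length); // per-row plain lengths
                for (byte[] cell : col) value.write(cell);        // contiguous column bytes
            }
            ByteArrayOutputStream record = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(record);
            out.writeInt(key.size() + value.size());              // record length
            out.writeInt(key.size());                             // key length
            key.writeTo(record);                                  // the key part
            value.writeTo(record);                                // the value part
            return record.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams never actually throw
        }
    }
}
```

Because all values of a column sit contiguously in the value part, a reader that wants only some columns can skip whole byte ranges using the ondisk lengths in the key part.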

3) The reader

It currently provides only two APIs:
next(LongWritable rowID): returns the next row id. I think it should be refined, because
the row id may not be a real row id; it is only the number of rows already passed since the beginning
of the reader.

List<Bytes> getCurrentRow(): returns the raw bytes of all columns of one row. Because
the reader lets the user specify the column ids that should be skipped, the returned List<Bytes>
only contains the bytes of the unskipped columns. Maybe it would be better to store a NullBytes in the
returned list to represent each skipped column.
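The reader contract above can be sketched with a small in-memory stand-in. All names here are hypothetical: it reads from a plain list instead of an RCFile, and it adopts the NullBytes-style suggestion by returning null placeholders for skipped columns so that column positions stay stable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for the reader API described above:
// next() advances to the next row and returns the count of rows passed so far,
// getCurrentRow() returns the raw bytes of the unskipped columns, with null
// marking each skipped column (the suggested NullBytes placeholder).
public class ReaderSketch {
    private final List<byte[][]> rows;  // rows.get(r)[c] = raw bytes of column c
    private final Set<Integer> skipped; // column ids the caller asked to skip
    private int rowId = -1;

    ReaderSketch(List<byte[][]> rows, Set<Integer> skipped) {
        this.rows = rows;
        this.skipped = skipped;
    }

    /** Advances to the next row; returns its id (rows passed so far), or -1 at EOF. */
    long next() {
        if (rowId + 1 >= rows.size()) return -1;
        return ++rowId;
    }

    /** Returns the current row; skipped columns appear as null placeholders. */
    List<byte[]> getCurrentRow() {
        byte[][] row = rows.get(rowId);
        List<byte[]> out = new ArrayList<>();
        for (int c = 0; c < row.length; c++)
            out.add(skipped.contains(c) ? null : row[c]);
        return out;
    }
}
```

Keeping a placeholder per skipped column means callers can index the returned list by the original column id rather than recomputing positions after skips.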

> Make Hive support column based storage
> --------------------------------------
>                 Key: HIVE-352
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive
to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will
need some review and refactoring to port it to Hive.
> Any thoughts?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
