hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Wed, 29 Apr 2009 05:41:30 GMT

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703993#action_12703993 ]

Zheng Shao commented on HIVE-352:
---------------------------------

The following numbers are all for a 128MB gzip-compressed block (for seqfile; the block is 20% smaller
for rcfile because of the different compression ratio):
A. Read from seqfile + Write to seqfile: 2m 05s
B. Read from seqfile + Write to rcfile: 2m 45s
C. Read from rcfile + Write to seqfile: 2m 20s
D. Read from rcfile + Write to rcfile: 3m 00s

@Joydeep: The good compression ratio is mainly because we are compressing column length and
column data (without delimiters) separately.  In an earlier experiment I did, column-based
compression only showed 7-8% improvements because I was compressing column data with delimiters.
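
A rough way to try the two layouts on your own data is sketched below. It is only an illustration, not the RCFile code: the class and method names are made up, the 0x01 delimiter is an assumption, and lengths are written as fixed 4-byte ints rather than the variable-length ints RCFile uses. It gzips one column either as a single delimited stream or as two streams (lengths and raw bytes) and reports both sizes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class ColumnCompressionSketch {

    /** Gzip a byte array and return the compressed size in bytes. */
    static int gzipSize(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    /** Layout 1: column values followed by a delimiter byte, compressed as one stream. */
    static int delimitedSize(String[] values) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (String v : values) {
            byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
            buf.write(bytes, 0, bytes.length);
            buf.write(0x01); // assumed delimiter, for illustration only
        }
        return gzipSize(buf.toByteArray());
    }

    /** Layout 2: value lengths and raw value bytes compressed as two separate streams. */
    static int separateSize(String[] values) {
        ByteBuffer lengths = ByteBuffer.allocate(4 * values.length);
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        for (String v : values) {
            byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
            lengths.putInt(bytes.length); // fixed-width here; RCFile uses a variable-length int
            data.write(bytes, 0, bytes.length);
        }
        return gzipSize(lengths.array()) + gzipSize(data.toByteArray());
    }

    public static void main(String[] args) {
        // Synthetic column; real ratios depend entirely on the data.
        String[] column = new String[10000];
        for (int i = 0; i < column.length; i++) {
            column[i] = "user" + (i % 97);
        }
        System.out.println("delimited: " + delimitedSize(column) + " bytes");
        System.out.println("separate:  " + separateSize(column) + " bytes");
    }
}
```

The difference between the two numbers will vary a lot with the column contents, so this is only useful as a quick sanity check, not as a benchmark.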

@Yongqiang: Did you turn on native compression when testing?

Some performance improvement tips from the profiling:
1. BytesRefArrayWritable should use a Java array (BytesRefWritable[]) instead of List<BytesRefWritable>
2. RCFile$Writer.columnBuffers should use a Java array (ColumnBuffer[]) instead of List<ColumnBuffer>
3. Add a method in BytesRefArrayWritable to return the BytesRefWritable[] so that RCFile$Writer.append
can operate on it directly.
1-3 will save us 10-15 seconds from B and D.
4. RCFile$Writer$ColumnBuffer.append should directly call DataOutputStream.write and WritableUtils.writeVLong
      public void append(BytesRefWritable data) throws IOException {
        data.writeDataTo(columnValBuffer);
        WritableUtils.writeVInt(valLenBuffer, data.getLength());
      }
4 will save 5-10 seconds from B and D.
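
To make tip 4 concrete, here is a minimal sketch of an append that writes straight to the streams. It does not use the Hive or Hadoop classes: ColumnBuffer below is a hypothetical stand-in, append takes a raw byte[] slice instead of a BytesRefWritable, and writeVarInt is a simple LEB128-style encoder used in place of WritableUtils.writeVLong (its byte layout is not the Hadoop one).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class DirectAppendSketch {

    /** Hypothetical stand-in for RCFile$Writer$ColumnBuffer. */
    static final class ColumnBuffer {
        final ByteArrayOutputStream valBytes = new ByteArrayOutputStream();
        final ByteArrayOutputStream lenBytes = new ByteArrayOutputStream();
        private final DataOutputStream columnValBuffer = new DataOutputStream(valBytes);
        private final DataOutputStream valLenBuffer = new DataOutputStream(lenBytes);

        /** Write the value bytes and the value length directly, with no extra call layers. */
        void append(byte[] data, int start, int length) throws IOException {
            columnValBuffer.write(data, start, length);
            writeVarInt(valLenBuffer, length);
        }
    }

    /** LEB128-style variable-length int; NOT the encoding used by Hadoop's WritableUtils. */
    static void writeVarInt(DataOutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.writeByte(value); // final byte, continuation bit clear
    }

    public static void main(String[] args) throws IOException {
        ColumnBuffer col = new ColumnBuffer();
        byte[] row = "hello,world".getBytes(StandardCharsets.US_ASCII);
        col.append(row, 0, 5); // "hello"
        col.append(row, 6, 5); // "world"
        System.out.println(col.valBytes.size() + " value bytes, "
                + col.lenBytes.size() + " length bytes");
    }
}
```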

Following the same route, if there are any Lists whose number of elements does not usually
change, we should use a Java array ([]) instead of a List.
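
As a sketch of what an array-backed container could look like (tips 1-3), the snippet below uses hypothetical stand-ins, BytesRef and BytesRefArray, rather than the real BytesRefWritable and BytesRefArrayWritable. The method name unsafeArray is also made up. The idea is just that the writer loops over a plain array with an explicit valid count instead of going through List.get().

```java
import java.util.Arrays;

public class ArrayBackedRowSketch {

    /** Hypothetical stand-in for BytesRefWritable: a slice of a byte array. */
    static final class BytesRef {
        final byte[] data;
        final int start;
        final int length;

        BytesRef(byte[] data, int start, int length) {
            this.data = data;
            this.start = start;
            this.length = length;
        }
    }

    /** Hypothetical stand-in for BytesRefArrayWritable, backed by a plain array. */
    static final class BytesRefArray {
        private BytesRef[] refs;
        private int valid; // number of slots in use

        BytesRefArray(int capacity) {
            refs = new BytesRef[capacity];
        }

        void set(int index, BytesRef ref) {
            if (index >= refs.length) {
                // Growth is rare when the column count is stable, which is the case the tip targets.
                refs = Arrays.copyOf(refs, Math.max(index + 1, refs.length * 2));
            }
            refs[index] = ref;
            valid = Math.max(valid, index + 1);
        }

        int size() {
            return valid;
        }

        /** Expose the backing array; callers must only read indices [0, size()). */
        BytesRef[] unsafeArray() {
            return refs;
        }
    }

    /** Writer-side loop over the raw array, with no List or iterator in the way. */
    static long totalLength(BytesRefArray row) {
        BytesRef[] refs = row.unsafeArray();
        long total = 0;
        for (int i = 0, n = row.size(); i < n; i++) {
            if (refs[i] != null) {
                total += refs[i].length;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, 3, 4, 5, 6};
        BytesRefArray row = new BytesRefArray(2);
        row.set(0, new BytesRef(bytes, 0, 2));
        row.set(1, new BytesRef(bytes, 2, 4));
        System.out.println("total bytes in row: " + totalLength(row));
    }
}
```

Handing out the backing array trades encapsulation for speed, so it probably only makes sense for internal callers such as the writer.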

Yongqiang, can you do steps 1-4 and try to replace List with Array?


> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch,
>                      hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch,
>                      hive-352-2009-4-22.patch, hive-352-2009-4-23.patch, hive-352-2009-4-27.patch, HIve-352-draft-2009-03-28.patch,
>                      Hive-352-draft-2009-03-30.patch
>
>
> Column based storage has been proven a better storage layout for OLAP.
> Hive does a great job on raw row oriented storage. In this issue, we will enhance Hive
> to support column based storage.
> Actually we have done some work on column based storage on top of HDFS; I think it will
> need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

