hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Thu, 23 Apr 2009 08:13:47 GMT

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701837#action_12701837 ]

He Yongqiang commented on HIVE-352:
-----------------------------------

Thanks, Zheng.
>>0. Did you try that with hadoop 0.17.0? "ant -Dhadoop.version=0.17.0 test" etc.
Yes.
>>1. Can you add your tests to ant, or post the testing scripts so that everybody can
easily reproduce the test results that you have got?
I will do that with the next patch.
>>2. For DistributedFileSystem, how big is the cluster? Is the file (the file size is
small so it's clearly a single block) local?
The cluster has six nodes. The file is not local. The test was run from my local machine and used HDFS.
>>3. It seems SequenceFile's compression is not as good as RCFile, although the data
is the same and also random. What is the exact record format in SequenceFile? Did you put
delimiters, or did you put the length of Strings?
Yes, it stores the length of each String. However, RCFile also stores string lengths.
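For reference, a minimal sketch of what "storing the length of strings" means here (illustrative class and field names, not the actual SequenceFile or RCFile value classes): each value is written as a length prefix followed by its bytes, instead of relying on a delimiter.
{noformat}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Sketch: length-prefixed encoding of one row's columns.
// Both the SequenceFile value and the RCFile column buffers carry the
// string length, so the per-value payload on disk is similar.
public class LengthPrefixedRow {
  public static byte[] encode(String[] columns) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (String col : columns) {
      byte[] utf8 = col.getBytes(StandardCharsets.UTF_8);
      out.writeInt(utf8.length);  // length prefix instead of a delimiter
      out.write(utf8);
    }
    out.flush();
    return bytes.toByteArray();
  }
}
{noformat}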
>>The approach of storing compressed data at creation and doing bulk decompression at reading
is not practical because it's very easy to run out of memory.
Yes, I encountered an out-of-memory error, so I added a check like this in RCFile.Writer's append:
{noformat}
if ((columnBufferSize + (this.bufferedRecords * this.columnNumber * 2) > COLUMNS_BUFFER_SIZE)
    || (this.bufferedRecords >= this.RECORD_INTERVAL)) {
  flushRecords();
}
{noformat}
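To make the intent of that check clearer, here is a rough sketch of the writer-side pattern (field and method names are illustrative and mirror the snippet above, not the exact RCFile.Writer members): append buffers values per column, tracks an estimated footprint, and flushes a record group once either the byte budget or the row-count budget is reached, so buffered compressed data never grows without bound.
{noformat}
import java.util.ArrayList;
import java.util.List;

// Illustrative buffer-and-flush pattern; COLUMNS_BUFFER_SIZE and
// RECORD_INTERVAL mirror the names in the snippet above.
public class BufferedColumnWriter {
  private static final int COLUMNS_BUFFER_SIZE = 4 * 1024 * 1024; // byte budget per record group
  private static final int RECORD_INTERVAL = 100;                 // max rows per record group

  private final int columnNumber;
  private final List<List<byte[]>> columnBuffers = new ArrayList<>();
  private int columnBufferSize = 0;  // data bytes buffered so far
  private int bufferedRecords = 0;   // rows buffered so far

  public BufferedColumnWriter(int columnNumber) {
    this.columnNumber = columnNumber;
    for (int i = 0; i < columnNumber; i++) {
      columnBuffers.add(new ArrayList<>());
    }
  }

  public void append(byte[][] row) {
    for (int i = 0; i < columnNumber; i++) {
      columnBuffers.get(i).add(row[i]);
      columnBufferSize += row[i].length;
    }
    bufferedRecords++;
    // Same guard as the snippet above: flush when the estimated footprint
    // (data bytes plus per-value bookkeeping) or the row count gets too large.
    if ((columnBufferSize + bufferedRecords * columnNumber * 2 > COLUMNS_BUFFER_SIZE)
        || (bufferedRecords >= RECORD_INTERVAL)) {
      flushRecords();
    }
  }

  private void flushRecords() {
    // In RCFile this is where each column buffer would be compressed and
    // written out as one record group; here we only reset the buffers.
    for (List<byte[]> col : columnBuffers) {
      col.clear();
    }
    columnBufferSize = 0;
    bufferedRecords = 0;
  }
}
{noformat}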

>>We've done BULK, and it showed great performance (1.6s to read and decompress 40MB
local file), but I suspect the compression ratio will be lower than NONBULK.
>>Can you compare the compression ratio of BULK and NONBULK, given different buffer
sizes and column numbers?
BULK and NONBULK (they refer to decompression) only apply to reads; they have nothing to do with writes,
so I don't think they will influence the compression ratio.
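To illustrate why this is read-side only, here is a rough sketch (using java.util.zip directly, not the actual RCFile reader API): the compressed bytes on disk are identical either way; BULK inflates an entire compressed column buffer in one pass, while NONBULK inflates only as many bytes as the next value needs.
{noformat}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.InflaterInputStream;

// Sketch contrasting BULK vs NONBULK decompression of one column buffer.
// The compressed bytes are written the same way in both cases, so the
// read-side choice cannot change the compression ratio.
public class ColumnDecompression {

  // BULK: decompress the whole column buffer into memory up front.
  public static byte[] bulkDecompress(byte[] compressedColumn) throws IOException {
    try (InflaterInputStream in =
             new InflaterInputStream(new ByteArrayInputStream(compressedColumn))) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }

  // NONBULK: keep the stream open and inflate only the bytes the next value
  // needs, trading some speed for a much smaller memory footprint.
  public static byte[] readNextValue(InflaterInputStream in, int valueLength)
      throws IOException {
    byte[] value = new byte[valueLength];
    int off = 0;
    while (off < valueLength) {
      int n = in.read(value, off, valueLength - off);
      if (n == -1) {
        throw new IOException("Unexpected end of compressed column buffer");
      }
      off += n;
    }
    return value;
  }
}
{noformat}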

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch,
hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch,
hive-352-2009-4-22.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column based storage has been proven to be a better storage layout for OLAP. 
> Hive does a great job on raw row oriented storage. In this issue, we will enhance Hive
to support column based storage. 
> Actually we have done some work on column based storage on top of HDFS; I think it will
need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

