hadoop-hive-dev mailing list archives

From He Yongqiang <heyongqi...@software.ict.ac.cn>
Subject Re: [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Sun, 19 Apr 2009 01:42:24 GMT

Agreed.
Can we have both?
Option 1 is absolutely better for high-selectivity filter clauses. With option 2, we
can skip loading unnecessary (compressed) columns into memory.
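To make the skip idea concrete, below is a minimal sketch of what the option 2
read path could look like (all names are hypothetical, nothing here is from the
patch): each row group stores one compressed chunk per column, and a reader
that is told which columns the query needs can seek past the rest.
{code}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Set;

// Hypothetical sketch of option 2 (column hinting), not code from the patch:
// each row group stores one compressed chunk per column, so a reader that is
// told which columns the query needs can skip the rest without loading them.
public class ColumnSkipSketch {

  // Row-group layout: for each column, an int length followed by that many
  // (compressed) bytes.
  static byte[] writeRowGroup(byte[][] columnChunks) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bos);
    for (byte[] chunk : columnChunks) {
      out.writeInt(chunk.length);
      out.write(chunk);
    }
    return bos.toByteArray();
  }

  // Materialize only the hinted columns; skipBytes() jumps over the others,
  // so they are never read into memory, let alone decompressed.
  static byte[][] readColumns(byte[] rowGroup, int numColumns,
                              Set<Integer> neededColumns) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(rowGroup));
    byte[][] result = new byte[numColumns][];
    for (int col = 0; col < numColumns; col++) {
      int len = in.readInt();
      if (neededColumns.contains(col)) {
        byte[] chunk = new byte[len];
        in.readFully(chunk);
        result[col] = chunk;
      } else {
        in.skipBytes(len); // a seek, not a read
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    byte[][] columns = {
        "col0".getBytes(), "col1".getBytes(), "col2".getBytes() };
    byte[] rowGroup = writeRowGroup(columns);
    byte[][] read = readColumns(rowGroup, 3, Set.of(1));
    System.out.println(new String(read[1])); // prints col1; col0/col2 skipped
  }
}
{code}
The point is that a skipped column costs only a seek, never a read or a
decompression.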
I have run a simple RCFile performance test on my local single machine. RCFile
seems to perform much better on reads than a block-compressed SequenceFile. I
think the performance improvement should be attributed to the skip strategy.
Below are coarse results comparing RCFile with SequenceFile (run locally):
{noformat}
All columns are random strings; each RCFile read fetches only one column.

Format        Columns  Rows    Write (ms)  On-disk size (bytes)  Read (ms)
RCFile        10       100000  9851        50527070              448
SequenceFile  10       100000  18405       52684063              9418
RCFile        25       100000  15112       126262141             467
SequenceFile  25       100000  45586       131355387             22013
{noformat}

I will post more detailed test results together with the next patch.
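
For option 1 (quoted below), I imagine the objects passed to the operators
could be wrappers roughly like this hypothetical one, which holds the
compressed bytes and inflates them only on first access:
{code}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch of option 1 (lazy decompression), not code from the
// patch: the object passed to operators keeps the compressed bytes and only
// inflates them on first access, so rows a filter discards never pay for it.
public class LazyColumnValue {
  private final byte[] compressed;
  private byte[] decompressed; // filled in lazily on first get()

  public LazyColumnValue(byte[] compressed) {
    this.compressed = compressed;
  }

  public byte[] get() throws IOException {
    if (decompressed == null) {
      InflaterInputStream in =
          new InflaterInputStream(new ByteArrayInputStream(compressed));
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      for (int n; (n = in.read(buf)) != -1; ) {
        out.write(buf, 0, n);
      }
      decompressed = out.toByteArray();
    }
    return decompressed;
  }

  public static void main(String[] args) throws IOException {
    // Compress a sample value with zlib.
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DeflaterOutputStream dos = new DeflaterOutputStream(bos);
    dos.write("value1".getBytes());
    dos.finish();

    LazyColumnValue v = new LazyColumnValue(bos.toByteArray());
    // Nothing has been decompressed yet; inflation happens only here:
    System.out.println(new String(v.get()));
  }
}
{code}
With such a wrapper, a row rejected by the filter on key would never trigger
decompression of value1 to value4.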

On 09-4-19 8:03 AM, "Zheng Shao (JIRA)" <jira@apache.org> wrote:

> 
>     [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700542#action_12700542 ]
> 
> Zheng Shao commented on HIVE-352:
> ---------------------------------
> 
> Two major approaches for the RCFileFormat to work are:
> 1. Lazy deserialization (and decompression): the objects passed around in the
> Hive operators can be wrappers of handles to underlying decompression streams,
> which will decompress the data on the fly.
> 2. Column hinting: let Hive tell the FileFormat which columns are needed and
> which are not.
> 
> There is a major benefit of Option 1 in a common case like this:
> {code}
> SELECT key, value1, value2, value3, value4 from columnarTable where key =
> 'xxyyzz';
> {code}
> If the selectivity of "key = 'xxyyzz'" is really high, we will end up
> decompressing very few blocks of value1 to value4.
> This is not possible with Option 2.
> 
> 
>> Make Hive support column based storage
>> --------------------------------------
>> 
>>                 Key: HIVE-352
>>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>>             Project: Hadoop Hive
>>          Issue Type: New Feature
>>            Reporter: He Yongqiang
>>            Assignee: He Yongqiang
>>         Attachments: hive-352-2009-4-15.patch, hive-352-2009-4-16.patch,
>> hive-352-2009-4-17.patch, HIve-352-draft-2009-03-28.patch,
>> Hive-352-draft-2009-03-30.patch
>> 
>> 
>> Column-based storage has been proven to be a better storage layout for OLAP.
>> Hive does a great job on raw row-oriented storage. In this issue, we will
>> enhance Hive to support column-based storage.
>> Actually, we have already done some work on column-based storage on top of
>> HDFS; I think it will need some review and refactoring to port it to Hive.
>> Any thoughts?


