hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Sun, 19 Apr 2009 07:14:47 GMT

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700556#action_12700556 ]

He Yongqiang commented on HIVE-352:
-----------------------------------

Agreed.
Can we have both?
1 is clearly better for highly selective filter clauses. With 2, we can skip loading unnecessary
(compressed) columns into memory.
I ran a simple RCFile performance test on a single local machine. RCFile appears to read
much faster than a block-compressed SequenceFile. I think the performance improvement
should be attributed to the skip strategy.
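
To illustrate why skipping helps, here is a minimal Python sketch of a column-grouped layout in which a reader seeks past the compressed bytes of columns it does not need. This is not RCFile's actual on-disk format; the header layout and the names write_row_group and read_column are hypothetical and chosen only for illustration. It also assumes string values that contain no newline characters.

```python
import io
import struct
import zlib


def write_row_group(rows):
    """Serialize rows column by column (hypothetical layout, not RCFile's).

    Layout: column count, then one compressed length per column,
    then each column's zlib-compressed values in order.
    """
    num_cols = len(rows[0])
    columns = []
    for c in range(num_cols):
        # Assumption: values never contain "\n", so it can act as a separator.
        raw = "\n".join(row[c] for row in rows).encode("utf-8")
        columns.append(zlib.compress(raw))
    header = struct.pack(">I", num_cols)
    header += b"".join(struct.pack(">I", len(col)) for col in columns)
    return header + b"".join(columns)


def read_column(buf, wanted):
    """Read a single column, seeking past every column stored before it."""
    stream = io.BytesIO(buf)
    (num_cols,) = struct.unpack(">I", stream.read(4))
    lengths = struct.unpack(">%dI" % num_cols, stream.read(4 * num_cols))
    # Skip strategy: jump over the unwanted columns without reading
    # or decompressing them.
    stream.seek(sum(lengths[:wanted]), io.SEEK_CUR)
    raw = zlib.decompress(stream.read(lengths[wanted]))
    return raw.decode("utf-8").split("\n")


rows = [["a1", "b1", "c1"], ["a2", "b2", "c2"]]
group = write_row_group(rows)
second_column = read_column(group, 1)
```

In this sketch only the requested column is decompressed, whereas a row-oriented, block-compressed layout has to decompress every value in a block to get at any one column.
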
Below are coarse results comparing RCFile with SequenceFile (run locally):
{noformat}
Write RCFile with 10 random string columns and 100000 rows cost 9851 milliseconds. And the file's on disk size is 50527070
Read only one column of a RCFile with 10 random string columns and 100000 rows cost 448 milliseconds.
Write SequenceFile with 10 random string columns and 100000 rows cost 18405 milliseconds. And the file's on disk size is 52684063
Read SequenceFile with 10 random string columns and 100000 rows cost 9418 milliseconds.
Write RCFile with 25 random string columns and 100000 rows cost 15112 milliseconds. And the file's on disk size is 126262141
Read only one column of a RCFile with 25 random string columns and 100000 rows cost 467 milliseconds.
Write SequenceFile with 25 random string columns and 100000 rows cost 45586 milliseconds. And the file's on disk size is 131355387
Read SequenceFile with 25 random string columns and 100000 rows cost 22013 milliseconds.
{noformat}
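
For a quick comparison, the ratios implied by the figures above can be worked out with a short Python sketch. The numbers are copied from the results; the dictionary keys and the ratios helper are hypothetical names used only here. Note that the RCFile read times are for a single column, while the SequenceFile read times are for the whole file, so the read ratio reflects the benefit of skipping columns rather than a like-for-like scan.

```python
# Timings (milliseconds) and sizes (bytes) taken from the results above,
# keyed by the number of columns in the test file.
results = {
    10: {
        "rcfile_write_ms": 9851,
        "rcfile_read_one_col_ms": 448,
        "rcfile_bytes": 50527070,
        "seq_write_ms": 18405,
        "seq_read_ms": 9418,
        "seq_bytes": 52684063,
    },
    25: {
        "rcfile_write_ms": 15112,
        "rcfile_read_one_col_ms": 467,
        "rcfile_bytes": 126262141,
        "seq_write_ms": 45586,
        "seq_read_ms": 22013,
        "seq_bytes": 131355387,
    },
}


def ratios(r):
    """Return SequenceFile-to-RCFile time ratios and the RCFile-to-SequenceFile size ratio."""
    return {
        "write_speedup": r["seq_write_ms"] / r["rcfile_write_ms"],
        "read_speedup": r["seq_read_ms"] / r["rcfile_read_one_col_ms"],
        "size_ratio": r["rcfile_bytes"] / r["seq_bytes"],
    }


summary = {cols: ratios(r) for cols, r in results.items()}
for cols, s in summary.items():
    print(
        f"{cols} columns: write {s['write_speedup']:.1f}x, "
        f"read {s['read_speedup']:.1f}x, size ratio {s['size_ratio']:.2f}"
    )
```

With these figures, the one-column RCFile read is roughly 21 times faster than the SequenceFile read at 10 columns and roughly 47 times faster at 25 columns. Writes are about 1.9 and 3.0 times faster, and the RCFile is about 4% smaller on disk in both cases.
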

I will post more detailed test results together with the next patch.

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive.
> Any thoughts?


