hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Thu, 23 Apr 2009 22:40:30 GMT

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702146#action_12702146
] 

He Yongqiang commented on HIVE-352:
-----------------------------------

>>Can we also get some numbers on the amount of memory usage? 
I reran the test locally (the same test as Zheng's, but with no native codec), using the local
fs and DefaultCodec. It read all columns of an RCFile with 80 columns and 100000 rows (size: 91849881
bytes).
The maximum memory usage is shown below (I ran the command 'ps -o vsz,rss,rsz,%mem -p 549'
once a minute):
     VSZ    RSS    RSZ %MEM
  766732  63472  63472 -3.0
BTW, my physical memory is 3GB.
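
For anyone who wants to repeat the memory measurement, the once-a-minute 'ps' sampling can be scripted. The following is only a sketch: the function names (parse_ps, sample, track_peak) are made up for illustration, it asks ps for vsz and rss only, and it assumes a ps that accepts '-o' and '-p' and prints one header line followed by one value line.

```python
import subprocess
import time


def parse_ps(output):
    """Parse two-line `ps -o ...` output (header + values) into a dict of floats."""
    lines = [ln for ln in output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    values = lines[1].split()
    return {name: float(value) for name, value in zip(header, values)}


def sample(pid):
    """Run ps once for the given pid and return the parsed columns."""
    out = subprocess.run(
        ["ps", "-o", "vsz,rss", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)


def track_peak(pid, interval_s=60, samples=10):
    """Sample the process periodically and keep the maximum seen per column."""
    peak = {}
    for _ in range(samples):
        try:
            current = sample(pid)
        except subprocess.CalledProcessError:
            break  # ps exits non-zero once the process is gone
        for name, value in current.items():
            peak[name] = max(peak.get(name, value), value)
        time.sleep(interval_s)
    return peak
```

Calling track_peak(549) would poll the pid used above; the sample count and interval are arbitrary defaults.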

>>Was this just a hdfs read or the measurement of a Hive query?
The test was just a file read test.

However, with no native codec, my results differ a lot from Zheng's: SequenceFile
does much worse in my test.
{noformat}
Write RCFile with 80 random string columns and 100000 rows cost 30643 milliseconds. And the file's on disk size is 91849881
Write SequenceFile with 80 random string columns and 100000 rows cost 62034 milliseconds. And the file's on disk size is 102521005
Read only one column of a RCFile with 80 random string columns and 100000 rows cost 703 milliseconds.
Read only first and last columns of a RCFile with 80 random string columns and 100000 rows cost 526 milliseconds.
Read all columns of a RCFile with 80 random string columns and 100000 rows cost 3131 milliseconds.
Read SequenceFile with 80  random string columns and 100000 rows cost 47876 milliseconds.
{noformat}
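
To put the timings above side by side, the ratios can be computed directly. This is just a sketch: the numbers are copied from the log above, and the variable names are illustrative.

```python
# Timings (milliseconds) and on-disk sizes (bytes) copied from the log above.
rcfile_write_ms = 30643
seqfile_write_ms = 62034
rcfile_read_all_ms = 3131
seqfile_read_ms = 47876
rcfile_size = 91849881
seqfile_size = 102521005

# How many times slower SequenceFile was than RCFile in this run.
write_ratio = seqfile_write_ms / rcfile_write_ms
read_ratio = seqfile_read_ms / rcfile_read_all_ms

# RCFile size as a fraction of the SequenceFile size.
size_fraction = rcfile_size / seqfile_size

print(f"write: SequenceFile took {write_ratio:.2f}x as long as RCFile")
print(f"read all columns: SequenceFile took {read_ratio:.2f}x as long as RCFile")
print(f"size: RCFile is {size_fraction:.1%} of the SequenceFile size")
```

In this run SequenceFile is roughly 2x slower to write and roughly 15x slower to read in full, which is the gap questioned below.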

Why does the native codec matter so much for SequenceFile and not for RCFile? It should influence
both RCFile and SequenceFile in the same way.

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch,
hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch,
hive-352-2009-4-22.patch, hive-352-2009-4-23.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

