hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HIVE-352) Make Hive support column based storage
Date Thu, 23 Apr 2009 10:14:47 GMT

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701876#action_12701876 ]

Zheng Shao edited comment on HIVE-352 at 4/23/09 3:14 AM:
----------------------------------------------------------

Running Yongqiang's tests with the Hadoop native library, using DefaultCodec for both RCFile and SequenceFile. The file is on the local file system.

It seems RCFile's read performance is around 2 times that of SequenceFile's, probably because we do bulk decompression and make one less copy of the data.
This result looks reasonable.
{code}
Write RCFile with 80 random string columns and 100000 rows cost 25464 milliseconds. And the file's on disk size is 91874941
Write SequenceFile with 80 random string columns and 100000 rows cost 35711 milliseconds. And the file's on disk size is 102521005
Read only one column of a RCFile with 80 random string columns and 100000 rows cost 594 milliseconds.
Read only first and last columns of a RCFile with 80 random string columns and 100000 rows cost 600 milliseconds.
Read all columns of a RCFile with 80 random string columns and 100000 rows cost 2227 milliseconds.
Read SequenceFile with 80 random string columns and 100000 rows cost 4343 milliseconds.
{code}
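The single-column and two-column reads above are fast because RCFile only has to decompress the columns that were actually requested. A minimal sketch of such a projected read, assuming the reader API shapes up roughly along the lines of the attached patch (RCFile.Reader plus column projection via ColumnProjectionUtils); the file path and column index are illustrative only:

{code}
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.RCFile;
import org.apache.hadoop.hive.serde2.ColumnProjectionUtils;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.LongWritable;

// Hedged sketch: assumes the reader API from the attached patch;
// the path and projected column index are illustrative.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path file = new Path("/tmp/test.rc");

// Ask the reader to materialize only column 0; other columns are skipped.
ArrayList<Integer> readCols = new ArrayList<Integer>();
readCols.add(0);
ColumnProjectionUtils.setReadColumnIDs(conf, readCols);

RCFile.Reader reader = new RCFile.Reader(fs, file, conf);
LongWritable rowID = new LongWritable();
BytesRefArrayWritable cols = new BytesRefArrayWritable();
while (reader.next(rowID)) {
  reader.getCurrentRow(cols);   // only the projected column is decompressed
  BytesRefWritable col0 = cols.get(0);
  // ... consume col0 ...
}
reader.close();
{code}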

These are the results using GzipCodec; not much difference.
{code}
Write RCFile with 80 random string columns and 100000 rows cost 26358 milliseconds. And the file's on disk size is 91931563
Write SequenceFile with 80 random string columns and 100000 rows cost 35802 milliseconds. And the file's on disk size is 102528154
Read only one column of a RCFile with 80 random string columns and 100000 rows cost 593 milliseconds.
Read only first and last columns of a RCFile with 80 random string columns and 100000 rows cost 626 milliseconds.
Read all columns of a RCFile with 80 random string columns and 100000 rows cost 2401 milliseconds.
Read SequenceFile with 80 random string columns and 100000 rows cost 4601 milliseconds.
{code}
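Swapping codecs on the SequenceFile side is just the standard Hadoop writer setup. A minimal sketch, with the path and key/value classes illustrative (use new DefaultCodec() in place of new GzipCodec() for the first run):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;

// Hedged sketch of the SequenceFile side: standard Hadoop API;
// the path and key/value classes are illustrative.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path file = new Path("/tmp/test.seq");

SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, file, LongWritable.class, BytesWritable.class,
    SequenceFile.CompressionType.BLOCK,
    new GzipCodec());               // or new DefaultCodec()
writer.append(new LongWritable(0), new BytesWritable("row".getBytes()));
writer.close();
{code}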

Each column is a random string whose length is drawn uniformly from 0 to 30, consisting of random uppercase and lowercase letters.
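The row generation just described can be sketched as follows (a minimal sketch; the helper name randomRow and the column-count argument are illustrative):

{code}
import java.util.Random;

// Hedged sketch of the test data described above: each row has 80 columns,
// each column a random string of length uniform in [0, 30] over [a-zA-Z].
static final Random rand = new Random();

static String[] randomRow(int columnCount) {   // called with columnCount = 80
  String[] row = new String[columnCount];
  for (int c = 0; c < columnCount; c++) {
    int len = rand.nextInt(31);                // uniform 0..30 inclusive
    StringBuilder sb = new StringBuilder(len);
    for (int i = 0; i < len; i++) {
      int k = rand.nextInt(52);                // 26 lowercase + 26 uppercase letters
      sb.append(k < 26 ? (char) ('a' + k) : (char) ('A' + (k - 26)));
    }
    row[c] = sb.toString();
  }
  return row;
}
{code}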


> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, hive-352-2009-4-23.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage.
> Actually, we have already done some work on column-based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

