hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Thu, 23 Apr 2009 08:19:47 GMT

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701839#action_12701839 ]

Zheng Shao commented on HIVE-352:

@Yongqiang: I found a place in the SequenceFile reader test that may improve performance
a lot: BytesRefWritable.readFields creates a new array for every row. That is expensive, and
it means the current comparison between RCFile and SequenceFile is not a fair one.

There are 3 ways to fix BytesRefWritable:
1. Add a boolean member "owned". Set it to true every time we create an array in readFields,
and don't create another array if owned is true and the current record is no larger than
the currently owned array. Also, set it to false every time set(...) is called.
2. Change the semantics of readFields directly: always reuse the bytes array if its length
is greater than or equal to that of the current record, and otherwise create a new one. This
is OK because people who use set(...) probably won't use readFields at all. Of course,
we need to put a comment on readFields and set() saying that readFields will corrupt the
array, so callers who rely on set() should not call readFields.
3. Use a completely different class hierarchy.

I would prefer to do 2 since it's the simplest way to go.
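
A minimal sketch of what option 2 might look like, for illustration only. ReusableBytesRef is
a hypothetical stand-in, not the actual BytesRefWritable, and the wire format (an int length
followed by the payload) is an assumption made to keep the example self-contained. It uses
only java.io so that it compiles without Hadoop on the classpath.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Hypothetical stand-in for BytesRefWritable; not the real Hive class. */
class ReusableBytesRef {
    private byte[] bytes = new byte[0];
    private int start = 0;
    private int length = 0;

    /** Points at a caller-supplied array without copying it. */
    public void set(byte[] data, int offset, int len) {
        bytes = data;
        start = offset;
        length = len;
    }

    /**
     * Reads one record, reusing the current array when it is large enough.
     * WARNING: this overwrites whatever array is currently held, including
     * one passed in through set(). Do not mix set() and readFields().
     * Assumed format: int length, then that many payload bytes.
     */
    public void readFields(DataInput in) throws IOException {
        int len = in.readInt();
        if (len > bytes.length) {
            // Allocate only when the record does not fit in the current array.
            bytes = new byte[len];
        }
        start = 0;
        length = len;
        in.readFully(bytes, 0, len);
    }

    /** Writes the record in the same assumed format that readFields reads. */
    public void write(DataOutput out) throws IOException {
        out.writeInt(length);
        out.write(bytes, start, length);
    }

    public byte[] getData() { return bytes; }
    public int getStart() { return start; }
    public int getLength() { return length; }
}
```

With this change, the array returned by getData() can be longer than the current record, so
callers have to honor getStart() and getLength() rather than getData().length.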

I hope this will improve the SequenceFile read performance a lot, and give RCFile and
SequenceFile a fair comparison.

Also, you might want to modify the write code to use the same logic - reuse the bytes array
if possible. Then the writes will be much faster as well.

> Make Hive support column based storage
> --------------------------------------
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch,
> hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch,
> hive-352-2009-4-22.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
> Column-based storage has proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive
> to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will
> need some review and refactoring to port it to Hive.
> Any thoughts?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
