hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Fri, 24 Apr 2009 07:18:31 GMT

https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702267#action_12702267

Zheng Shao commented on HIVE-352:


The reason that the native codec matters more for SequenceFile is probably that SequenceFile
uses compression differently from RCFile - for example, incremental compression/decompression.

I ran a test on our data set, which mainly contains around 40 string columns; the string
length is usually fixed for each column, ranging from 1 to 10. The result is that the seqfile
is much smaller than the rcfile - the seqfile is only around 55% of the size of the rcfile.
However, inside the rcfile I see a lot of repeated bytes - those are the field lengths for each
row. The rcfile is also slower, probably because it writes out more data than the seqfile.

1. Can you also compress the field-length columns? I tried compressing the rcfile again with
the gzip command line, and it became 41% of its current size - a lot smaller than the
seqfile. This means that, in general, RCFile can save a lot of space, because it is easier for a
compression algorithm to compress the lengths and the content of each column separately.
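The gzip experiment above can be sketched in code. Here is a minimal Java sketch (the class and method names are mine for illustration, not RCFile code): it gzips the same rows once with the per-row length byte interleaved with the values, and once with the lengths stored as their own column. The run of identical length bytes collapses almost entirely when compressed on its own, which is exactly the saving suggested above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class ColumnCompressionSketch {

    // Size of the input after gzip compression.
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.size();
    }

    // Returns {interleavedSize, separatedSize} for `rows` records of one
    // fixed-length (8-byte) string column.
    static int[] compare(int rows) throws IOException {
        Random rnd = new Random(42);
        ByteArrayOutputStream mixed = new ByteArrayOutputStream();   // length, value, length, value, ...
        ByteArrayOutputStream lengths = new ByteArrayOutputStream(); // all lengths together
        ByteArrayOutputStream values = new ByteArrayOutputStream();  // all values together
        byte[] val = new byte[8];
        for (int i = 0; i < rows; i++) {
            rnd.nextBytes(val);                  // stand-in for the column's content
            mixed.write(8);                      // per-row field length, repeated endlessly
            mixed.write(val, 0, val.length);
            lengths.write(8);
            values.write(val, 0, val.length);
        }
        int interleaved = gzipSize(mixed.toByteArray());
        int separated = gzipSize(lengths.toByteArray()) + gzipSize(values.toByteArray());
        return new int[] { interleaved, separated };
    }

    public static void main(String[] args) throws IOException {
        int[] r = compare(100_000);
        System.out.println("interleaved=" + r[0] + " separated=" + r[1]);
    }
}
```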

2. Also, I remember you changed the compression to be incremental, so the current solution
is a mix of BULK and NONBULK as I described above, which has memory problems. Since, as we
discussed, we would like to leave the NONBULK mode for later because of the amount of additional
work, can you change the code back to BULK compression? There is probably a performance loss
due to incremental compression, which can be avoided by bulk compression.
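The BULK versus incremental trade-off can be seen concretely with `java.util.zip` (a sketch under my own naming, not the patch's code): bulk compression feeds the whole column buffer to one deflate stream and flushes once at the end, while incremental compression sync-flushes after every record so each record is immediately decodable, paying a few bytes of flush overhead per record and giving the codec smaller runs to exploit.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class BulkVsIncremental {

    // BULK: one deflate stream over the whole buffer, flushed only at the end.
    static int bulkSize(byte[][] records) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos)) {
            for (byte[] r : records) out.write(r);
        }
        return bos.size();
    }

    // Incremental: sync-flush after each record so a reader can decompress
    // record by record; every flush emits an extra sync marker.
    static int incrementalSize(byte[][] records) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        Deflater def = new Deflater();
        try (DeflaterOutputStream out =
                 new DeflaterOutputStream(bos, def, 512, /*syncFlush=*/true)) {
            for (byte[] r : records) {
                out.write(r);
                out.flush(); // forces a deflate sync-flush after this record
            }
        }
        def.end();
        return bos.size();
    }

    // Repetitive sample rows, loosely imitating fixed-format string columns.
    static byte[][] sampleRecords(int n) {
        byte[][] records = new byte[n][];
        for (int i = 0; i < n; i++) {
            records[i] = ("row-" + (i % 10) + ",some repetitive field value\n")
                             .getBytes(StandardCharsets.UTF_8);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        byte[][] records = sampleRecords(10_000);
        System.out.println("bulk=" + bulkSize(records)
                           + " incremental=" + incrementalSize(records));
    }
}
```

On repetitive data the per-record sync markers alone add tens of kilobytes over the bulk stream, which is the kind of loss the comment above expects BULK compression to avoid.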

> Make Hive support column based storage
> --------------------------------------
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch,
hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch,
hive-352-2009-4-22.patch, hive-352-2009-4-23.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive
to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will
need some review and refactoring to port it to Hive.
> Any thoughts?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
