hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Thu, 19 Mar 2009 07:17:50 GMT

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683348#action_12683348 ]

Joydeep Sen Sarma commented on HIVE-352:

> B2.2 is easier to implement, because we don't have the problem of splitting different
> columns of the same block into multiple mappers.

for B2.1 - we may be able to control when SequenceFile writes out sync markers (or at least
we should investigate whether that's easy enough to do by extending SequenceFile). the advantage
of not having to read the columns a query doesn't need seems pretty significant.

OTOH - one can also easily imagine that SequenceFile does not copy data into a BytesWritable
- rather, we have a special Writable structure such that when read is invoked on it, it
just copies the reference to the underlying byte buffer. that way there are no copies of
data in the SequenceFile reader, and the application (in this case the columnar format reader)
is able to skip to the relevant sections of data without touching the irrelevant columns.
if we do it this way - B2.2 has no performance downside.
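
a minimal sketch of the "copy the reference, not the bytes" idea, using only java.nio rather
than Hadoop's Writable interface. everything here is an assumption for illustration: the class
name ColumnSliceSketch, the pack/columnView helpers, and the record layout (a column count,
then one length per column, then the column bytes) are hypothetical and are not the format
proposed in HIVE-352.

```java
import java.nio.ByteBuffer;

// Hypothetical record layout (an assumption, not the HIVE-352 format):
// [int numColumns][int len_0]...[int len_{n-1}][bytes of col 0]...[bytes of col n-1]
final class ColumnSliceSketch {

    // Pack per-column byte arrays into one record buffer (helper for the example).
    static byte[] pack(byte[][] columns) {
        int total = 4 + 4 * columns.length;
        for (byte[] c : columns) {
            total += c.length;
        }
        ByteBuffer out = ByteBuffer.allocate(total);
        out.putInt(columns.length);
        for (byte[] c : columns) {
            out.putInt(c.length);
        }
        for (byte[] c : columns) {
            out.put(c);
        }
        return out.array();
    }

    // Return a view over column `col` that shares the record's backing array.
    // Only the small header is read; the bytes of other columns are never touched or copied.
    static ByteBuffer columnView(byte[] record, int col) {
        ByteBuffer header = ByteBuffer.wrap(record);
        int numColumns = header.getInt();
        if (col < 0 || col >= numColumns) {
            throw new IndexOutOfBoundsException("column " + col);
        }
        int offset = 4 + 4 * numColumns; // first byte after the header
        int length = 0;
        for (int i = 0; i <= col; i++) {
            length = header.getInt();
            if (i < col) {
                offset += length; // skip over an earlier column without reading it
            }
        }
        return ByteBuffer.wrap(record, offset, length).slice();
    }
}
```

the returned ByteBuffer is a window onto the same byte[] the reader already holds, so selecting
one column costs a header scan plus an offset computation and no data copy.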

regarding the compression related questions raised by Yongqiang - it seems to me that trying
out the most generic compression algorithm (gzip) first is better - trying to specify or infer
the best compression technique per column is much harder and is something that can be done
later. one thing we could do to limit the number of open codecs is to simply accumulate all the
data uncompressed in a buffer per column and then do the compression in one shot at the end
(once we think enough data is accumulated) using just one codec object. this obviously seems
non-optimal from the point of view of having to scan the data multiple times - OTOH - there
were known issues with older versions of hadoop with lots of open codecs.
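
a rough sketch of the buffer-then-compress idea, assuming java.util.zip.Deflater as a stand-in
for a Hadoop codec (it produces zlib-wrapped DEFLATE, the algorithm gzip uses, not the gzip
container itself). the class name ColumnBatchCompressor and its methods are hypothetical; the
point is only that one compressor object is reused across all columns at flush time.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: one uncompressed buffer per column, one shared compressor.
final class ColumnBatchCompressor {
    private final ByteArrayOutputStream[] columns;
    private final Deflater deflater = new Deflater(); // the single "codec" object

    ColumnBatchCompressor(int numColumns) {
        columns = new ByteArrayOutputStream[numColumns];
        for (int i = 0; i < numColumns; i++) {
            columns[i] = new ByteArrayOutputStream();
        }
    }

    // Accumulate a value for one column, uncompressed.
    void append(int col, byte[] value) {
        columns[col].write(value, 0, value.length);
    }

    // Compress every column buffer in turn with the same Deflater, then clear the buffers.
    byte[][] flush() {
        byte[][] out = new byte[columns.length][];
        byte[] chunk = new byte[4096];
        for (int i = 0; i < columns.length; i++) {
            deflater.reset(); // reuse the compressor instead of opening a new one
            deflater.setInput(columns[i].toByteArray());
            deflater.finish();
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            while (!deflater.finished()) {
                int n = deflater.deflate(chunk);
                compressed.write(chunk, 0, n);
            }
            out[i] = compressed.toByteArray();
            columns[i].reset();
        }
        return out;
    }

    // Release the native resources held by the shared compressor.
    void close() {
        deflater.end();
    }

    // Decompress one column block (used here only to check the round trip).
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or dictionary-dependent input");
                }
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
    }
}
```

the cost mentioned above is visible in flush(): each column's bytes are written once into the
buffer and then scanned again by the compressor, in exchange for holding a single compressor
open no matter how many columns there are.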

> Make Hive support column based storage
> --------------------------------------
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
> column based storage has been proven a better storage layout for OLAP.
> Hive does a great job on raw row oriented storage. In this issue, we will enhance hive
> to support column based storage.
> Actually we have done some work on column based storage on top of hdfs, i think it will
> need some review and refactoring to port it to Hive.
> Any thoughts?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
