hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Tue, 24 Mar 2009 15:59:53 GMT

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688725#action_12688725 ]

Joydeep Sen Sarma commented on HIVE-352:
----------------------------------------

It's not clear to me that we need to ditch SequenceFile in the short term. Like Prasad said,
we can impose our own structure on the SequenceFile record, which would allow skipping unnecessary
data.

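We obviously cannot use SequenceFile record compression, since it compresses each value as a whole and a single column could not be decompressed on its own. There are two approaches you can take: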

1. Keep using a BytesWritable (or Text) for the 'value' part and impose your own layout inside
it, so that the ColumnarSerDe only needs to seek to and decompress the relevant columns.
This does require one copy of the entire data from the SequenceFile 'value' to the BytesWritable.
2. Use the Hadoop serializer framework (see src/core/org/apache/hadoop/io/serializer) and
get Hadoop to pass you the input stream directly (for reading the 'value' part). The custom
deserializer can then be configured via Hive's plan to copy out only the bytes that are of
interest to the Hive plan.
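
To make #1 concrete, here is a minimal sketch of one possible layout inside the value bytes. Everything in it is an assumption for illustration, not something Hive or Hadoop defines: the class name ColumnarValueSketch, the header format (a column count followed by one compressed length per column), and the choice of java.util.zip for per-column compression. A plain byte[] stands in for the payload of the BytesWritable so the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical layout for one SequenceFile 'value':
//   [int n][int len_0] ... [int len_{n-1}][block_0] ... [block_{n-1}]
// where block_i is column i compressed on its own.
final class ColumnarValueSketch {

    // Compress every column separately and prepend the length header.
    static byte[] encode(byte[][] columns) {
        byte[][] blocks = new byte[columns.length][];
        int total = 4 + 4 * columns.length;
        for (int i = 0; i < columns.length; i++) {
            blocks[i] = deflate(columns[i]);
            total += blocks[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(total);
        buf.putInt(columns.length);
        for (byte[] b : blocks) buf.putInt(b.length);
        for (byte[] b : blocks) buf.put(b);
        return buf.array();
    }

    // Seek to one column's block via the header and decompress only that block.
    static byte[] readColumn(byte[] value, int column) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        int n = buf.getInt();
        if (column < 0 || column >= n) {
            throw new IllegalArgumentException("no such column: " + column);
        }
        int offset = 4 + 4 * n;   // first byte after the header
        int length = 0;
        for (int i = 0; i <= column; i++) {
            length = buf.getInt();
            if (i < column) offset += length;   // skip earlier blocks untouched
        }
        return inflate(value, offset, length);
    }

    private static byte[] deflate(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        while (!d.finished()) {
            out.write(chunk, 0, d.deflate(chunk));
        }
        d.end();
        return out.toByteArray();
    }

    private static byte[] inflate(byte[] data, int offset, int length) {
        Inflater inf = new Inflater();
        inf.setInput(data, offset, length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        try {
            while (!inf.finished()) {
                int k = inf.inflate(chunk);
                out.write(chunk, 0, k);
                if (k == 0 && !inf.finished()
                        && (inf.needsInput() || inf.needsDictionary())) {
                    throw new IllegalStateException("truncated column block");
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt column block", e);
        } finally {
            inf.end();
        }
        return out.toByteArray();
    }
}
```

With this layout, a call such as ColumnarValueSketch.readColumn(value, 1) walks only the small header and inflates a single block; the other columns' compressed bytes are never decompressed, although they have already been copied into the value buffer as noted in #1.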

#2 is obviously more complicated, and in practice straight-line data copies of hot data are
not that expensive (since Hadoop has already done a CRC check on all this data, and it's typically
already in processor caches and fast to scan again).
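
For comparison, the core of #2 could look roughly like the sketch below: the reader is handed a stream and copies out only the columns the plan asks for, skipping the rest. This is an illustration under assumptions, not Hadoop's serializer API: a plain java.io.InputStream stands in for whatever stream Hadoop would pass, and the class name SelectiveStreamReaderSketch, the BitSet of wanted columns, and the record layout (column count, then one length per column, then the uncompressed column bytes) are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.BitSet;

// Hypothetical record layout on the stream:
//   [int n][int len_0] ... [int len_{n-1}][col_0 bytes] ... [col_{n-1} bytes]
final class SelectiveStreamReaderSketch {

    // Helper that writes one record in the layout above (used to build test input).
    static byte[] write(byte[][] columns) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(columns.length);
            for (byte[] c : columns) out.writeInt(c.length);
            for (byte[] c : columns) out.write(c);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Copy out only the wanted columns; entries for skipped columns stay null.
    static byte[][] readWanted(InputStream in, BitSet wanted) {
        try {
            DataInputStream data = new DataInputStream(in);
            int n = data.readInt();
            int[] lengths = new int[n];
            for (int i = 0; i < n; i++) lengths[i] = data.readInt();
            byte[][] result = new byte[n][];
            for (int i = 0; i < n; i++) {
                if (wanted.get(i)) {
                    result[i] = new byte[lengths[i]];
                    data.readFully(result[i]);   // the only copy that is made
                } else {
                    skipFully(data, lengths[i]); // no buffer allocated for this column
                }
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // skipBytes may skip fewer bytes than requested, so loop until done.
    private static void skipFully(DataInputStream in, int count) throws IOException {
        int remaining = count;
        while (remaining > 0) {
            int skipped = in.skipBytes(remaining);
            if (skipped == 0) {
                if (in.read() < 0) throw new EOFException("record ended early");
                skipped = 1;
            }
            remaining -= skipped;
        }
    }
}
```

Compared with #1, nothing is copied for the unwanted columns, but the reader now has to be told which columns the plan needs before it touches the stream, which is where the extra complexity comes from.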

So I would try out #1 to begin with.

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>
> Column-based storage has proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive
> to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will
> need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

