hadoop-hive-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-352) Make Hive support column based storage
Date Tue, 17 Mar 2009 16:42:50 GMT

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682716#action_12682716 ]

Joydeep Sen Sarma commented on HIVE-352:
----------------------------------------

Thanks for taking this on. This could be pretty awesome.

Traditionally, the arguments for columnar storage have been limited 'scan bandwidth' and better compression.
In practice - we see that scan bandwidth has two components:
1. disk/file-system bandwidth to read data
2. compute cost to scan data

Most columnar stores optimize for both (especially because in shared-disk architectures
#1 is at a premium). However, our limited experience suggests that in Hadoop #1 is almost
infinite, while #2 can still be a bottleneck. (It is possible that this observation holds because
of high Hadoop/Java compute overheads - regardless, this seems to be the reality.)
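
As a toy illustration of #2 (everything here - the class name, the ^A delimiter, the field
index - is made up for the example): even with unlimited #1, a row-wise scan burns CPU decoding
every field of every row to answer a one-column query, while a columnar scan only touches the
column it needs.

import java.nio.charset.StandardCharsets;

public class ScanCostSketch {

    // Row-oriented: each row's bytes must be decoded and split just to
    // reach field 2 - this is the per-row CPU cost of point #2.
    static long sumFieldRowWise(byte[][] serializedRows) {
        long sum = 0;
        for (byte[] row : serializedRows) {
            String[] fields = new String(row, StandardCharsets.UTF_8).split("\u0001");
            sum += Long.parseLong(fields[2]);
        }
        return sum;
    }

    // Column-oriented: field 2's stream has already been isolated, so the
    // scan is a tight loop over just that column's values.
    static long sumFieldColumnWise(long[] column2) {
        long sum = 0;
        for (long v : column2) {
            sum += v;
        }
        return sum;
    }
}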

Given this, I like the idea of a scheme where columns are stored as independent streams inside
a block-oriented file format (each file block contains a set of rows; inside a block, however,
the organization is by column). This does not optimize for #1 - but it does optimize for #2
(potentially in conjunction with Hive's interfaces for getting one column at a time from the
IO libraries). It also gives us nearly equivalent compression.
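
A rough sketch of that layout (field names are hypothetical, not a committed format): a block
covers a group of rows, and the bytes within it are laid out as one contiguous chunk per
column, so a reader can decompress only the columns it needs and skip the rest.

class ColumnChunk {
    int columnId;
    byte[] compressedValues;   // all values of one column for this block's rows
}

class ColumnarBlock {
    int numRows;               // how many rows this block covers
    long[] columnOffsets;      // byte offset of each column chunk within the block
    ColumnChunk[] columns;     // one chunk per column, stored back to back
}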

(The alternative scheme of having different file(s) per column is also complicated by the
fact that locality is almost impossible to ensure, and there is no reasonable way of asking
HDFS to colocate segments of different files in the near future.)

--

I would love to understand how you are planning to approach this. Will we still use SequenceFiles
as a container, or should we ditch them? (SequenceFile wasn't a great fit for Hive, given that
we don't use the key field, but it was the best thing we could find.) We have also seen that
having a number of open codecs can hurt memory usage. That's one open question for me: can we
actually afford to open N concurrent compressed streams (assuming each column is stored
compressed separately)? See the sketch below.
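
To spell out the concern (a sketch only - GZIP stands in for whatever codec is configured, and
the names are illustrative): every open compressed column stream carries its own decompressor
state and buffers, so reading N columns concurrently costs roughly N times one codec's footprint.

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

class PerColumnStreams {
    static InputStream[] open(InputStream[] rawColumnStreams) throws IOException {
        InputStream[] readers = new InputStream[rawColumnStreams.length];
        for (int i = 0; i < rawColumnStreams.length; i++) {
            // each GZIPInputStream allocates its own inflater and window buffers,
            // so memory grows linearly with the number of open columns
            readers[i] = new GZIPInputStream(rawColumnStreams[i]);
        }
        return readers;
    }
}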

It also seems that one could define a ColumnarInputFormat/OutputFormat as a generic API with
different implementations and different pluggable containers underneath - supporting a scheme
of either file-per-column or columnar-within-a-block. In that sense we could build something
more generic for Hadoop (and then just make sure that Hive's lazy SerDe uses the columnar
API for data access, instead of the row-based API exposed by the current InputFormat).
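
One possible shape for such an API (purely hypothetical - none of these names exist in Hadoop
or Hive): a reader that takes the column projection up front and hands back per-column batches.
Both a file-per-column container and a columnar-block container could implement it, and a lazy
SerDe could consume it instead of the row-at-a-time record reader.

import java.io.Closeable;
import java.io.IOException;

interface ColumnarReader extends Closeable {
    // declare the projection up front so IO/decompression is limited to it
    void selectColumns(int[] columnIds) throws IOException;

    // advance to the next batch of rows; returns false at end of split
    boolean nextBatch() throws IOException;

    // raw bytes of one selected column for the current batch
    byte[] getColumn(int columnId) throws IOException;
}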

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: he yongqiang
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive
> to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will
> need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

