hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-352) Make Hive support column based storage
Date Mon, 27 Apr 2009 11:51:30 GMT

     [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-352:
------------------------------

    Attachment: hive-352-2009-4-27.patch

hive-352-2009-4-27.patch changes back to bulk compression and now also compresses the key part.

Here are the results on TPC-H's lineitem table:
Direct (incremental) compression, key part not compressed:
274982705   hdfs://10.61.0.160:9000/user/hdfs/tpch1G_rc
Buffer first, then compress (bulk compression), key part compressed:
188401365   hdfs://10.61.0.160:9000/user/hdfs/tpch1G_newRC
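
For reference, the relative saving implied by the two sizes above can be worked out with a few lines of Java. This is only a sketch: it assumes both figures are byte counts (the message does not state the unit), and the class and method names are made up for illustration.

```java
public class SizeComparison {

    /** Percentage by which `after` is smaller than `before`. */
    static double reductionPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        long incremental = 274982705L; // tpch1G_rc: incremental compression, key not compressed
        long bulk = 188401365L;        // tpch1G_newRC: bulk compression, key compressed

        double ratio = (double) bulk / incremental;
        System.out.printf("size ratio = %.3f, reduction = %.1f%%%n",
                ratio, reductionPercent(incremental, bulk));
    }
}
```

Under that assumption, the new layout is roughly 31% smaller than the previous one.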


BTW, I also tried to implement direct (incremental) compression and to decompress a
value buffer's columns part by part. But at the last step (implementing ValueBuffer's
readFields), I found that it is not easy to do. We hold only one InputStream
to the underlying file, we would need to seek back and forth to decompress part of each column,
and we would also need to hold one decompression stream per column. If we seek the InputStream,
the decompression streams become corrupt.
To avoid all this, we would need to read the compressed data of all needed columns into memory
and decompress in memory. But we would still need one decompression stream per column. I stopped
implementing this at the last step; if it is needed, I can finish it.
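
The in-memory approach described above could look roughly like the following. This is a minimal sketch and not the code in the patch. It uses the JDK's java.util.zip streams in place of Hadoop's compression codecs, and the class and method names (ColumnDecompressSketch, readColumns, and so on) are hypothetical. Each needed column's compressed bytes are assumed to have already been read into a byte array, and each column gets its own decompression stream, so no seeking on the underlying file stream is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class ColumnDecompressSketch {

    /** Bulk-compress one column's already-buffered bytes in a single pass. */
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(raw);
        }
        return out.toByteArray();
    }

    /** Decompress one column from an in-memory buffer using its own stream. */
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    /** Decompress only the requested columns; skipped columns stay null. */
    static byte[][] readColumns(byte[][] compressedColumns, int[] needed) throws IOException {
        byte[][] result = new byte[compressedColumns.length][];
        for (int col : needed) {
            result[col] = decompress(compressedColumns[col]);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Three toy columns, each compressed independently.
        byte[][] compressed = {
            compress("1,2,3".getBytes(StandardCharsets.UTF_8)),
            compress("a,b,c".getBytes(StandardCharsets.UTF_8)),
            compress("x,y,z".getBytes(StandardCharsets.UTF_8)),
        };
        // Read only columns 0 and 2; column 1 is never decompressed.
        byte[][] cols = readColumns(compressed, new int[] {0, 2});
        System.out.println(new String(cols[0], StandardCharsets.UTF_8));
        System.out.println(cols[1] == null);
        System.out.println(new String(cols[2], StandardCharsets.UTF_8));
    }
}
```

The trade-off is the one noted above: the compressed bytes of every needed column must fit in memory at once, and there is still one decompression stream per column.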

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch,
> hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch,
> hive-352-2009-4-22.patch, hive-352-2009-4-23.patch, hive-352-2009-4-27.patch, HIve-352-draft-2009-03-28.patch,
> Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive
> to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will
> need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

