hive-dev mailing list archives

From "He Yongqiang (JIRA)" <>
Subject [jira] Updated: (HIVE-461) Optimize RCFile reading by using column pruning results
Date Tue, 26 May 2009 12:29:46 GMT


He Yongqiang updated HIVE-461:

    Attachment: hive-461-2009-05-26.patch

A first try. The main modifications are in ColumnPruner, HiveInputFormat, SelectOperator,
and ExecDriver (one line).
I also changed RCFile to set accepted column ids instead of skip column ids, and updated the
test cases to pass in accepted column ids.
hive-461-2009-05-26.patch works for simple queries like "insert overwrite table rc2 select rc1.col1,
rc1.col2 from rc1", but has not been tested with complex queries.
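
To illustrate the difference between passing "skip column ids" and "accepted column ids" to a reader, here is a minimal sketch. It is not taken from the patch: the class AcceptedColumnsSketch and its methods acceptedMask, skippedToAccepted, and readRow are hypothetical names, and the row is modeled as a plain String array rather than RCFile's real on-disk layout. It only assumes that the column pruner yields the ids of the columns a query references (0 and 1 for the example query above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from the patch or from RCFile.
class AcceptedColumnsSketch {

    // Build a per-column mask: mask[i] is true when column i must be decoded.
    static boolean[] acceptedMask(int totalColumns, List<Integer> acceptedIds) {
        boolean[] mask = new boolean[totalColumns];
        for (int id : acceptedIds) {
            if (id < 0 || id >= totalColumns) {
                throw new IllegalArgumentException("column id out of range: " + id);
            }
            mask[id] = true;
        }
        return mask;
    }

    // Convert the older "skip" formulation into the "accepted" one.
    static List<Integer> skippedToAccepted(int totalColumns, List<Integer> skippedIds) {
        List<Integer> accepted = new ArrayList<>();
        for (int i = 0; i < totalColumns; i++) {
            if (!skippedIds.contains(i)) {
                accepted.add(i);
            }
        }
        return accepted;
    }

    // Model of a pruned read: columns outside the mask are never touched
    // and are returned as null instead of being decoded.
    static String[] readRow(String[] storedColumns, boolean[] mask) {
        String[] row = new String[storedColumns.length];
        for (int i = 0; i < storedColumns.length; i++) {
            if (mask[i]) {
                row[i] = storedColumns[i];
            }
        }
        return row;
    }

    public static void main(String[] args) {
        // A table with four columns; the query references only col1 and col2 (ids 0 and 1).
        boolean[] mask = acceptedMask(4, Arrays.asList(0, 1));
        String[] row = readRow(new String[] {"a", "b", "c", "d"}, mask);
        System.out.println(Arrays.toString(row));
    }
}
```

With an accepted list, the reader only needs the ids that the pruner already computed; with a skip list, the caller would first have to know the total column count in order to enumerate everything else.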

> Optimize RCFile reading by using column pruning results
> -------------------------------------------------------
>                 Key: HIVE-461
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 0.4.0
>            Reporter: Zheng Shao
>            Assignee: He Yongqiang
>         Attachments: hive-461-2009-05-26.patch
> RCFile is a column-based file format introduced in HIVE-352. Column-based storage has
shown better compression ratio. On our internal data set (30 columns, most of them are short
integer strings), we are seeing gzip-compressed RCFile to be 20%+ smaller than gzip-compressed
> RCFile also has the potential to improve the reading efficiency a lot since it compresses
each column separately.
> We should integrate RCFile with the column pruning results from Hive to make the reading

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.