hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-819) Add lazy decompress ability to RCFile
Date Sat, 19 Sep 2009 15:25:16 GMT

    [ https://issues.apache.org/jira/browse/HIVE-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757667#action_12757667
] 

He Yongqiang commented on HIVE-819:
-----------------------------------

>>1) In RCFile.c:307 it seems decompress() can be called multiple times, and the function
doesn't check whether the data is already decompressed (and, if so, return early). This may
not cause a problem in this diff, since the callers check whether the data is decompressed
before calling decompress(), but it is a public function and nothing prevents future callers
from calling it twice. So it may be better to implement this check inside decompress() itself.

The only entry point into RCFile -> LazyDecompressionCallbackImpl's decompress() is from BytesRefWritable.
If we check whether the data is already decompressed inside BytesRefWritable, do we also need
to add that check in LazyDecompressionCallbackImpl?
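The guard being discussed can be sketched as follows. This is an illustrative, idempotent decompress() in the style of LazyDecompressionCallbackImpl; the class and field names below are stand-ins for the sake of the example, not the actual Hive implementation.

```java
// Sketch of an idempotent decompress() guard. Repeated calls return the
// cached result instead of decompressing again. Names are illustrative.
public class LazyDecompressSketch {
    private final byte[] compressed;
    private byte[] decompressed; // null until the first decompress() call

    public LazyDecompressSketch(byte[] compressed) {
        this.compressed = compressed;
    }

    public byte[] decompress() {
        if (decompressed != null) {
            return decompressed; // already done; calling twice is a no-op
        }
        decompressed = doDecompress(compressed);
        return decompressed;
    }

    private byte[] doDecompress(byte[] data) {
        // Stand-in for the real codec call (e.g. a Hadoop Decompressor).
        return data.clone();
    }
}
```

With the check inside decompress() itself, callers such as BytesRefWritable would no longer need their own "already decompressed" bookkeeping, at the cost of the minor duplication mentioned below.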

>>2) Also, in the same decompress() function, it seems it doesn't work correctly when the
column is not compressed. Can you double-check it?
From my tests, it works correctly for uncompressed data.

>>3)
added tests:
{noformat}
DROP TABLE rcfileTableLazyDecompress;
CREATE table rcfileTableLazyDecompress (key STRING, value STRING) STORED AS RCFile;

FROM src
INSERT OVERWRITE TABLE rcfileTableLazyDecompress SELECT src.key, src.value LIMIT 10;

SELECT key, value FROM rcfileTableLazyDecompress where key > 238;

SELECT key, value FROM rcfileTableLazyDecompress where key > 238 and key < 400;

SELECT key, count(1) FROM rcfileTableLazyDecompress where key > 238 group by key;

set mapred.output.compress=true;
set hive.exec.compress.output=true;

FROM src
INSERT OVERWRITE TABLE rcfileTableLazyDecompress SELECT src.key, src.value LIMIT 10;

SELECT key, value FROM rcfileTableLazyDecompress where key > 238;

SELECT key, value FROM rcfileTableLazyDecompress where key > 238 and key < 400;

SELECT key, count(1) FROM rcfileTableLazyDecompress where key > 238 group by key;

set mapred.output.compress=false;
set hive.exec.compress.output=false;

DROP TABLE rcfileTableLazyDecompress;
{noformat}

Ning, thanks for your suggestions! Did I miss tests for any of your comments?
For the check that avoids calling decompress() multiple times, what do you think about moving
the check from BytesRefWritable to LazyDecompressionCallbackImpl? There will still be
some minor duplication of the check.

> Add lazy decompress ability to RCFile
> -------------------------------------
>
>                 Key: HIVE-819
>                 URL: https://issues.apache.org/jira/browse/HIVE-819
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>             Fix For: 0.5.0
>
>         Attachments: hive-819-2009-9-12.patch
>
>
> This is especially useful for filter scanning.
> For example, for the query 'select a, b, c from table_rc_lazydecompress where a>1;' we
only need to decompress the block data of columns b and c when some row's column 'a' in that
block satisfies the filter condition.
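The filter-driven laziness described above can be sketched as follows. The classes here are illustrative stand-ins, not Hive's actual RCFile reader API: each column's block stays compressed until something actually reads it, so a non-matching filter column means the other blocks are never decoded.

```java
import java.util.Arrays;

// Illustrative sketch of lazy, per-column-block decompression in a row group.
public class LazyScanSketch {
    static class ColumnBlock {
        private final byte[] compressed;
        private byte[] decompressed; // decoded lazily, at most once

        ColumnBlock(byte[] compressed) { this.compressed = compressed; }

        byte[] get() {
            if (decompressed == null) {
                // Stand-in for the real codec; identity "decompression" here.
                decompressed = compressed.clone();
            }
            return decompressed;
        }

        boolean isDecompressed() { return decompressed != null; }
    }

    public static void main(String[] args) {
        ColumnBlock a = new ColumnBlock(new byte[]{0, 1, 2, 3});
        ColumnBlock b = new ColumnBlock(new byte[]{9, 8, 7, 6});

        // Evaluate the filter on column 'a' first; touch 'b' only on a match.
        boolean anyMatch = false;
        for (byte v : a.get()) {
            if (v > 1) { anyMatch = true; break; }
        }
        if (anyMatch) {
            System.out.println(Arrays.toString(b.get()));
        }
        // If no row of 'a' had matched, b's block would never be decompressed.
    }
}
```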

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

