hadoop-hive-dev mailing list archives

From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-759) add hive.intermediate.compression.codec option
Date Tue, 18 Aug 2009 05:57:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744377#action_12744377 ]

Zheng Shao commented on HIVE-759:
---------------------------------

If the user specifies lzo and lzo cannot be loaded, we should output an error instead of
falling back to no compression. Falling back would silently hide the problem from the user.
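
A minimal sketch of the fail-loudly behavior suggested above, not the code in the attached
patches. The names CodecResolver, CodecLoadException and resolveCodec are hypothetical, and
plain Class.forName stands in for however Hive/Hadoop actually loads the codec class:

```java
/** Hypothetical helper: resolve a codec class name or fail with a clear error. */
final class CodecResolver {

    /** Hypothetical exception raised when a requested codec cannot be loaded. */
    static final class CodecLoadException extends RuntimeException {
        CodecLoadException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Returns the class for the requested codec, or null if no codec was requested.
     * A codec that was requested but cannot be loaded is an error; it is never
     * silently replaced by "no compression".
     */
    static Class<?> resolveCodec(String className) {
        if (className == null || className.isEmpty()) {
            return null; // the user did not ask for compression
        }
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException | LinkageError e) {
            // Surface the problem instead of hiding it from the user.
            throw new CodecLoadException(
                "Cannot load compression codec '" + className + "'", e);
        }
    }
}
```

With this shape, a misspelled or unavailable codec name stops the query with a message that
names the codec, while an unset option still means "no compression".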

We know lzo is better, but nowhere in the Hadoop code do we set the default to lzo, right?

What about making the default the same as "mapred.output.compression.*"? That might be a better
default since it does not change the current behavior if the user does not know about this
update.
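
The proposed default could look roughly like the sketch below. It assumes the job
configuration can be treated as a simple string map; the class and method names are
hypothetical, and a java.util.Map stands in for the real JobConf:

```java
import java.util.Map;

/** Hypothetical sketch: intermediate settings default to mapred.output.compression.*. */
final class IntermediateCompressionDefaults {
    static final String HIVE_CODEC = "hive.intermediate.compression.codec";
    static final String HIVE_TYPE = "hive.intermediate.compression.type";
    static final String MAPRED_CODEC = "mapred.output.compression.codec";
    static final String MAPRED_TYPE = "mapred.output.compression.type";

    /** Codec for intermediate data: the Hive option if set, else the mapred value. */
    static String effectiveCodec(Map<String, String> conf) {
        return firstSet(conf, HIVE_CODEC, MAPRED_CODEC);
    }

    /** Compression type for intermediate data, with the same fallback rule. */
    static String effectiveType(Map<String, String> conf) {
        return firstSet(conf, HIVE_TYPE, MAPRED_TYPE);
    }

    // Returns the primary key's value when present and non-empty, else the fallback's.
    private static String firstSet(Map<String, String> conf, String primary, String fallback) {
        String value = conf.get(primary);
        return (value != null && !value.isEmpty()) ? value : conf.get(fallback);
    }
}
```

A user who never sets the new options gets exactly the codec and type they get today, which
is the point of the suggestion.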


> add hive.intermediate.compression.codec option
> ----------------------------------------------
>
>                 Key: HIVE-759
>                 URL: https://issues.apache.org/jira/browse/HIVE-759
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Zheng Shao
>            Assignee: He Yongqiang
>         Attachments: hive-759-2009-08-17.patch, hive-759-2009-08-18.patch
>
>
> Hive uses the jobconf compression codec for all map-reduce jobs. This includes both
> mapred.map.output.compression.codec and mapred.output.compression.codec.
> In some cases, we want to distinguish between the codec used for intermediate map-reduce
> jobs (that produce intermediate data between jobs) and the final map-reduce jobs (that
> produce data stored in tables).
> For intermediate data, lzo might be a better fit because it's much faster; for final
> data, gzip might be a better fit because it saves disk space.
> We should introduce two new options:
> {code}
> hive.intermediate.compression.codec=org.apache.hadoop.io.compress.LzoCodec
> hive.intermediate.compression.type=BLOCK
> {code}
> And use these 2 options to override the mapred.output.compression.* in the FileSinkOperator
> that produces intermediate data.
> Note that it's possible that a single map-reduce job may have 2 FileSinkOperators: one
> produces intermediate data, and one produces final data. So we need to add a flag to
> fileSinkDesc for that.
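
One way to picture the per-sink flag described above is the sketch below. It is an
illustration only: FileSinkSketch is a hypothetical stand-in for fileSinkDesc, the field
names are invented, and a java.util.Map stands in for the job configuration:

```java
import java.util.Map;

/** Hypothetical stand-in for a file sink descriptor carrying an "intermediate" flag. */
final class FileSinkSketch {
    final String dirName;        // where this sink writes (illustrative only)
    final boolean intermediate;  // true if the output feeds a later map-reduce job

    FileSinkSketch(String dirName, boolean intermediate) {
        this.dirName = dirName;
        this.intermediate = intermediate;
    }

    /**
     * Picks the codec for this sink. Intermediate sinks prefer the Hive-specific
     * option; every other case uses the regular job output codec.
     */
    String codecFor(Map<String, String> jobConf) {
        if (intermediate) {
            String codec = jobConf.get("hive.intermediate.compression.codec");
            if (codec != null && !codec.isEmpty()) {
                return codec;
            }
        }
        return jobConf.get("mapred.output.compression.codec");
    }
}
```

Because the decision is made per sink rather than per job, two sinks in the same map-reduce
job can resolve to different codecs from the same configuration.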

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

