hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13340) Compress Hadoop Archive output
Date Thu, 07 Jul 2016 14:21:10 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366181#comment-15366181 ]

Jason Lowe commented on HADOOP-13340:
-------------------------------------

Yes, a splittable codec could be used to accomplish something similar, but again the splits
won't necessarily occur on file boundaries.  Depending upon the codec block size, a significant
amount of data may need to be decompressed and thrown away before arriving at the original
file data within the codec block.  (Note that a larger codec block size could compress multiple
original small files together and get a better overall compression ratio, so there's a tradeoff.)
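
To make the tradeoff concrete, here is a minimal sketch in Python using only the standard
zlib module.  It is not the har format or any Hadoop codec: the function names, the
fixed-size block layout, and the (offset, length) index are assumptions made for illustration.
It compresses a concatenation of small files in independent blocks and reports how many
bytes had to be decompressed to read one file back.

```python
import zlib


def compress_in_blocks(data: bytes, block_size: int):
    """Compress data as independent fixed-size blocks (hypothetical layout).

    Returns (blob, index) where index[i] is the (offset, length) of
    compressed block i inside blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        comp = zlib.compress(data[start:start + block_size])
        index.append((len(blob), len(comp)))
        blob += comp
    return bytes(blob), index


def read_range(blob: bytes, index, block_size: int, offset: int, length: int):
    """Read an uncompressed byte range.

    Returns (data, decompressed_bytes) so the caller can see how much
    was inflated in order to serve the request.
    """
    if length == 0:
        return b"", 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    out = bytearray()
    for i in range(first, last + 1):
        pos, size = index[i]
        # Every block touching the range must be decompressed in full.
        out += zlib.decompress(blob[pos:pos + size])
    skip = offset - first * block_size  # bytes thrown away before the file data
    return bytes(out[skip:skip + length]), len(out)
```

For example, with files of 100, 50 and 300 bytes concatenated and a 256-byte block size,
reading the 50-byte file at offset 100 decompresses a whole 256-byte block, and reading the
300-byte file decompresses both blocks (450 bytes).  Smaller blocks waste less per read but
give the compressor less data to work with.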

As I mentioned above, if the intent is to compress the original files on file boundaries when
adding them to the har, then IMHO the problem is that the original files should have been compressed
in the first place, before creating the har.  Otherwise those intending to consume the
original files will find compressed data in the har rather than the original file data, and will
need to know that a codec is required to get back to the original file contents.  If the purpose
of this request is to provide transparent compression within the har, then that will need either a
splittable codec or a codec that is reset on file boundaries, plus flags in the har (set in a backwards-compatible
manner) to indicate how compression was performed, so the resulting input stream can compensate
for how the data is laid out within the har.
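
As a rough illustration of the "reset the codec on file boundaries" option, the sketch below
(Python, standard zlib module only) compresses each file with a fresh compressor and records
a per-entry codec flag.  The index structure, the "codec" field, and the function names are
hypothetical and do not reflect the actual har index format; entries without the flag are
treated as uncompressed, which is one way such a flag could stay backwards compatible.

```python
import zlib


def build_archive(files):
    """Pack files into one part blob, compressing each file independently.

    files: dict mapping name -> bytes.
    Returns (part, index); each index entry records where the compressed
    bytes live and which codec was used (hypothetical layout).
    """
    part = bytearray()
    index = {}
    for name, content in files.items():
        comp = zlib.compress(content)  # fresh compressor state per file
        index[name] = {"offset": len(part), "length": len(comp), "codec": "zlib"}
        part += comp
    return bytes(part), index


def open_file(part: bytes, index, name: str) -> bytes:
    """Return the original contents of one file, decompressing transparently."""
    entry = index[name]
    raw = part[entry["offset"]:entry["offset"] + entry["length"]]
    codec = entry.get("codec")
    if codec is None:
        # Entry written without the flag: stored as-is.
        return raw
    if codec == "zlib":
        return zlib.decompress(raw)
    raise ValueError("unknown codec flag: %r" % codec)
```

Because each file is its own compressed unit, a reader seeks straight to the entry and
inflates only that file's bytes.  The cost is that small files are compressed in isolation,
so the overall ratio is typically worse than compressing many of them together.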

> Compress Hadoop Archive output
> ------------------------------
>
>                 Key: HADOOP-13340
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13340
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools
>    Affects Versions: 2.5.0
>            Reporter: Duc Le Tu
>              Labels: features, performance
>
> Why can't the Hadoop Archive tool compress its output like other map-reduce jobs?
> I used options like -D mapreduce.output.fileoutputformat.compress=true -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
> but it doesn't work. Did I do something wrong?
> If not, please support an option to compress the output of the Hadoop Archive tool. It's very necessary
> for data retention for everyone (small files problem and data compression).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

