hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13340) Compress Hadoop Archive output
Date Thu, 11 Jan 2018 20:10:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322869#comment-16322869 ]

Jason Lowe commented on HADOOP-13340:

Choosing which files to compress doesn't really solve the issues I brought up in my previous
comment.  Even if we choose only to compress some of the files but not all of them, unless
we choose a splittable/seekable codec and provide transparent decoding in the HarFileSystem
layer it could change the semantics of how an application accesses the data before and after
it enters the .har archive.  (e.g.: app was working just fine on uncompressed data but doesn't
gracefully handle the compressed data, especially if it isn't splittable).  That would be
adding compression to the har that is not transparent.  I suppose as long as that's clearly
documented and the user expects that behavior it could be OK.

What need to be clarified are the requirements and expectations of this feature.  Is the compression
transparent (i.e.: data appears to be exactly as it was to anyone accessing the .har archive
yet it is actually stored compressed and transparently decoded during access) or simply each
file (optionally) compressed as it is added to the archive?  The latter has a straightforward
workaround today (i.e.: simply compress the original files before archiving them).  The former
would require support in HarFileSystem but could be nice for the common use-case for .har
archives which is packing together a lot of relatively small files.  The compression could
work across file boundaries achieving a greater compression ratio than if each file were compressed
separately, with the overhead of needing to decode up to an entire codec block to access a
file's contents.
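
To make that tradeoff concrete, here is a minimal sketch in Python.  It is not HarFileSystem
code and not a proposed .har format: the pack() and read_file() helpers, the fixed 64 KB block
size and the use of zlib are all assumptions chosen for illustration.  Small files are
concatenated and compressed in blocks that span file boundaries, and reading one file back
means decoding every block that file overlaps.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed codec block size, for illustration only


def pack(files, block_size=BLOCK_SIZE):
    """Concatenate files and compress the result in fixed-size blocks.

    Returns (blocks, index), where index maps each file name to its
    (offset, length) in the uncompressed concatenated stream.
    """
    index = {}
    offset = 0
    for name, content in files.items():
        index[name] = (offset, len(content))
        offset += len(content)
    data = b"".join(files.values())
    blocks = [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]
    return blocks, index


def read_file(blocks, index, name, block_size=BLOCK_SIZE):
    """Return one file's bytes, decoding only the blocks it overlaps."""
    offset, length = index[name]
    if length == 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    raw = b"".join(zlib.decompress(blocks[i]) for i in range(first, last + 1))
    start = offset - first * block_size
    return raw[start:start + length]


# Hypothetical input: many small, similar files (the common .har use case).
files = {
    "part-%05d" % i: b'{"user": %d, "status": "ok"}\n' % i * 3
    for i in range(200)
}

blocks, index = pack(files)

# Compressing each file separately versus compressing across file boundaries.
per_file_size = sum(len(zlib.compress(c)) for c in files.values())
blocked_size = sum(len(b) for b in blocks)
```

For small, similar files like these, blocked_size comes out well below per_file_size, since
the codec can reuse redundancy across files.  The cost shows up in read_file(): even a
90-byte file requires decompressing a whole block, and a file that straddles a block boundary
requires two.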

> Compress Hadoop Archive output
> ------------------------------
>                 Key: HADOOP-13340
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13340
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: tools
>    Affects Versions: 2.5.0
>            Reporter: Duc Le Tu
>              Labels: features, performance
> Why can't the Hadoop Archive tool compress its output like other MapReduce jobs?
> I tried options such as -D mapreduce.output.fileoutputformat.compress=true -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
> but they had no effect. Did I do something wrong?
> If not, please add an option to compress the output of the Hadoop Archive tool. It is very necessary
> for data retention for everyone (small files problem and data compression).

