hadoop-common-issues mailing list archives

From "Koji Noguchi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write
Date Wed, 11 Jan 2017 22:23:17 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15819365#comment-15819365 ]

Koji Noguchi commented on HADOOP-13114:
---------------------------------------

bq. Could you please elucidate your concern if its not that?

My point is that this command won't be useful unless the compressed output is directly readable
by Hadoop jobs.
Avro, ORC, RCFile, SequenceFile, and other common file formats all have their own ways
of compressing, so simply gzip/bzip2-ing entire files won't do any good.
Worse, I don't think the patch provides a way to uncompress them again.
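
To illustrate the readability concern, here is a minimal Python sketch (not part of the patch). It assumes a reader that identifies a file by its leading magic bytes; {{sniff_format}} and the sample payload are hypothetical, and the byte string only stands in for a real SequenceFile. Wrapping the whole file in gzip replaces the format's magic with the gzip header, so such a reader no longer recognizes the file until it is uncompressed.

```python
import gzip

# Leading magic bytes used by some common Hadoop container formats.
MAGIC = {
    "SequenceFile": b"SEQ",
    "ORC": b"ORC",
    "Avro": b"Obj\x01",
}


def sniff_format(data: bytes) -> str:
    """Return the format whose magic prefix matches, or 'unknown'."""
    for name, magic in MAGIC.items():
        if data.startswith(magic):
            return name
    return "unknown"


# Stand-in for a real SequenceFile: magic, a version byte, then opaque payload.
original = b"SEQ\x06" + b"payload-bytes" * 8

# What compressing the entire file on write would store at the destination.
wrapped = gzip.compress(original)

print(sniff_format(original))                  # SequenceFile
print(sniff_format(wrapped))                   # unknown
print(sniff_format(gzip.decompress(wrapped)))  # SequenceFile
```

The data is intact, but only after an explicit uncompress step; a job pointed at the wrapped file sees a gzip stream rather than the container format it expects.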

bq.  but that means we'd make assumptions about Hadoop's use cases

And I'd say you're assuming users would call this distcp+compress only on text files.
Files in any other format would become unreadable (until uncompressed again).


I agree with Nathan on the naming. If the command is called {{dist-text-compress}}, then I'll
have no concerns.

> DistCp should have option to compress data on write
> ---------------------------------------------------
>
>                 Key: HADOOP-13114
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13114
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Suraj Nayak
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>         Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, HADOOP-13114-trunk_2016-05-08-1.patch,
>                      HADOOP-13114-trunk_2016-05-10-1.patch, HADOOP-13114-trunk_2016-05-12-1.patch,
>                      HADOOP-13114.05.patch, HADOOP-13114.06.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The DistCp utility should be able to store data in a user-specified compression format.
> This avoids the extra hop of compressing data after transfer. Backup strategies that copy to a
> different cluster also benefit by saving one I/O operation to and from HDFS, thus saving
> resources, time, and effort.
> * Create an option -compressOutput defaulting to {{org.apache.hadoop.io.compress.BZip2Codec}}.
> * Users will be able to change the codec with {{-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with the codec's default extension
> to indicate that the file is compressed, so users can tell which codec was used to compress
> the data.
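
The naming rule proposed in the last bullet can be sketched roughly as follows. This is an illustration in Python, not code from the attached patches: {{target_name}} and its parameters are hypothetical, and the table covers only the two codecs named in the description, using the default extensions those codecs report ({{.bz2}} and {{.gz}}).

```python
# Default file extensions of the two codecs mentioned in the issue description.
CODEC_EXTENSIONS = {
    "org.apache.hadoop.io.compress.BZip2Codec": ".bz2",
    "org.apache.hadoop.io.compress.GzipCodec": ".gz",
}

# The description proposes BZip2Codec as the default for -compressOutput.
DEFAULT_CODEC = "org.apache.hadoop.io.compress.BZip2Codec"


def target_name(source_name: str,
                compress_output: bool = False,
                codec: str = DEFAULT_CODEC) -> str:
    """Return the destination file name under the proposed suffix rule.

    Hypothetical helper: without compression the name is unchanged;
    with compression the codec's default extension is appended.
    """
    if not compress_output:
        return source_name
    return source_name + CODEC_EXTENSIONS[codec]


print(target_name("part-00000"))                        # part-00000
print(target_name("part-00000", compress_output=True))  # part-00000.bz2
print(target_name("part-00000", compress_output=True,
                  codec="org.apache.hadoop.io.compress.GzipCodec"))  # part-00000.gz
```

Note that the suffix is appended even when the source is already a container file, so under this rule a SequenceFile named {{data.seq}} would arrive as {{data.seq.bz2}}.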




