From "Ravi Prakash (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13114) DistCp should have option to compress data on write
Date Tue, 10 Jan 2017 22:16:59 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816356#comment-15816356 ]

Ravi Prakash commented on HADOOP-13114:
---------------------------------------

Thanks Koji! I was under the impression that even binary files could be compressed quite well.
For example, if I compress /usr/bin/xsane (a binary file):
{code}
[raviprak@ravi ~]$ ls -alh xsane.gz 
-rwxr-xr-x 1 raviprak raviprak 298K Jan 10 11:06 xsane.gz
[raviprak@ravi ~]$ ls -alh /usr/bin/xsane
-rwxr-xr-x 1 root root 744K Feb  5  2016 /usr/bin/xsane
{code}
The question is how many "binary" files we expect to be on HDFS, but answering that means making
assumptions about Hadoop's use cases, and I'm not sure I want to hazard that. I'm sorry if
I've misunderstood you. Could you please elucidate your concern if it's not that?

Thanks Nathan! I am ambivalent about this myself. Ideally we'd want to compress during transit
(like {{rsync -z}}), but this JIRA was split out of that desire (from HADOOP-8065). For a
variety of reasons HADOOP-8065 has been requested by a lot of _our_ customers (in addition
to the Hadoop users you can see in the voters and watchers list). Also, a few first-time contributors
went above and beyond on this JIRA.

bq. What happens if we run the command with compression twice? distcp a->b, then b->c?
I'm assuming c is a compressed version of b which is a compressed version of a. In order to
read we'd have to unwind both layers of compression. Seems strange and really easy to accidentally
have this happen.
You are right that compressed files would be nested, one inside the other. Compression tools
would do similar nesting, wouldn't they? So I'm not sure it can be helped. And if I had checked
the compression status and skipped re-compressing, I'm sure someone would pipe up and say that
I should have nested ;-) Perhaps yet another flag?
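To make the nesting concrete, here's a minimal sketch with plain gzip (not distcp itself); it just shows that compressing twice stacks two layers that both have to be unwound on read:
{code}
[raviprak@ravi ~]$ echo "hello" > a.txt
[raviprak@ravi ~]$ gzip -c a.txt > b.gz      # first "copy", compressed
[raviprak@ravi ~]$ gzip -c b.gz > c.gz.gz    # second copy compresses the already-compressed file
[raviprak@ravi ~]$ zcat c.gz.gz | zcat       # two decompression passes to recover the original
hello
{code}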

bq. Obvious question is: "if it's valuable to compress, why wasn't it compressed in the first
place?"
In my experience, sometimes the source Hadoop cluster is not under the copier's control,
or has far more capacity (so compression there is not a concern). Sometimes the source
is written by IoT devices into a staging area, and rather than run a separate job that compresses
the data, it'd be helpful to combine the copy with the compression.

bq. Just the name bothers me a bit. copy commands don't normally transform data, but this
one would.
That said, I do find this argument particularly compelling. I am not sure it would be
breaking precedent, considering there is {{-append}}, which is not exactly a "copy"
either, but I do agree with your concern.

For now I will stop work on this JIRA unless I hear from a few more diverse viewpoints.

> DistCp should have option to compress data on write
> ---------------------------------------------------
>
>                 Key: HADOOP-13114
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13114
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Suraj Nayak
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>         Attachments: HADOOP-13114-trunk_2016-05-07-1.patch, HADOOP-13114-trunk_2016-05-08-1.patch,
> HADOOP-13114-trunk_2016-05-10-1.patch, HADOOP-13114-trunk_2016-05-12-1.patch, HADOOP-13114.05.patch,
> HADOOP-13114.06.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The DistCp utility should have the capability to store data in a user-specified compression format.
> This avoids one hop of compressing data after the transfer. Backup strategies that copy to a
> different cluster also benefit from saving one I/O operation to and from HDFS, thus saving
> resources, time, and effort.
> * Create an option {{-compressOutput}} defaulting to {{org.apache.hadoop.io.compress.BZip2Codec}}.
> * Users will be able to change the codec with {{-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec}}
> * If distcp compression is enabled, suffix the filenames with the codec's default extension
> to indicate that the file is compressed, so users can tell which codec was used to compress
> the data.
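
For illustration, a hypothetical invocation under this proposal could look like the following (the {{-compressOutput}} flag exists only in the attached patches, so the exact syntax may differ; the paths are placeholders):
{code}
# Hypothetical usage; -compressOutput is proposed in this JIRA's patches, not in released DistCp.
hadoop distcp \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  -compressOutput \
  hdfs://source-cluster/staging hdfs://backup-cluster/archive
# With GzipCodec, copied files would be suffixed with .gz to indicate the codec used.
{code}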



