hadoop-common-issues mailing list archives

From "Yongjun Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8065) distcp should have an option to compress data while copying.
Date Tue, 17 Jan 2017 06:44:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15825550#comment-15825550 ]

Yongjun Zhang commented on HADOOP-8065:
---------------------------------------

Hi Guys,

Thanks for the work on HADOOP-13114, which I just commented on.

About HADOOP-8065,

{quote}
We would like to compress the data while transferring it from our source system to the target system.
One way to do this is to write a map/reduce job that compresses the data before or after the transfer.
This looks inefficient.
Since distcp is already reading and writing the data, it would be better if it could compress
the data while doing so.
{quote}

Compressing data during the transfer means we need to skip the checksum comparison during
the transfer. Since multiple source blocks may be compressed into a single target block, the
checksum can only be verified after decompressing the data. However, because block sizes can
vary, even that comparison could be error-prone.
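
As a rough illustration of verifying after decompression, here is a minimal sketch that uses only the JDK. The class and method names are hypothetical, gzip stands in for whichever codec would be used, and CRC32 over the whole stream stands in for whatever checksum a real copy would compare; none of this is taken from the attached patches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: checks a compressed copy against its uncompressed source.
final class CompressedCopyCheck {

    // CRC32 over the full stream content, independent of any block layout.
    static long crcOf(InputStream in) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            crc.update(buf, 0, n);
        }
        return crc.getValue();
    }

    // Stand-in for the compressed target produced by the copy.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Decompresses the target and compares its checksum with the source's.
    static boolean matchesSource(byte[] source, byte[] compressedTarget) {
        try (InputStream plain = new ByteArrayInputStream(source);
             InputStream unpacked =
                 new GZIPInputStream(new ByteArrayInputStream(compressedTarget))) {
            return crcOf(plain) == crcOf(unpacked);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the checksum here covers the decompressed byte stream as a whole, it does not depend on how the data was split into blocks on either side. The cost is a full extra read of both source and target.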

We could possibly implement something like DFSOutputStreamWithCompression, which compresses
input data before writing it out and could be used not only by distcp for this jira, but
also by other tools.
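
The idea of a compressing output stream can be sketched with the JDK alone. This is only an illustration under assumptions: `CompressingCopy` and its methods are hypothetical names, and `GZIPOutputStream` stands in for the proposed DFSOutputStreamWithCompression, which does not exist yet. In Hadoop itself, the stream returned by `CompressionCodec.createOutputStream(OutputStream)` would play a similar wrapping role.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: any tool that writes to an OutputStream can get
// compression by wrapping the destination stream before copying.
final class CompressingCopy {

    // Copies everything from in to out, compressing on the way.
    // Returns the number of uncompressed bytes read. Closes out when done.
    static long copyCompressed(InputStream in, OutputStream out) {
        long rawBytes = 0;
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                gz.write(buf, 0, n);
                rawBytes += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rawBytes;
    }

    // Reads a gzip-compressed byte array back into its original bytes.
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                sink.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

Keeping the compression inside the stream means the copy loop itself does not change, which is why such a wrapper could be reused by tools other than distcp.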

Thanks.


> distcp should have an option to compress data while copying.
> ------------------------------------------------------------
>
>                 Key: HADOOP-8065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8065
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 0.20.2
>            Reporter: Suresh Antony
>            Assignee: Suraj Nayak
>            Priority: Minor
>              Labels: distcp
>             Fix For: 0.20.2
>
>         Attachments: HADOOP-8065.005.patch, HADOOP-8065.006.patch, HADOOP-8065-trunk_2015-11-03.patch, HADOOP-8065-trunk_2015-11-04.patch, HADOOP-8065-trunk_2016-04-29-4.patch, patch.distcp.2012-02-10
>
>
> We would like to compress the data while transferring it from our source system to the target system. One way to do this is to write a map/reduce job that compresses the data before or after the transfer. This looks inefficient.
> Since distcp is already reading and writing the data, it would be better if it could compress the data while doing so.
> The flip side is that the distcp -update option cannot check the file size before copying data. It can only check for the existence of the file.
> So I propose that if the -compress option is given, the file size is not checked.
> Also, when we copy a file, the appropriate extension needs to be added to the file name depending on the compression type.
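
The two rules proposed in the description (existence-only check under -compress, and an extension chosen by compression type) could look roughly like the sketch below. Everything here is hypothetical: the class, the method names, the codec labels, and the simplified size-only comparison for the uncompressed case are assumptions for illustration, not distcp's actual logic.

```java
// Hypothetical sketch of the proposed -update behaviour with -compress.
final class CompressedTargetNaming {

    // Appends the conventional extension for the chosen compression type.
    static String targetName(String sourceName, String codec) {
        switch (codec) {
            case "gzip":
                return sourceName + ".gz";
            case "bzip2":
                return sourceName + ".bz2";
            default:
                throw new IllegalArgumentException("unknown codec: " + codec);
        }
    }

    // Decides whether an update run can skip copying a file.
    // With compression, source and target sizes are not comparable,
    // so only the existence of the target is considered.
    static boolean shouldSkip(boolean targetExists, long sourceLen,
                              long targetLen, boolean compress) {
        if (!targetExists) {
            return false;
        }
        return compress || sourceLen == targetLen;
    }
}
```

One consequence of the existence-only rule is that a source file modified after an earlier compressed copy would not be copied again.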



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

