hadoop-mapreduce-issues mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1231) Distcp is very slow
Date Fri, 27 Nov 2009 04:45:40 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783023#action_12783023 ]

Hemanth Yamijala commented on MAPREDUCE-1231:

I looked at the Yahoo! Hadoop 0.20 patch. One minor nit: the internal config option name
differs between this patch and the trunk patch. In the trunk patch, the option is distcp.skip.crc.check.
In the internal patch, it is distcp.skip.crc. Since this is a jobconf option, it would be better
to keep the two in sync. At the very least, that avoids confusion when Hadoop is upgraded to the
trunk version.
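
To make the naming concern concrete, here is a minimal sketch. It is not code from either patch: a plain Map stands in for the jobconf, and the class and method names are hypothetical. Only the two option names come from the patches. It shows that a job which sets the 0.20 name would be silently ignored by code that reads only the trunk name, unless a fallback lookup is added.

```java
import java.util.HashMap;
import java.util.Map;

public class SkipCrcOption {
    // Option names as quoted in this comment.
    static final String TRUNK_KEY = "distcp.skip.crc.check";
    static final String Y20_KEY = "distcp.skip.crc";

    // Hypothetical: reads only the trunk name, as trunk code presumably would.
    static boolean skipCrcTrunkOnly(Map<String, String> conf) {
        // parseBoolean(null) is false, so an unset option means "do not skip".
        return Boolean.parseBoolean(conf.get(TRUNK_KEY));
    }

    // Hypothetical: prefers the trunk name, falls back to the 0.20 name.
    static boolean skipCrcWithFallback(Map<String, String> conf) {
        String value = conf.get(TRUNK_KEY);
        if (value == null) {
            value = conf.get(Y20_KEY);
        }
        return Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        // A job configured against the internal 0.20 patch.
        Map<String, String> conf = new HashMap<>();
        conf.put(Y20_KEY, "true");

        System.out.println("trunk-only lookup: " + skipCrcTrunkOnly(conf));
        System.out.println("fallback lookup:   " + skipCrcWithFallback(conf));
    }
}
```

Using one name in both patches would make the fallback unnecessary, which is the simpler option.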

Other than this, the 20 patch looks good.

Another point (unrelated to this JIRA) is that post-copy validation seems to be done
differently in trunk and 20. In trunk, it is done by a call to the API sameFile(), so it
includes CRC checks by default. In the internal 20 patch, the check compares only file
lengths, irrespective of the option to skip CRC checks. It is unclear whether this
is by design. At any rate, this inconsistency is not related to this patch.
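
The difference between the two validation styles can be illustrated with a small sketch. This is an assumption-laden stand-in, not the DistCp code: byte arrays replace files, java.util.zip.CRC32 replaces the filesystem checksum, and the class and method names are hypothetical. It does not reproduce the real sameFile() signature.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class PostCopyCheck {
    // Stand-in for a file checksum; real DistCp would ask the filesystem.
    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    // 0.20-style check as described above: lengths only.
    static boolean lengthOnlyMatch(byte[] src, byte[] dst) {
        return src.length == dst.length;
    }

    // Trunk-style check as described above: lengths, then CRC unless skipped.
    static boolean lengthAndCrcMatch(byte[] src, byte[] dst, boolean skipCrc) {
        if (src.length != dst.length) {
            return false;
        }
        return skipCrc || crc(src) == crc(dst);
    }

    public static void main(String[] args) {
        // Same length, one byte differs: a corrupted copy.
        byte[] src = "hadoop".getBytes(StandardCharsets.US_ASCII);
        byte[] dst = "hadoip".getBytes(StandardCharsets.US_ASCII);

        System.out.println("length only:    " + lengthOnlyMatch(src, dst));
        System.out.println("length and CRC: " + lengthAndCrcMatch(src, dst, false));
    }
}
```

In this sketch a same-length corrupted copy passes the length-only check and fails the length-and-CRC check, which is the gap between the two validation paths described above.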

> Distcp is very slow
> -------------------
>                 Key: MAPREDUCE-1231
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1231
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distcp
>    Affects Versions: 0.20.1
>            Reporter: Jothi Padmanabhan
>            Assignee: Jothi Padmanabhan
>         Attachments: mapred-1231-v1.patch, mapred-1231-v2.patch, mapred-1231-v3.patch,
> mapred-1231-v3.patch, mapred-1231-y20-v2.patch, mapred-1231-y20-v3.patch, mapred-1231-y20.patch,
>
> Currently distcp does a checksums check in addition to file length check to decide if
> a remote file has to be copied. If the number of files is high (thousands), this checksum
> check is proving to be fairly costly leading to a long time before the copy is started.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
