hadoop-mapreduce-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5065) DistCp should skip checksum comparisons if block-sizes are different on source/target.
Date Fri, 15 Mar 2013 16:38:14 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603493#comment-13603493 ]

Kihwal Lee commented on MAPREDUCE-5065:

bq. Another option might be to implement a checksum that's blocksize-independent...

Reading the whole metadata may be too much, especially for huge files. It would be better to
make the computation happen where the data is. :)
Most hash functions are incremental, so if DFSClient feeds the hash state from one datanode into the next
and lets it continue updating that state, the result will be independent of block size. The current
way of computing a file checksum allows individual block checksums to be calculated in parallel, but
we are not taking advantage of that in DFSClient anyway. So I don't think there will be any
significant change in performance or overhead.
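
To make the block-size argument concrete, here is a minimal single-process sketch. It is a simplified stand-in, not HDFS's actual checksum code: the class and method names are hypothetical, plain MD5 is assumed for both variants, and passing hash state between datanodes is simulated by reusing one MessageDigest object. The first method hashes each block separately and then hashes the block digests, so its result depends on where the block boundaries fall. The second carries one running hash across blocks, so the boundaries do not matter.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class BlockSizeChecksumDemo {

    // Block-size-dependent: digest every block on its own, then digest
    // the sequence of block digests. Changing blockSize changes the result.
    static byte[] digestOfBlockDigests(byte[] data, int blockSize)
            throws NoSuchAlgorithmException {
        MessageDigest outer = MessageDigest.getInstance("MD5");
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            MessageDigest inner = MessageDigest.getInstance("MD5");
            inner.update(data, off, len);
            outer.update(inner.digest());
        }
        return outer.digest();
    }

    // Block-size-independent: a single running hash state is updated block
    // by block. In a real cluster this state would have to be handed from
    // one datanode to the next; here one object stands in for that hand-off.
    static byte[] chainedDigest(byte[] data, int blockSize)
            throws NoSuchAlgorithmException {
        MessageDigest running = MessageDigest.getInstance("MD5");
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            running.update(data, off, len);
        }
        return running.digest();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] data = new byte[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31);
        }
        // Same contents, two different "block sizes".
        System.out.println("digest-of-digests equal: " + Arrays.equals(
                digestOfBlockDigests(data, 1024), digestOfBlockDigests(data, 4096)));
        System.out.println("chained digest equal:    " + Arrays.equals(
                chainedDigest(data, 1024), chainedDigest(data, 4096)));
    }
}
```

The trade-off mentioned above shows up directly: the first variant's inner digests could be computed in parallel, while the chained variant is inherently sequential.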

We should probably continue this discussion in a separate jira.
> DistCp should skip checksum comparisons if block-sizes are different on source/target.
> --------------------------------------------------------------------------------------
>                 Key: MAPREDUCE-5065
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5065
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distcp
>    Affects Versions: 2.0.3-alpha, 0.23.5
>            Reporter: Mithun Radhakrishnan
>            Assignee: Mithun Radhakrishnan
> When copying files between 2 clusters with different default block-sizes, one sees that the copy fails with a checksum-mismatch, even though the files have identical contents.
> The reason is that on HDFS, a file's checksum is unfortunately a function of the block-size of the file. So you could have 2 different files with identical contents (but different block-sizes) have different checksums. (Thus, it's also possible for DistCp to fail to copy files on the same file-system, if the source-file's block-size differs from HDFS default, and -pb isn't
> I propose that we skip checksum comparisons under the following conditions:
> 1. -skipCrc is specified.
> 2. File-size is 0 (in which case the call to the checksum-servlet is moot).
> 3. source.getBlockSize() != target.getBlockSize(), since the checksums are guaranteed to differ in this case.
> I have a patch for #3.
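
The three proposed conditions can be sketched as a single predicate. This is only an illustration, not the attached patch: the class, method, and parameter names are hypothetical, and plain values stand in for whatever DistCp reads from its options and from the source and target file statuses.

```java
public class ChecksumSkipPolicy {

    // Returns true when the checksum comparison should be skipped:
    //   1. the user asked to skip it (the -skipCrc case),
    //   2. the file is empty, so there is nothing to compare,
    //   3. the block sizes differ, so the checksums are expected to differ.
    static boolean shouldSkipChecksum(boolean skipCrc,
                                      long fileLength,
                                      long sourceBlockSize,
                                      long targetBlockSize) {
        return skipCrc
                || fileLength == 0
                || sourceBlockSize != targetBlockSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Same block size, non-empty file, no flag: compare checksums.
        System.out.println(shouldSkipChecksum(false, 500, 128 * mb, 128 * mb));
        // Differing block sizes: skip the comparison.
        System.out.println(shouldSkipChecksum(false, 500, 64 * mb, 128 * mb));
    }
}
```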

