hadoop-hdfs-issues mailing list archives

From "Marcelo Vanzin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3889) distcp overwrites files even when there are missing checksums
Date Thu, 06 Sep 2012 20:51:09 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450031#comment-13450031 ]

Marcelo Vanzin commented on HDFS-3889:
--------------------------------------

bq. What if the source and destination clusters have different checksum types, or one of the
checksums is missing?

That means you can't reliably tell whether the two files are equal, so the code should
fall back to the safe path: assume they are not equal and perform the copy. Manually computing
the checksums (by reading both the source and destination files) would cost about the same as
simply copying the file, so the fallback should be fine performance-wise.

"-update" is an optimization to avoid copying redundant data. Nothing will break if you just
overwrite the target data with the source; it will just be slower than if the checksum
comparison were possible.
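
To make the fallback concrete, here is a minimal sketch of the decision described above. It is
not the distcp code: the class UpdateSkipPolicy, the Checksum holder, and the canSkipCopy method
are hypothetical names used only for illustration, and the algorithm labels are placeholders.
The idea is that a copy is skipped only when equality can actually be proven.

```java
import java.util.Arrays;

class UpdateSkipPolicy {

    /** Hypothetical stand-in for a file checksum: an algorithm label plus raw bytes. */
    static final class Checksum {
        final String algorithm;
        final byte[] bytes;

        Checksum(String algorithm, byte[] bytes) {
            this.algorithm = algorithm;
            this.bytes = bytes.clone();
        }
    }

    /**
     * Returns true only when both checksums are present, use the same
     * algorithm, and carry identical bytes. Every other case returns false,
     * meaning "assume the files differ and copy".
     */
    static boolean canSkipCopy(Checksum source, Checksum target) {
        if (source == null || target == null) {
            return false; // a checksum is missing: equality cannot be proven
        }
        if (!source.algorithm.equals(target.algorithm)) {
            return false; // different checksum types are not comparable
        }
        return Arrays.equals(source.bytes, target.bytes);
    }

    public static void main(String[] args) {
        Checksum src = new Checksum("CRC32C", new byte[] {1, 2, 3});
        Checksum same = new Checksum("CRC32C", new byte[] {1, 2, 3});
        Checksum otherType = new Checksum("CRC32", new byte[] {1, 2, 3});

        System.out.println("identical checksums: " + canSkipCopy(src, same));
        System.out.println("missing target:      " + canSkipCopy(src, null));
        System.out.println("different types:     " + canSkipCopy(src, otherType));
    }
}
```

With this policy the worst case is a redundant copy, never a stale target left in place.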
                
> distcp overwrites files even when there are missing checksums
> -------------------------------------------------------------
>
>                 Key: HDFS-3889
>                 URL: https://issues.apache.org/jira/browse/HDFS-3889
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>    Affects Versions: 2.2.0-alpha
>            Reporter: Colin Patrick McCabe
>            Priority: Minor
>
> If distcp can't read the checksum files for the source and destination files-- for any
> reason-- it ignores the checksums and overwrites the destination file.  It does produce a
> log message, but I think the correct behavior would be to throw an error and stop the distcp.
> If the user really wants to ignore checksums, he or she can use {{-skipcrccheck}} to
> do so.
> The relevant code is in DistCpUtils#checksumsAreEqual:
> {code}
>     try {
>       sourceChecksum = sourceFS.getFileChecksum(source);
>       targetChecksum = targetFS.getFileChecksum(target);
>     } catch (IOException e) {
>       LOG.error("Unable to retrieve checksum for " + source + " or " + target, e);
>     }
> {code}
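
For contrast, the stricter behavior proposed in the issue description could look roughly like
the sketch below. This is an illustration under stated assumptions, not a patch: the class
StrictChecksumCheck, the ChecksumLookup interface, and the skipCrcCheck parameter are
hypothetical, checksums are modeled as plain strings, and treating a null checksum as an error
is an assumption of this sketch. The point is only that the IOException propagates instead of
being logged and swallowed, unless the caller explicitly opted out.

```java
import java.io.IOException;

class StrictChecksumCheck {

    /** Hypothetical lookup that may fail, standing in for a checksum fetch. */
    interface ChecksumLookup {
        String fetch() throws IOException;
    }

    /**
     * Compares two checksums. A lookup failure propagates to the caller and
     * stops the copy, unless skipCrcCheck is set, in which case no lookup is
     * attempted and the files are treated as matching on checksum.
     */
    static boolean checksumsMatch(ChecksumLookup source,
                                  ChecksumLookup target,
                                  boolean skipCrcCheck) throws IOException {
        if (skipCrcCheck) {
            return true; // caller explicitly chose to ignore checksums
        }
        String sourceChecksum = source.fetch(); // IOException is not caught here
        String targetChecksum = target.fetch();
        if (sourceChecksum == null || targetChecksum == null) {
            // Assumption of this sketch: an unavailable checksum is an error.
            throw new IOException("Checksum unavailable for source or target");
        }
        return sourceChecksum.equals(targetChecksum);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(checksumsMatch(() -> "abc", () -> "abc", false));
        try {
            checksumsMatch(() -> { throw new IOException("lookup failed"); },
                           () -> "abc", false);
        } catch (IOException e) {
            System.out.println("stopped: " + e.getMessage());
        }
    }
}
```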

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
