hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3889) distcp overwrites files even when there are missing checksums
Date Thu, 06 Sep 2012 21:34:08 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450066#comment-13450066 ]

Colin Patrick McCabe commented on HDFS-3889:

bq. If the goal is to just provide the same functionality as rsync, then sure. Although I
consider those less reliable (or just as bad) as file size alone. They require the metadata
to be kept in sync between source and destination, something that I don't think is very common
for mod time or access time, for example.

I believe that the modification time is set based on the NN, not the clients.  So nothing
needs to be kept in sync.  It's true that time can sometimes go backwards on the NN (due to
server misconfiguration, NTP, or other things) but it's not exactly common.

Still, I could go either way on this point.  It's nice to know that you're doing the safe
thing, and refusing to skip the pre-copy checksum is definitely the safe thing.

Also, we currently aren't doing as much checking as we should.  We don't consider the mtime,
owner, group, etc. at the moment.  This makes skipping the checksum a lot less safe than
it needs to be.
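
To make that last point concrete, here is a rough sketch of what a stricter "can we skip this file?" check might look like.  This is not DistCp code: {{FileMeta}} and {{SkipCheck.canSkipCopy}} are hypothetical names, and the fields only stand in for what a Hadoop {{FileStatus}} exposes (length, modification time, owner, group).

```java
import java.util.Objects;

/** Hypothetical stand-in for the cheap per-file metadata (not a Hadoop type). */
final class FileMeta {
    final long length;
    final long mtimeMillis;
    final String owner;
    final String group;

    FileMeta(long length, long mtimeMillis, String owner, String group) {
        this.length = length;
        this.mtimeMillis = mtimeMillis;
        this.owner = owner;
        this.group = group;
    }
}

/** Hypothetical helper: decides whether a copy may be skipped without a checksum. */
final class SkipCheck {
    private SkipCheck() {}

    /**
     * Returns true only when length, mtime, owner and group all match.
     * Assumption: the earlier copy preserved mtime, owner and group on the
     * target; otherwise this check will never allow a skip.
     */
    static boolean canSkipCopy(FileMeta source, FileMeta target) {
        return source.length == target.length
            && source.mtimeMillis == target.mtimeMillis
            && Objects.equals(source.owner, target.owner)
            && Objects.equals(source.group, target.group);
    }
}
```

A check like this is still weaker than comparing checksums.  It only narrows the window in which two different files would be treated as identical.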
> distcp overwrites files even when there are missing checksums
> -------------------------------------------------------------
>                 Key: HDFS-3889
>                 URL: https://issues.apache.org/jira/browse/HDFS-3889
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>    Affects Versions: 2.2.0-alpha
>            Reporter: Colin Patrick McCabe
>            Priority: Minor
> If distcp can't read the checksum files for the source and destination files-- for any
> reason-- it ignores the checksums and overwrites the destination file.  It does produce a
> log message, but I think the correct behavior would be to throw an error and stop the distcp.
> If the user really wants to ignore checksums, he or she can use {{-skipcrccheck}} to
> do so.
> The relevant code is in DistCpUtils#checksumsAreEqual:
> {code}
>     try {
>       sourceChecksum = sourceFS.getFileChecksum(source);
>       targetChecksum = targetFS.getFileChecksum(target);
>     } catch (IOException e) {
>       LOG.error("Unable to retrieve checksum for " + source + " or " + target, e);
>     }
> {code}
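
The behavior proposed in the description above (fail instead of logging and carrying on) might look roughly like the following.  This is a self-contained sketch, not a patch: {{ChecksumReader}} is a hypothetical stand-in for {{FileSystem#getFileChecksum}}, and {{StrictChecksumCompare}} and the {{skipCrcCheck}} parameter are illustrative names only.

```java
import java.io.IOException;
import java.util.Arrays;

/** Hypothetical stand-in for a filesystem checksum lookup (not a Hadoop type). */
interface ChecksumReader {
    byte[] read(String path) throws IOException;
}

/** Hypothetical helper showing the fail-fast behavior. */
final class StrictChecksumCompare {
    private StrictChecksumCompare() {}

    /**
     * Compares checksums and propagates any read failure to the caller
     * instead of swallowing it. Only an explicit opt-out skips the check.
     */
    static boolean checksumsMatch(ChecksumReader sourceFs, String source,
                                  ChecksumReader targetFs, String target,
                                  boolean skipCrcCheck) throws IOException {
        if (skipCrcCheck) {
            // The user asked to ignore checksums, so treat the files as matching.
            return true;
        }
        byte[] sourceChecksum;
        byte[] targetChecksum;
        try {
            sourceChecksum = sourceFs.read(source);
            targetChecksum = targetFs.read(target);
        } catch (IOException e) {
            // Rethrow with context rather than logging and continuing.
            throw new IOException(
                "Unable to retrieve checksum for " + source + " or " + target, e);
        }
        return Arrays.equals(sourceChecksum, targetChecksum);
    }
}
```

With this shape, a missing checksum stops the copy unless the caller has explicitly opted out, which is the distinction the description draws with {{-skipcrccheck}}.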

