Date: Fri, 7 Sep 2012 07:45:10 +1100 (NCT)
From: "Colin Patrick McCabe (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Message-ID: <1529007410.46560.1346964310445.JavaMail.jiratomcat@arcas>
In-Reply-To: <1487364189.36798.1346807467945.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (HDFS-3889) distcp overwrites files even when there are missing checksums

    [ https://issues.apache.org/jira/browse/HDFS-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450024#comment-13450024 ]

Colin Patrick McCabe commented on HDFS-3889:
--------------------------------------------

Let's call case #1 the "pre-copy check," and case #2 the "post-copy check." The problem with trying to force everyone to do the pre-copy check unconditionally is that not everyone can do it efficiently.
What if the source and destination clusters have different checksum types, or one of the checksums is missing? You have to fall back on a slow strategy of computing your own checksum on one or both sides.

> distcp overwrites files even when there are missing checksums
> -------------------------------------------------------------
>
>                 Key: HDFS-3889
>                 URL: https://issues.apache.org/jira/browse/HDFS-3889
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>    Affects Versions: 2.2.0-alpha
>            Reporter: Colin Patrick McCabe
>            Priority: Minor
>
> If distcp can't read the checksum files for the source and destination files-- for any reason-- it ignores the checksums and overwrites the destination file. It does produce a log message, but I think the correct behavior would be to throw an error and stop the distcp.
> If the user really wants to ignore checksums, he or she can use {{-skipcrccheck}} to do so.
> The relevant code is in DistCpUtils#checksumsAreEquals:
> {code}
>     try {
>       sourceChecksum = sourceFS.getFileChecksum(source);
>       targetChecksum = targetFS.getFileChecksum(target);
>     } catch (IOException e) {
>       LOG.error("Unable to retrieve checksum for " + source + " or " + target, e);
>     }
> {code}
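
For illustration only, here is a minimal sketch of the fail-fast behavior the issue description asks for: if either checksum cannot be retrieved, the comparison throws instead of logging and carrying on. This is not the DistCp code and not a proposed patch. The ChecksumSource interface is a hypothetical stand-in for FileSystem#getFileChecksum so that the example compiles without Hadoop on the classpath, and treating a null return as "checksum missing" is an assumption of the sketch.

```java
import java.io.IOException;
import java.util.Arrays;

final class StrictChecksumCheck {

    /**
     * Hypothetical stand-in for FileSystem#getFileChecksum.
     * Assumption of this sketch: null means "no checksum available".
     */
    interface ChecksumSource {
        byte[] getFileChecksum(String path) throws IOException;
    }

    /**
     * Compares the two checksums, but refuses to answer when either one
     * cannot be obtained, so the caller never overwrites on a failed lookup.
     */
    static boolean checksumsAreEqual(ChecksumSource sourceFS, String source,
                                     ChecksumSource targetFS, String target)
            throws IOException {
        byte[] sourceChecksum;
        byte[] targetChecksum;
        try {
            sourceChecksum = sourceFS.getFileChecksum(source);
            targetChecksum = targetFS.getFileChecksum(target);
        } catch (IOException e) {
            // Propagate instead of only logging, so the copy stops here.
            throw new IOException(
                "Unable to retrieve checksum for " + source + " or " + target, e);
        }
        if (sourceChecksum == null || targetChecksum == null) {
            throw new IOException(
                "Missing checksum for " + source + " or " + target
                + "; use -skipcrccheck to ignore checksums");
        }
        return Arrays.equals(sourceChecksum, targetChecksum);
    }

    public static void main(String[] args) throws IOException {
        // Two in-memory "file systems" that report the same checksum bytes.
        ChecksumSource src = path -> new byte[] {1, 2, 3};
        ChecksumSource dst = path -> new byte[] {1, 2, 3};
        System.out.println("checksums match: "
            + checksumsAreEqual(src, "/a/file", dst, "/b/file"));
    }
}
```

A caller that really does want to ignore checksums would skip this call entirely (the {{-skipcrccheck}} case) rather than catch and discard the exception. The sketch does not address the slow fallback of computing a checksum yourself when the two clusters use different checksum types.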