Date: Tue, 15 Dec 2015 19:18:47 +0000 (UTC)
From: "Rushabh S Shah (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Created] (HDFS-9558) Replication requests always blames the source datanode in case of Checksum Exception.

Rushabh S Shah created HDFS-9558:
------------------------------------

             Summary: Replication requests always blames the source datanode in case of Checksum Exception.
                 Key: HDFS-9558
                 URL: https://issues.apache.org/jira/browse/HDFS-9558
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
            Reporter: Rushabh S Shah

Replication requests from a datanode (in the case of a rack failure event) always blame the source datanode if any of the downstream nodes encounters a ChecksumException.
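To make the misattribution concrete, here is a minimal, self-contained simulation of a transfer pipeline. It is only a sketch and not HDFS code: the class `PipelineBlameSketch`, its `Report` type, and the `transfer` method are hypothetical names invented for illustration, and `java.util.zip.CRC32` over the whole block stands in for the chunked checksums a datanode would actually use. The "always blame the first node" rule is taken from the behaviour described in this report.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical simulation of the scenario in this report; none of these names are HDFS APIs.
class PipelineBlameSketch {

    /** What ends up in a corruption report when a receiver sees a checksum mismatch. */
    static final class Report {
        final String detectedBy;      // node that saw the mismatch
        final String immediateSender; // node the bytes were actually received from
        final String blamedNode;      // node named in the report sent to the namenode

        Report(String detectedBy, String immediateSender, String blamedNode) {
            this.detectedBy = detectedBy;
            this.immediateSender = immediateSender;
            this.blamedNode = blamedNode;
        }
    }

    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /**
     * Pushes a non-empty block through nodes[0] -> nodes[1] -> ... in order.
     * corruptsOutgoing[i] models node i sending damaged bytes (e.g. a faulty interface).
     * Returns a report from the first receiver that detects a mismatch, or null if none does.
     */
    static Report transfer(String[] nodes, boolean[] corruptsOutgoing, byte[] block) {
        long expected = crc32(block);   // checksum computed at the source travels with the data
        byte[] inFlight = block.clone();
        for (int i = 0; i + 1 < nodes.length; i++) {
            if (corruptsOutgoing[i]) {
                inFlight = inFlight.clone();
                inFlight[0] ^= 0x01;    // flip a single bit on the way out of node i
            }
            if (crc32(inFlight) != expected) {
                // Behaviour described in this report: the source (nodes[0]) is always blamed,
                // even though the bytes arrived from nodes[i].
                return new Report(nodes[i + 1], nodes[i], nodes[0]);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Report r = transfer(
                new String[] {"dnA", "dnB", "dnC"},
                new boolean[] {false, true, false},   // only dnB's outgoing data is corrupted
                "block-bytes".getBytes(StandardCharsets.UTF_8));
        System.out.println(r.detectedBy + " detected corruption received from "
                + r.immediateSender + ", but reported " + r.blamedNode);
        // prints: dnC detected corruption received from dnB, but reported dnA
    }
}
```

In this sketch the replica on dnA is intact, yet it is the one named in the report, which matters most when dnA holds the only remaining copy of the block.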

We saw this case recently in our cluster. We lost 7 nodes in a rack. There was only one replica of the block (say on dnA). The namenode asked dnA to replicate the block to dnB and dnC.
{noformat}
2015-12-13 21:09:41,798 [DataNode:   heartbeating to NN:8020] INFO datanode.DataNode: DatanodeRegistration(dnA, datanodeUuid=bc1f183d-b74a-49c9-ab1a-d1d496ab77e9, infoPort=1006, infoSecurePort=0, ipcPort=8020, storageInfo=lv=-56;cid=CID-e7f736ac-158e-446e-9091-7e66f3cddf3c;nsid=358250775;c=1428471998571) Starting thread to transfer BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 to dnB:1004 dnC:1004
{noformat}
All the packets going out from dnB's interface were getting corrupted, so dnC received a corrupt block and reported the bad block (attributed to dnA) to the namenode.
The following are the logs from dnC:
{noformat}
2015-12-13 21:09:43,444 [DataXceiver for client  at /dnB:34879 [Receiving block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617]] WARN datanode.DataNode: Checksum error in block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 from /dnB:34879
org.apache.hadoop.fs.ChecksumException: Checksum error:  at 58368 exp: -1657951272 got: 856104973
        at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native Method)
        at org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
        at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
        at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:416)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:550)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:853)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
        at java.lang.Thread.run(Thread.java:745)
2015-12-13 21:09:43,445 [DataXceiver for client  at dnB:34879 [Receiving block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617]] INFO datanode.DataNode: report corrupt BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 from datanode dnA:1004 to namenode
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)