hadoop-hdfs-issues mailing list archives

From "Rushabh S Shah (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-9558) Replication requests always blames the source datanode in case of Checksum Exception.
Date Tue, 15 Dec 2015 19:18:47 GMT
Rushabh S Shah created HDFS-9558:
------------------------------------

             Summary: Replication requests always blames the source datanode in case of Checksum Exception.
                 Key: HDFS-9558
                 URL: https://issues.apache.org/jira/browse/HDFS-9558
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
            Reporter: Rushabh S Shah


Replication requests from a datanode (for example, after a rack failure) always blame the source
datanode if any of the downstream nodes encounters a ChecksumException.
We saw this case recently in our cluster.
We lost 7 nodes in a rack.
Only one replica of the block remained (say on dnA).
The namenode asked dnA to replicate the block to dnB and dnC.
{noformat}
2015-12-13 21:09:41,798 [DataNode:   heartbeating to NN:8020] INFO datanode.DataNode: DatanodeRegistration(dnA,
datanodeUuid=bc1f183d-b74a-49c9-ab1a-d1d496ab77e9, infoPort=1006, infoSecurePort=0, ipcPort=8020,
storageInfo=lv=-56;cid=CID-e7f736ac-158e-446e-9091-7e66f3cddf3c;nsid=358250775;c=1428471998571)
Starting thread to transfer BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617
to dnB:1004 dnC:1004 
{noformat}

All the packets going out from dnB's interface were getting corrupted.
So dnC received a corrupt block and reported dnA's replica as bad to the namenode.
The following are the logs from dnC:
{noformat}
2015-12-13 21:09:43,444 [DataXceiver for client  at /dnB:34879 [Receiving block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617]]
WARN datanode.DataNode: Checksum error in block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617
from /dnB:34879
org.apache.hadoop.fs.ChecksumException: Checksum error:  at 58368 exp: -1657951272 got: 856104973
        at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native Method)
        at org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
        at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
        at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:416)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:550)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:853)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
        at java.lang.Thread.run(Thread.java:745)
2015-12-13 21:09:43,445 [DataXceiver for client  at dnB:34879 [Receiving block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617]]
INFO datanode.DataNode: report corrupt BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617
from datanode dnA:1004 to namenode
{noformat}
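
The misattribution above can be illustrated with a toy model. The following is only a sketch and not HDFS code: the class `ReplicationBlameSketch` and all of its methods are hypothetical names introduced for illustration, it uses a single CRC32 over the whole payload instead of per-chunk checksums, and it assumes every receiver in the pipeline verifies what it receives. It contrasts the behaviour described in this report (blame the head of the pipeline) with one possible alternative (blame the immediate sender), which is shown for comparison only and is not a fix proposed here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

/** Toy model of a replication pipeline. Hypothetical names; not HDFS code. */
final class ReplicationBlameSketch {

    /** CRC32 over the whole payload (a simplification of per-chunk checksums). */
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /**
     * Pushes non-empty data through the pipeline and returns the index of the
     * first node that detects a checksum mismatch, or -1 if no node does.
     * corruptingSender is the index of the node whose outgoing packets are
     * corrupted, or -1 for a healthy pipeline.
     */
    static int firstDetector(List<String> pipeline, byte[] data, int corruptingSender) {
        long expected = checksum(data);
        byte[] inFlight = data.clone();
        for (int receiver = 1; receiver < pipeline.size(); receiver++) {
            if (receiver - 1 == corruptingSender) {
                inFlight = inFlight.clone();
                inFlight[0] ^= 0x01; // flip one bit on the sender's outgoing link
            }
            if (checksum(inFlight) != expected) {
                return receiver;
            }
        }
        return -1;
    }

    /** Behaviour described in this report: the head of the pipeline is blamed. */
    static String blameSource(List<String> pipeline, int detector) {
        return pipeline.get(0);
    }

    /** Alternative for comparison only: blame the node that sent the bad packet. */
    static String blameUpstream(List<String> pipeline, int detector) {
        return pipeline.get(detector - 1);
    }

    public static void main(String[] args) {
        List<String> pipeline = Arrays.asList("dnA", "dnB", "dnC");
        byte[] block = "example block contents".getBytes();
        int detector = firstDetector(pipeline, block, 1); // dnB's outgoing link is bad
        System.out.println("detected at:    " + pipeline.get(detector));
        System.out.println("source policy:  " + blameSource(pipeline, detector));
        System.out.println("upstream policy: " + blameUpstream(pipeline, detector));
    }
}
```

With the pipeline from this report (dnA, dnB, dnC) and corruption on dnB's outgoing link, the mismatch is detected at dnC; the source policy names dnA, while the upstream policy names dnB.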


