hadoop-hdfs-issues mailing list archives

From "Tsz Wo (Nicholas), SZE (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline
Date Sun, 02 Dec 2012 02:49:58 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508134#comment-13508134 ]

Tsz Wo (Nicholas), SZE commented on HDFS-3875:
----------------------------------------------

Hi Kihwal,

In a client write pipeline, only the last datanode verifies checksums.  If there is a
checksum error, we don't know what went wrong: it could be that one of the datanodes is
faulty or that a network path is faulty.  So the client must stop; it cannot simply take
out a datanode and continue.  Do you agree?
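
For illustration, per-chunk verification at the tail of the pipeline amounts to something
like the following standalone sketch (CRC32 and the 512-byte chunk size are assumptions
here; the real code is BlockReceiver.verifyChunks working with DataChecksum):
{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Standalone illustration only: the tail datanode verifies each chunk
// against its 4-byte checksum; upstream datanodes just relay the data.
public class ChunkVerifySketch {
  static final int BYTES_PER_CHUNK = 512;  // assumption for this sketch

  static void verifyChunks(ByteBuffer data, ByteBuffer checksums)
      throws IOException {
    byte[] chunk = new byte[BYTES_PER_CHUNK];
    CRC32 crc = new CRC32();
    while (data.hasRemaining()) {
      int len = Math.min(BYTES_PER_CHUNK, data.remaining());
      data.get(chunk, 0, len);
      crc.reset();
      crc.update(chunk, 0, len);
      long expected = checksums.getInt() & 0xFFFFFFFFL;
      if (crc.getValue() != expected) {
        throw new IOException("Checksum error in chunk ending at offset "
            + data.position());
      }
    }
  }
}
{code}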

In the patch, only the last datanode can report a checksum error.  If it does, all statuses
in the ack become ERROR_CHECKSUM.  The approach seems reasonable.
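
Concretely, for a three-datanode pipeline the ack for the failing packet would then carry
one ERROR_CHECKSUM reply per datanode, roughly as in this self-contained sketch (the
stand-in Status enum only mirrors the value the patch uses):
{code}
// Sketch of the ack contents the client would see for the failing
// packet.  Every reply is ERROR_CHECKSUM because the client cannot
// tell which node or link corrupted the data, so no single datanode
// can safely be blamed and replaced.
enum Status { SUCCESS, ERROR, ERROR_CHECKSUM }

class AckSketch {
  static final Status[] REPLIES = {
      Status.ERROR_CHECKSUM,  // datanode 1
      Status.ERROR_CHECKSUM,  // datanode 2
      Status.ERROR_CHECKSUM   // datanode 3, the tail that detected it
  };
}
{code}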

Some questions on the patch:
- receivePacket() returns -1 on a checksum error.  Why not throw an exception?  Returning
-1 should mean a normal exit.
- The exception caught in the hunk below is not used.  Should it be re-thrown?
{code}
+      if (shouldVerifyChecksum()) {
+        try {
+          verifyChunks(dataBuf, checksumBuf);
+        } catch (IOException e) {
+          // checksum error detected locally. there is no reason to continue.
+          if (responder != null) {
+            ((PacketResponder) responder.getRunnable()).enqueue(seqno,
+                lastPacketInBlock, offsetInBlock,
+                Status.ERROR_CHECKSUM);
+          }
+          // return without writing data.
+          checksumError = true;
+          return -1;
+        }
{code}
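
For comparison, the re-throw variant suggested by the second question would look roughly
like this (a hypothetical shape only, reusing the identifiers from the hunk above; it
still enqueues the ERROR_CHECKSUM ack first):
{code}
      if (shouldVerifyChecksum()) {
        try {
          verifyChunks(dataBuf, checksumBuf);
        } catch (IOException e) {
          // checksum error detected locally: ack ERROR_CHECKSUM first
          if (responder != null) {
            ((PacketResponder) responder.getRunnable()).enqueue(seqno,
                lastPacketInBlock, offsetInBlock,
                Status.ERROR_CHECKSUM);
          }
          // then propagate the failure instead of returning -1, so a
          // caller cannot mistake it for a normal end of stream
          throw e;
        }
      }
{code}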

                
> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
>                 Key: HDFS-3875
>                 URL: https://issues.apache.org/jira/browse/HDFS-3875
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Assignee: Kihwal Lee
>            Priority: Blocker
>         Attachments: hdfs-3875.branch-0.23.no.test.patch.txt, hdfs-3875.branch-0.23.with.test.patch.txt,
> hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.with.test.patch.txt,
> hdfs-3875.trunk.with.test.patch.txt, hdfs-3875-wip.patch
>
>
> We saw this issue with one block in a large test cluster. The client is storing the data
> with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it received from
> the first node. We don't know if the client sent a bad checksum, or if it got corrupted
> between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it threw an
> exception. The pipeline started up again with only one replica (the first node in the
> pipeline).
> - this replica was later determined to be corrupt by the block scanner, and it was
> unrecoverable since it was the only replica.

