hadoop-hdfs-issues mailing list archives

From "Vinayakumar B (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10714) Issue in handling checksum errors in write pipeline when fault DN is LAST_IN_PIPELINE
Date Thu, 15 Sep 2016 17:56:21 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494082#comment-15494082 ]

Vinayakumar B commented on HDFS-10714:
--------------------------------------

bq. In HDFS-6937 case, if DN3 gives ERROR_CHECKSUM error, DN3 will be replaced. But here DN2
got replaced. Would you please add some code snippet to explain how that happened? thanks.

At first, only DN3 is marked bad and replaced, and a reference to DN2 is kept as the sender
involved in that checksum error. If a checksum error is then found again at DN4 (the node that
replaced DN3), DN2 is marked as BAD, provided DN2's local replica was found valid both
times.

Here is the code snippet.
{code}
+      int currentBad = badNodeIndex;
+      /*
+       * When a checksum error is found during transfer of packets, finding
+       * the actual faulty node is tricky, so follow the steps below.
+       * 1. First, remove the node which reported the CHECKSUM error as bad,
+       *  and keep track of its sender.
+       * 2. If a CHECKSUM error is reported a second time and the sender is the
+       *  same as earlier, remove the sender this time instead of the reporter.
+       */
+      if (checkSumError && badNodeIndex > 0) {
+        if (prevChecksumErrorSenderNode != null) {
+          // If the same node is involved in a second checksum error, then it's
+          // clear that the sender is the faulty node.
+          if (prevChecksumErrorSenderNode.equals(nodes[badNodeIndex - 1])) {
+            badNodeIndex = badNodeIndex - 1;
+            errorState.setBadNodeIndex(badNodeIndex);
+            prevChecksumErrorSenderNode = nodes[badNodeIndex - 1];
+            LOG.warn("Bad node is changed to " + nodes[badNodeIndex]
+                + " instead of " + nodes[currentBad]
+                + " as this node caused checksum error in previous pipeline");
+          }
+        } else {
+          prevChecksumErrorSenderNode = nodes[badNodeIndex - 1];
+          LOG.warn("Bad node is : " + nodes[badNodeIndex]);
+        }
+      }
{code}
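
To make the two-step behaviour easier to follow outside the DataStreamer, here is a minimal standalone sketch of the same decision. It is not the patch itself: the class name `ChecksumBlameSketch`, the method `chooseBadNode`, and the use of plain strings for datanodes are all assumptions made for illustration. The guard against index 0 when remembering the next sender is also an addition of this sketch and is not in the snippet above.

```java
/**
 * Illustrative sketch only (hypothetical class, not HDFS code).
 * Datanodes are modelled as plain strings such as "DN1".
 */
final class ChecksumBlameSketch {
  // Sender involved in the previous checksum error; null until one occurs.
  private String prevChecksumErrorSender;

  /**
   * Returns the index of the node to remove from the pipeline.
   * reportedIndex is the node that reported the error.
   */
  int chooseBadNode(String[] nodes, int reportedIndex, boolean checksumError) {
    int bad = reportedIndex;
    if (checksumError && reportedIndex > 0) {
      String sender = nodes[reportedIndex - 1];
      if (prevChecksumErrorSender != null) {
        if (prevChecksumErrorSender.equals(sender)) {
          // Same sender in two checksum errors: blame the sender instead.
          bad = reportedIndex - 1;
          // Assumption of this sketch: guard index 0, where no upstream node exists.
          prevChecksumErrorSender = bad > 0 ? nodes[bad - 1] : null;
        }
      } else {
        // First checksum error: blame the reporter, remember the sender.
        prevChecksumErrorSender = sender;
      }
    }
    return bad;
  }

  public static void main(String[] args) {
    ChecksumBlameSketch policy = new ChecksumBlameSketch();
    // Pipeline DN1->DN2->DN3, DN3 reports the checksum error.
    int first = policy.chooseBadNode(new String[] {"DN1", "DN2", "DN3"}, 2, true);
    // DN3 was replaced by DN4; DN4 reports a checksum error as well.
    int second = policy.chooseBadNode(new String[] {"DN1", "DN2", "DN4"}, 2, true);
    System.out.println(first + " " + second); // prints "2 1"
  }
}
```

In this sketch the first call blames the reporter (index 2, DN3) and the second call blames the repeated sender (index 1, DN2), which is the sequence described above. The check that DN2's local replica is valid is left out here.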

> Issue in handling checksum errors in write pipeline when fault DN is LAST_IN_PIPELINE
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-10714
>                 URL: https://issues.apache.org/jira/browse/HDFS-10714
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>         Attachments: HDFS-10714-01-draft.patch
>
>
> We came across an issue where a write failed even though 7 DNs were available, because of a
> network fault at one datanode that was LAST_IN_PIPELINE. It is similar to HDFS-6937.
> Scenario: DN3 has a network fault and Min repl=2.
> Write pipeline:
> DN1->DN2->DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN2 is marked as bad
> DN1->DN4->DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN4 is marked as bad
> ….
> And so on (DN3 is LAST_IN_PIPELINE every time), continuing until no more datanodes are left
> to construct the pipeline.
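
The failure mode in the description can be reproduced with a toy model. This is only a sketch under stated assumptions: the class `LastInPipelineFaultSketch` and its `simulate` method are hypothetical, datanodes are plain strings, the last node always acks ERROR_CHECKSUM, and the node just upstream of it is the one marked bad and swapped for the next spare, as in the scenario quoted above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy model of the reported scenario (hypothetical, not HDFS code). */
final class LastInPipelineFaultSketch {

  /**
   * The last node of the pipeline is faulty and always reports a checksum
   * error. Each time, the node just upstream of it is marked bad and replaced
   * by the next spare. Returns the nodes marked bad, in order, up to the
   * point where no spare is left. The pipeline needs at least two nodes.
   */
  static List<String> simulate(List<String> initialPipeline, List<String> spares) {
    List<String> pipeline = new ArrayList<>(initialPipeline);
    Deque<String> remaining = new ArrayDeque<>(spares);
    List<String> markedBad = new ArrayList<>();
    int blamed = pipeline.size() - 2; // node just upstream of the faulty last node
    while (true) {
      markedBad.add(pipeline.get(blamed));
      if (remaining.isEmpty()) {
        return markedBad; // nothing left to rebuild the pipeline with
      }
      pipeline.set(blamed, remaining.poll());
    }
  }

  public static void main(String[] args) {
    List<String> bad = simulate(
        Arrays.asList("DN1", "DN2", "DN3"),
        Arrays.asList("DN4", "DN5", "DN6", "DN7"));
    System.out.println(bad); // prints "[DN2, DN4, DN5, DN6, DN7]"
  }
}
```

With 7 datanodes and DN3 fixed as the last node, every healthy replacement is discarded in turn while the faulty DN3 stays in the pipeline.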



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
