hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11609) Some blocks can be permanently lost if nodes are decommissioned while dead
Date Fri, 31 Mar 2017 20:14:41 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951587#comment-15951587 ]

Kihwal Lee commented on HDFS-11609:
-----------------------------------

h3. Inability to correctly guess the previous replication priority
Guessing the previous replication priority level of a block works most of the time, but it is
not perfect. Different orderings of events can lead to an identical current state while the
previous priority levels differ.  We can improve the priority update method so that the guessing
logic still provides a benefit in the majority of cases, yet its correctness is not strictly necessary.

The following describes the problems I encountered and their solutions.

In {{UnderReplicatedBlocks}},
{code}
  private int getPriority(int curReplicas,
                          int readOnlyReplicas,
                          int decommissionedReplicas,
                          int expectedReplicas) {
    assert curReplicas >= 0 : "Negative replicas!";
{code}
This is called from {{update()}}, which computes the old priority using {{curReplicas - curReplicasDelta}}.
When all replica-containing nodes are dead ({{curReplicas}} is 0) and a decommissioned node
holding a replica rejoins, {{getPriority()}} ends up being called with a {{curReplicas}} of -1,
which sets off the assert.  This stops the initial block report processing in the middle.  The
node is live and decommissioned, yet the block appears missing because its block report was
never processed due to the assertion failure.
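
For reference, {{update()}} guesses the old priority roughly like this (simplified sketch, not the exact code; variable names approximate):
{code}
    // Simplified sketch of UnderReplicatedBlocks.update().
    // The previous priority is guessed by undoing the deltas.
    int oldPri = getPriority(curReplicas - curReplicasDelta, readOnlyReplicas,
        decommissionedReplicas, curExpectedReplicas - expectedReplicasDelta);
    int curPri = getPriority(curReplicas, readOnlyReplicas,
        decommissionedReplicas, curExpectedReplicas);
    // With every replica-holding node dead, curReplicas is 0. When a
    // decommissioned node holding a replica rejoins and curReplicasDelta is 1,
    // the first call receives 0 - 1 == -1 and trips the assert above.
{code}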

This can be avoided by not setting the delta to 1 when the replica is on a decommissioned node.
This value originates from {{BlockManager}}'s {{addStoredBlock()}}.
{code}
     if (result == AddBlockResult.ADDED) {
-      curReplicaDelta = 1;
+      curReplicaDelta = (node.isDecommissioned()) ? 0 : 1;
{code}
This fixes this particular issue.

The assert is disabled in production builds, so the behavior differs at runtime. Without the
above fix, instead of block report processing blowing up, the -1 causes {{getPriority()}} to
return {{QUEUE_VERY_UNDER_REPLICATED}}, which is incorrect.
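
For reference, with asserts disabled a negative {{curReplicas}} simply falls through the earlier branches of {{getPriority()}} (branch order paraphrased, readOnlyReplicas handling omitted; values assumed for illustration):
{code}
    // curReplicas = -1, decommissionedReplicas = 1, expectedReplicas = 3
    //   curReplicas >= expectedReplicas     ->  -1 >= 3  -> false
    //   curReplicas == 0                    ->  -1 == 0  -> false
    //                                           (the decommissioned-replica case is skipped)
    //   curReplicas == 1                    ->  -1 == 1  -> false
    //   curReplicas * 3 < expectedReplicas  ->  -3 <  3  -> true
    //   => QUEUE_VERY_UNDER_REPLICATED
{code}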

If the previous priority level is guessed incorrectly and the guess happens to be identical to
the current level, the old entry won't be removed, resulting in duplicate entries. The {{remove()}}
method is already robust: if the block is not found at the specified level, it tries to remove
it from the other priority levels too. So we can simply call {{remove()}} unconditionally. With
this change, guessing the old priority is no longer necessary for correctness, but it is still
useful, since the guess is normally correct and the removal then only has to visit one priority
level in most cases.

{code}
-    if(oldPri != curPri) {
-      remove(block, oldPri);
-    }
+    // oldPri is mostly correct, but not always. If not found with oldPri,
+    // other levels will be searched until the block is found & removed.
+    remove(block, oldPri);
{code}
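
For reference, the fallback in {{remove()}} that this relies on behaves roughly as follows (paraphrased sketch, not the exact code):
{code}
  // Paraphrased: UnderReplicatedBlocks.remove(block, priLevel).
  boolean remove(BlockInfo block, int priLevel) {
    // Fast path: the caller's guess is usually right, so only one
    // priority queue is touched.
    if (priLevel >= 0 && priLevel < LEVEL
        && priorityQueues.get(priLevel).remove(block)) {
      return true;
    }
    // The guess was wrong (or out of range): search the remaining levels.
    for (int i = 0; i < LEVEL; i++) {
      if (i != priLevel && priorityQueues.get(i).remove(block)) {
        return true;
      }
    }
    return false;
  }
{code}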

h3. Replication priority level of a block with only decommissioned replicas
With the surrounding bugs fixed, we can now address the real issue.  {{getPriority()}} explicitly
does this:
{code}
    } else if (curReplicas == 0) {
      // If there are zero non-decommissioned replicas but there are
      // some decommissioned replicas, then assign them highest priority
      if (decommissionedReplicas > 0) {
        return QUEUE_HIGHEST_PRIORITY;
      }
{code}

This does not make any sense. Since decommissioned nodes are never chosen as a replication
source, the block cannot be re-replicated. At this priority, the block won't be recognized
as "missing" either.  The cluster will appear healthy until the decommissioned nodes are taken
down, at which point it may be too late to recover the data.
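
For reference, source selection skips fully decommissioned nodes; roughly (paraphrased from the replica loop in {{BlockManager#chooseSourceDatanode}}, not the exact code):
{code}
      // Paraphrased: while scanning a block's replicas for a replication
      // source, a fully decommissioned node is never chosen, so a block
      // whose only replicas are decommissioned has no usable source.
      if (node.isDecommissioned()) {
        continue;
      }
{code}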

There are several possible approaches to this.
1) If the only replicas a block has are decommissioned, show it as missing, i.e. priority level
{{QUEUE_WITH_CORRUPT_BLOCKS}}. {{fsck}} will show the decommissioned locations, and the admin
can recommission/decommission the nodes or manually copy the data out. (A rough sketch follows this list.)
2) Re-evaluate all replicas when a decommissioned node rejoins. The simplest way is to start
decommissioning the node again.
3) Allow a decommissioned replica to be picked as a replication source in this special case.
Even then, 1) might still be needed.
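
For illustration, 1) amounts to roughly the following change in {{getPriority()}} (simplified sketch; the actual patch may differ):
{code}
    } else if (curReplicas == 0) {
      // No live replicas. A decommissioned replica cannot serve as a
      // replication source, so surface the block as missing instead of
      // assigning it the highest replication priority.
      return QUEUE_WITH_CORRUPT_BLOCKS;
    }
{code}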

I have a patch with 1) and a unit test, but want to hear from others before posting.

> Some blocks can be permanently lost if nodes are decommissioned while dead
> --------------------------------------------------------------------------
>
>                 Key: HDFS-11609
>                 URL: https://issues.apache.org/jira/browse/HDFS-11609
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>
> When all the nodes containing a replica of a block are decommissioned while they are dead, they get decommissioned right away even if there are missing blocks. This behavior was introduced by HDFS-7374.
> The problem starts when those decommissioned nodes are brought back online. The namenode no longer shows missing blocks, which creates a false sense of cluster health. When the decommissioned nodes are removed and reformatted, the block data is permanently lost. The namenode will report missing blocks after the heartbeat recheck interval (e.g. 10 minutes) from the moment the last node is taken down.
> There are multiple issues in the code. As some cause different behaviors in testing vs. production, it took a while to reproduce it in a unit test. I will present analysis and proposal soon.



