hadoop-hdfs-issues mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
Date Fri, 07 Nov 2014 05:47:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201646#comment-14201646 ]

Ming Ma commented on HDFS-7374:

Zhe, thanks for reporting this.

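At a high level, there is a state machine for each DN with a total of 6 possible states: the
combinations of liveness (Live or Dead) and admin state (NORMAL, DECOMMISSION_INPROGRESS or
DECOMMISSIONED), for example {{Live, DECOMMISSION_INPROGRESS}} and {{Dead, DECOMMISSIONED}}.
Events such as node membership changes and decommission management cause the state to change.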
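
To make the state space concrete, here is a minimal sketch that enumerates the 6 combinations. The class and function names ({{Liveness}}, {{AdminState}}, {{render}}) are made up for illustration and are not the HDFS classes.

```python
from enum import Enum
from itertools import product


class Liveness(Enum):
    # Whether the NameNode currently considers the DN alive.
    LIVE = "Live"
    DEAD = "Dead"


class AdminState(Enum):
    # Decommission status of the DN.
    NORMAL = "NORMAL"
    DECOMMISSION_INPROGRESS = "DECOMMISSION_INPROGRESS"
    DECOMMISSIONED = "DECOMMISSIONED"


# Every (liveness, admin state) pair is one state of the DN state machine.
ALL_STATES = list(product(Liveness, AdminState))


def render(state):
    """Format a state the way it is written in this thread."""
    liveness, admin = state
    return "{{%s, %s}}" % (liveness.value, admin.name)


if __name__ == "__main__":
    for state in ALL_STATES:
        print(render(state))
```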

Your #1 suggestion is to have {{Dead, DECOMMISSION_INPROGRESS}} transition to {{Dead, DECOMMISSIONED}}
upon timeout. I'm not sure that is the best approach. Your #2 suggestion is to have {{Dead, NORMAL}}
transition directly to {{Dead, DECOMMISSIONED}} upon a decomm event. That sounds like a
good idea to address your situation.

However, we still have to decide which state {{Live, DECOMMISSION_INPROGRESS}}
should transition to when the DN becomes dead. HDFS-6791 makes it transition to {{Dead, DECOMMISSION_INPROGRESS}}.
It seems you want to make sure it eventually gets to the {{Dead, DECOMMISSIONED}} state.
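
The transitions discussed so far can be written as a small lookup table. This is only a sketch to compare the options, not the NameNode code; the event names ({{node_dead}}, {{decommission}}, {{timeout}}) and the helpers {{step}} and {{run}} are hypothetical.

```python
# States are (liveness, admin_state) pairs, written as in this thread.
# Transition behavior attributed to HDFS-6791 in the discussion above.
CURRENT_BEHAVIOR = {
    (("Live", "DECOMMISSION_INPROGRESS"), "node_dead"): ("Dead", "DECOMMISSION_INPROGRESS"),
}

# Transitions proposed in the issue (suggestions #1 and #2).
PROPOSED = {
    # Suggestion #1: give up waiting for a dead node after a timeout.
    (("Dead", "DECOMMISSION_INPROGRESS"), "timeout"): ("Dead", "DECOMMISSIONED"),
    # Suggestion #2: decommissioning an already dead node completes immediately.
    (("Dead", "NORMAL"), "decommission"): ("Dead", "DECOMMISSIONED"),
}


def step(state, event, table):
    """Return the next state, or the same state if the table has no entry."""
    return table.get((state, event), state)


def run(state, events, table):
    """Apply a sequence of events to a starting state."""
    for event in events:
        state = step(state, event, table)
    return state


if __name__ == "__main__":
    start = ("Live", "DECOMMISSION_INPROGRESS")
    events = ["node_dead", "timeout"]
    # Without suggestion #1 the node stays in Dead, DECOMMISSION_INPROGRESS.
    print(run(start, events, CURRENT_BEHAVIOR))
    # With both suggestions it reaches Dead, DECOMMISSIONED.
    print(run(start, events, {**CURRENT_BEHAVIOR, **PROPOSED}))
```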

Some more ideas on this.

1. If the node stays in {{Dead, DECOMMISSION_INPROGRESS}} for too long, have the higher layer
application remove the node from the exclude file and thus abort the decommission process. This
will transition the node to {{Dead, NORMAL}}.
2. HDFS-6791 mentioned another way to address the original issue. When nodes become dead,
mark them DECOMMISSIONED and fix the replication to handle this case. In other words, get
rid of the {{Dead, DECOMMISSION_INPROGRESS}} state.
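
Idea 1 could look roughly like the following on the application side. This is a sketch under assumptions: how the application learns which nodes are in {{Dead, DECOMMISSION_INPROGRESS}}, and how it rewrites the exclude file and asks the NameNode to re-read it, are left out. The function names and the {{max_wait_seconds}} parameter are hypothetical.

```python
def nodes_to_abort(dead_inprogress_since, now, max_wait_seconds):
    """Return hosts that have been dead and DECOMMISSION_INPROGRESS for too long.

    dead_inprogress_since maps a host name to the time (in seconds) at which
    the application first saw it in that state.
    """
    return sorted(
        host
        for host, since in dead_inprogress_since.items()
        if now - since > max_wait_seconds
    )


def updated_exclude_list(exclude_hosts, to_abort):
    """Return the exclude list with the aborted hosts removed, order preserved."""
    aborted = set(to_abort)
    return [host for host in exclude_hosts if host not in aborted]


if __name__ == "__main__":
    # Example data only: dn1 has been stuck for 100s, dn2 for 10s.
    stuck = nodes_to_abort({"dn1": 0, "dn2": 90}, now=100, max_wait_seconds=60)
    print(stuck)
    print(updated_exclude_list(["dn1", "dn2", "dn3"], stuck))
```

After the exclude file is rewritten, the application would still have to make the NameNode pick up the change before the node moves to {{Dead, NORMAL}}.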

Initially I planned to refactor the code to make the state transitions more explicit, but I didn't find
it worthwhile.

> Allow decommissioning of dead DataNodes
> ---------------------------------------
>                 Key: HDFS-7374
>                 URL: https://issues.apache.org/jira/browse/HDFS-7374
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
> We have seen the use case of decommissioning DataNodes that are already dead or unresponsive,
> and not expected to rejoin the cluster.
> The logic introduced by HDFS-6791 will mark those nodes as {{DECOMMISSION_INPROGRESS}},
> with a hope that they can come back and finish the decommission work. If an upper layer application
> is monitoring the decommissioning progress, it will hang forever.

