hadoop-hdfs-issues mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
Date Fri, 07 Nov 2014 05:47:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201646#comment-14201646
] 

Ming Ma commented on HDFS-7374:
-------------------------------

Zhe, thanks for reporting this.

At a high level, there is a state machine for the DN with a total of 6 possible states: {{Live,
NORMAL}}, {{Live, DECOMMISSION_INPROGRESS}}, {{Live, DECOMMISSIONED}}, {{Dead, NORMAL}}, {{Dead,
DECOMMISSION_INPROGRESS}}, and {{Dead, DECOMMISSIONED}}. Events such as node membership changes
and decommission management cause the state to change.
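
To make the six states concrete, here is a minimal sketch that models a DN state as the pair
(liveness, admin state). The names {{Liveness}}, {{AdminState}} and {{DnState}} are hypothetical
and chosen for illustration only; they are not the actual HDFS classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names for illustration; not the actual HDFS classes.
enum Liveness { LIVE, DEAD }

enum AdminState { NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED }

final class DnState {
    final Liveness liveness;
    final AdminState admin;

    DnState(Liveness liveness, AdminState admin) {
        this.liveness = liveness;
        this.admin = admin;
    }

    // Enumerate every (liveness, admin) combination: 2 x 3 = 6 states.
    static List<DnState> all() {
        List<DnState> states = new ArrayList<>();
        for (Liveness l : Liveness.values()) {
            for (AdminState a : AdminState.values()) {
                states.add(new DnState(l, a));
            }
        }
        return states;
    }

    @Override
    public String toString() {
        return "{" + liveness + ", " + admin + "}";
    }
}
```

Treating the two dimensions separately shows why membership events and decommission events
can be reasoned about independently: each kind of event normally changes only one half of the pair.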

Your #1 suggestion is to have {{Dead, DECOMMISSION_INPROGRESS}} transition to {{Dead, DECOMMISSIONED}}
upon timeout. I am not sure that is the best approach. Your #2 suggestion is to have {{Dead, NORMAL}}
transition directly to {{Dead, DECOMMISSIONED}} upon a decomm event. That sounds like a
good idea to address your situation.

However, we still have the question of which state {{Live, DECOMMISSION_INPROGRESS}}
should transition to when the DN becomes dead. HDFS-6791 makes it transition to {{Dead, DECOMMISSION_INPROGRESS}}.
It seems you want to make sure it eventually reaches the {{Dead, DECOMMISSIONED}} state.
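
The transitions discussed so far can be sketched as a small transition function. This is a
hypothetical model, not HDFS source: {{NodeState}}, {{NodeEvent}} and {{Transitions.next}} are
made-up names. It assumes that a node death keeps the admin state unchanged (the HDFS-6791
behavior described above for {{DECOMMISSION_INPROGRESS}}), and that a decomm event on
{{Dead, NORMAL}} goes straight to {{Dead, DECOMMISSIONED}}, as in suggestion #2.

```java
// Hypothetical model, not HDFS source: six combined states and two events.
enum NodeState {
    LIVE_NORMAL, LIVE_DECOMMISSION_INPROGRESS, LIVE_DECOMMISSIONED,
    DEAD_NORMAL, DEAD_DECOMMISSION_INPROGRESS, DEAD_DECOMMISSIONED
}

enum NodeEvent { NODE_DIED, DECOMMISSION_REQUESTED }

final class Transitions {
    private Transitions() {}

    static NodeState next(NodeState s, NodeEvent e) {
        switch (e) {
            case NODE_DIED:
                // Assumption: death changes liveness only; the admin state is kept.
                switch (s) {
                    case LIVE_NORMAL: return NodeState.DEAD_NORMAL;
                    case LIVE_DECOMMISSION_INPROGRESS: return NodeState.DEAD_DECOMMISSION_INPROGRESS;
                    case LIVE_DECOMMISSIONED: return NodeState.DEAD_DECOMMISSIONED;
                    default: return s; // already dead
                }
            case DECOMMISSION_REQUESTED:
                switch (s) {
                    case LIVE_NORMAL: return NodeState.LIVE_DECOMMISSION_INPROGRESS;
                    // Suggestion #2: a dead node has no work to finish, so skip INPROGRESS.
                    case DEAD_NORMAL: return NodeState.DEAD_DECOMMISSIONED;
                    default: return s; // decommission already requested or done
                }
            default:
                return s;
        }
    }
}
```

Note that this sketch leaves {{Dead, DECOMMISSION_INPROGRESS}} with no outgoing transition for
either event, which is exactly the open question: a node that dies mid-decommission stays there.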

Some more ideas on this.

1. If the node stays in {{Dead, DECOMMISSION_INPROGRESS}} for too long, have the higher layer
application remove the node from the exclude file and thus abort the decommission process. This
will transition the node to {{Dead, NORMAL}}.
2. HDFS-6791 mentioned another way to address the original issue. When nodes become dead,
mark them DECOMMISSIONED and fix the replication to handle this case. In other words, get
rid of the {{Dead, DECOMMISSION_INPROGRESS}} state.
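
Idea 1 could look roughly like the following on the application side. This is only a sketch
under stated assumptions: {{DecommissionWatchdog}} is a hypothetical helper, the exclude file is
modeled as an in-memory set, and the caller is assumed to know when the node entered
{{Dead, DECOMMISSION_INPROGRESS}}. A real deployment would also have to rewrite the actual
exclude file and ask the NameNode to re-read it (for example with {{hdfs dfsadmin -refreshNodes}}).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical upper-layer helper; the exclude "file" is modeled as an in-memory set.
final class DecommissionWatchdog {
    private final Duration timeout;
    private final Set<String> excludedHosts = new LinkedHashSet<>();

    DecommissionWatchdog(Duration timeout) {
        this.timeout = timeout;
    }

    void exclude(String host) {
        excludedHosts.add(host);
    }

    boolean isExcluded(String host) {
        return excludedHosts.contains(host);
    }

    // Aborts the decommission (removes the host from the exclude set) if the node has been
    // dead and in progress for longer than the timeout. Returns true if it aborted.
    boolean abortIfStuck(String host, boolean deadAndInProgress, Instant enteredAt, Instant now) {
        boolean stuck = deadAndInProgress
                && Duration.between(enteredAt, now).compareTo(timeout) > 0;
        if (stuck) {
            excludedHosts.remove(host);
        }
        return stuck;
    }
}
```

The timeout policy lives entirely in the application here, so the NameNode state machine
itself would not need a timer-driven transition.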

Initially I planned to refactor the code to have more explicit state transitions, but didn't find
it worthwhile.




> Allow decommissioning of dead DataNodes
> ---------------------------------------
>
>                 Key: HDFS-7374
>                 URL: https://issues.apache.org/jira/browse/HDFS-7374
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>
> We have seen the use case of decommissioning DataNodes that are already dead or unresponsive,
> and not expected to rejoin the cluster.
> The logic introduced by HDFS-6791 will mark those nodes as {{DECOMMISSION_INPROGRESS}},
> with a hope that they can come back and finish the decommission work. If an upper layer application
> is monitoring the decommissioning progress, it will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
