hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7374) Allow decommissioning of dead DataNodes
Date Fri, 14 Nov 2014 00:28:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211582#comment-14211582 ]

Andrew Wang commented on HDFS-7374:

This issue is definitely tricky. I agree with everyone's discussion thus far, thanks especially
to [~mingma] for weighing in with insights from HDFS-6791. I think Zhe's proposal #2 is good,
and we should work on getting it in. As a follow-on, we could also consider trying to expose
more information to operators to help them decide if they should "force decom" by messing
with the exclude file. The core issue IIUC is knowing if a force decom will result in data
loss, which could probably be pieced together from fsck, but is by no means cheap to do.

With that, some light patch comments:

* I think the logic is a bit wrong right now, since it can shortcut a node from (DEAD, DECOM_IN_PROGRESS)
to (DEAD, DECOMMED) if refresh is called when the node is in the exclude file. IIUC, what
we want is to allow only the (DEAD, NORMAL) to (DEAD, DECOMMED) transition.
* Because of the above, it would be good to move this logic into startDecommission instead.
We also want to do some log prints even in this situation.
* We can use GenericTestUtils#waitFor to do the waitForDatanodeState, it prints a nice stack
trace as a benefit.
* Could clean up the imports in TestDeadDatanode, TestDecommissioningStatus
* In TestDecomm, could remove the println and fix the line longer than 80 chars; could also add a test timeout
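
The transition rule from the first two bullets can be sketched roughly as below. This is a
minimal, self-contained model and not the patch itself: the DecomSketch class, its enums, and
the println logging are hypothetical stand-ins for the real NameNode code, and the live-node
branch assumes the usual behavior of entering DECOM_IN_PROGRESS first.

```java
/** Hypothetical, simplified model of the proposed rule; not Hadoop's actual classes. */
final class DecomSketch {
    enum Liveness { LIVE, DEAD }
    enum AdminState { NORMAL, DECOM_IN_PROGRESS, DECOMMED }

    /**
     * Decides the admin state for a node that appears in the exclude file.
     * Only (DEAD, NORMAL) may jump straight to DECOMMED. A node that is already
     * DECOM_IN_PROGRESS or DECOMMED is left alone, even if it is dead.
     */
    static AdminState startDecommission(Liveness liveness, AdminState current) {
        if (current != AdminState.NORMAL) {
            // No shortcut from (DEAD, DECOM_IN_PROGRESS) to (DEAD, DECOMMED).
            System.out.println("Node already " + current + "; leaving state unchanged");
            return current;
        }
        if (liveness == Liveness.DEAD) {
            // Log even in the dead-node case, as the review asks.
            System.out.println("Dead node in exclude file; marking DECOMMED directly");
            return AdminState.DECOMMED;
        }
        System.out.println("Starting decommission of live node");
        return AdminState.DECOM_IN_PROGRESS;
    }

    public static void main(String[] args) {
        for (Liveness l : Liveness.values()) {
            for (AdminState s : AdminState.values()) {
                System.out.println("(" + l + ", " + s + ") -> " + startDecommission(l, s));
            }
        }
    }
}
```

Putting the check in one place like this means a repeated refresh cannot change the outcome
for a node that is already mid-decommission.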

Thanks for working on this Zhe!

> Allow decommissioning of dead DataNodes
> ---------------------------------------
>                 Key: HDFS-7374
>                 URL: https://issues.apache.org/jira/browse/HDFS-7374
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HDFS-7374-001.patch
> We have seen the use case of decommissioning DataNodes that are already dead or unresponsive,
> and not expected to rejoin the cluster.
> The logic introduced by HDFS-6791 will mark those nodes as {{DECOMMISSION_INPROGRESS}},
> with a hope that they can come back and finish the decommission work. If an upper layer application
> is monitoring the decommissioning progress, it will hang forever.
