hadoop-hdfs-issues mailing list archives

From "Lei (Eddy) Xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7877) Support maintenance state for datanodes
Date Tue, 10 Mar 2015 20:09:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355589#comment-14355589 ]

Lei (Eddy) Xu commented on HDFS-7877:

Hi, [~mingma]. This work looks great and is more comprehensive than HDFS-6729. I especially
like that the NN checks for blocks with only a single replica before setting a DN to maintenance
mode: that is safer than HDFS-6729.

I have a few questions regarding the rest of your design.

* Why is the node state a combination of {{<live|dead>}} and {{In service|Decommissioned|In
maintenance..}}? Do we need to keep a DN in {{maintenance}} mode if it is dead? That makes the
state machine very complex.
* Is the DN state (e.g., {{enter_maintenance}} or {{in_maintenance}}) kept only in the NN's memory? If so, after the NN
restarts, I think it cannot tell whether a DN is in {{enter_maintenance}} or {{in_maintenance}}
mode. Is there a default mode you will assume for a DN, or is there a way for the NN to decide
which state the DN is in?
* Moreover, after the NN restarts, if a DN is actually in maintenance mode (i.e., the DN is shut
down for maintenance), the NN will not receive block reports from this DN. If this is the case,
would the NN miscalculate the blockMap?
* bq. put the dead node into maintenance mode
Is this necessary? As you mentioned, when a DN is dead, its blocks have already been replicated
to other nodes. In my understanding, maintenance mode is a way to tell the NN not to move data
while the DN is actually offline. The logic that brings back a {{dead IN_MAINTENANCE}} DN
and removes replicas from the block maps looks very similar to restarting a (dead) DN. Could it
simply reuse that logic?
* In HDFS-6729, I treated maintenance mode as a temporary soft state, because my understanding
is that putting a DN into maintenance mode puts data availability at risk: it essentially
asks the NN to ignore one "dead" (in maintenance) replica. As a result, I did not record DNs in
a persistent configuration file, and instead let the user specify a timeout for how long a DN
may stay in maintenance mode. When the timeout expires (e.g., after a 1-hour maintenance window),
the NN considers the DN dead and re-replicates its blocks elsewhere. Does this make sense to you?
Could you address this concern in your design?
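
To make the timeout idea in the last bullet concrete, here is a minimal sketch of a soft-state tracker. It is only an illustration: the class {{MaintenanceTracker}}, its methods, and the state names are hypothetical and are not taken from the HDFS-6729 or HDFS-7877 patches. It also assumes, for simplicity, that liveness from heartbeats is ignored and that time is passed in explicitly.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of timeout-based maintenance as soft state; not HDFS code. */
class MaintenanceTracker {
    /** Effective state the NN would act on (illustrative names only). */
    public enum State { IN_SERVICE, IN_MAINTENANCE, DEAD }

    // Datanode id -> absolute time (ms) at which its maintenance window ends.
    // Held in memory only, so a restart of the tracker loses every entry.
    private final Map<String, Long> expiryMs = new HashMap<>();

    /** Put a datanode into maintenance for timeoutMs, starting at nowMs. */
    public void enterMaintenance(String dnId, long nowMs, long timeoutMs) {
        expiryMs.put(dnId, nowMs + timeoutMs);
    }

    /** The operator ends maintenance early; the node returns to service. */
    public void exitMaintenance(String dnId) {
        expiryMs.remove(dnId);
    }

    /**
     * Resolve the state of a datanode at nowMs. A node with no maintenance
     * entry is IN_SERVICE. Once the window has elapsed the node is reported
     * as DEAD, which is the point where re-replication would be scheduled.
     */
    public State stateAt(String dnId, long nowMs) {
        Long expiry = expiryMs.get(dnId);
        if (expiry == null) {
            return State.IN_SERVICE;
        }
        return nowMs < expiry ? State.IN_MAINTENANCE : State.DEAD;
    }
}
```

Because the map lives only in memory, this sketch shows the same restart problem raised in the second bullet: after a restart, every DN would appear as {{IN_SERVICE}} until maintenance is requested again.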

Looking forward to hearing from you, [~mingma]. Thanks again for this great work!

> Support maintenance state for datanodes
> ---------------------------------------
>                 Key: HDFS-7877
>                 URL: https://issues.apache.org/jira/browse/HDFS-7877
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Ming Ma
>         Attachments: HDFS-7877.patch, Supportmaintenancestatefordatanodes.pdf
> This requirement came up during the design for HDFS-7541. Given this feature is mostly
> independent of the upgrade domain feature, it is better to track it under a separate jira. The
> design and draft patch will be available soon.
