hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3212) RMNode State Transition Update with DECOMMISSIONING state
Date Wed, 12 Aug 2015 11:48:46 GMT

    [ https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693360#comment-14693360 ]

Junping Du commented on YARN-3212:

Thanks [~sunilg] for the comments! I agree it is not a bad idea to give a node in decommissioning
more chances when it has just turned UNHEALTHY. However, it would add complexity, for example:
how many rounds we should wait (a heartbeat count or a time limit, with a separate configuration?),
an additional state for a node that is both decommissioning and unhealthy, etc. We should
evaluate whether it is worth doing before we have hands-on experience with this new feature. In practice,
I have rarely seen nodes return to a healthy state quickly (unless someone logs in and fixes them
immediately), that is, within the timeout.
Thus, I would prefer to keep the current transition, which sounds slightly aggressive but
is a good trade-off for simplicity at this moment. I can put a TODO in a later patch (if there are other
outstanding issues from the comments) to think more about this when we come back with more
experience. Does that make sense?

> RMNode State Transition Update with DECOMMISSIONING state
> ---------------------------------------------------------
>                 Key: YARN-3212
>                 URL: https://issues.apache.org/jira/browse/YARN-3212
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: RMNodeImpl - new.png, YARN-3212-v1.patch, YARN-3212-v2.patch, YARN-3212-v3.patch,
YARN-3212-v4.1.patch, YARN-3212-v4.patch, YARN-3212-v5.1.patch, YARN-3212-v5.patch
> As proposed in YARN-914, a new “DECOMMISSIONING” state will be added; a node can transition to it
from the “running” state when triggered by a new event, “decommissioning”.
> This new state can transition to “decommissioned” on Resource_Update if there are
no running apps on this NM, or when the NM reconnects after a restart. It also does so when it receives a DECOMMISSIONED event
(after the timeout from the CLI).
> In addition, it can go back to “running” if the user decides to cancel the previous decommission
by calling recommission on the same node. The reaction to other events is similar to RUNNING
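
The transitions described above can be sketched roughly as follows. This is only an illustration of the description, not the code in the attached patches: the class, enum, and method names (DecommissionSketch, NodeState, NodeEvent, next) are hypothetical, and the sketch assumes that "no running apps" and "NM reconnect after restart" are two separate triggers. Events the description does not mention leave the state unchanged here.

```java
/** Hypothetical sketch of the transitions in the issue description; not the RMNodeImpl API. */
final class DecommissionSketch {

    /** Only the states named in the description. */
    enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

    /** Illustrative event names; the real event types may differ. */
    enum NodeEvent { DECOMMISSIONING, RESOURCE_UPDATE, RECONNECT, DECOMMISSIONED, RECOMMISSION }

    /**
     * Returns the next state for a node, given the event and the number of
     * applications still running on that NodeManager.
     */
    static NodeState next(NodeState current, NodeEvent event, int runningApps) {
        switch (current) {
            case RUNNING:
                // The new "decommissioning" event moves a running node into the new state.
                return event == NodeEvent.DECOMMISSIONING ? NodeState.DECOMMISSIONING : current;
            case DECOMMISSIONING:
                switch (event) {
                    case RESOURCE_UPDATE:
                        // Finish decommissioning only once no apps remain on the node.
                        return runningApps == 0 ? NodeState.DECOMMISSIONED : current;
                    case RECONNECT:
                        // Assumed reading: NM reconnect after restart completes the decommission.
                        return NodeState.DECOMMISSIONED;
                    case DECOMMISSIONED:
                        // Sent after the timeout from the CLI.
                        return NodeState.DECOMMISSIONED;
                    case RECOMMISSION:
                        // The user cancelled the previous decommission.
                        return NodeState.RUNNING;
                    default:
                        return current;
                }
            default:
                // DECOMMISSIONED is treated as terminal in this sketch.
                return current;
        }
    }
}
```

For example, next(NodeState.DECOMMISSIONING, NodeEvent.RESOURCE_UPDATE, 3) keeps the node in DECOMMISSIONING, while the same call with 0 running apps yields DECOMMISSIONED.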
