hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3212) RMNode State Transition Update with DECOMMISSIONING state
Date Mon, 27 Apr 2015 14:16:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514182#comment-14514182 ]

Junping Du commented on YARN-3212:

bq. we also need to verify that the scheduler hasn't allocated or handed out a container for that
node that hasn't reached the node yet, rather than only checking application status.
Thinking about this problem again: the other option is to go ahead and mark this node
as decommissioned anyway, but make sure the AM and RM stay on the same page.
It depends on how we understand the word "graceful" here: if it means making node
decommissioning less expensive, then this case falls into that category, since releasing an unlaunched
container is pretty cheap and could be better than waiting for the container to execute from the beginning;
if we think it means a clean scheduling flow and clean log messages (at least within the timeout), we
should probably wait for the container to launch.

> RMNode State Transition Update with DECOMMISSIONING state
> ---------------------------------------------------------
>                 Key: YARN-3212
>                 URL: https://issues.apache.org/jira/browse/YARN-3212
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: RMNodeImpl - new.png, YARN-3212-v1.patch, YARN-3212-v2.patch, YARN-3212-v3.patch
> As proposed in YARN-914, a new “DECOMMISSIONING” state will be added. A node can transition to it
from the “running” state, triggered by a new “decommissioning” event.
> This new state can transition to the “decommissioned” state on Resource_Update if there are
no running apps on this NM or the NM reconnects after a restart. It also transitions when it receives a DECOMMISSIONED event
(after the timeout from the CLI).
> In addition, it can go back to “running” if the user decides to cancel the previous decommission
by calling recommission on the same node. The reaction to other events is similar to RUNNING
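
The transitions described above can be sketched as a small state function. This is only an illustration, not the code in the attached patches or the real RMNodeImpl: the class name DecommissionSketch, the enum constants, and the runningApps parameter are hypothetical, and it assumes that an NM reconnect is delivered as its own event. Transitions the description does not spell out are left as no-ops.

```java
// Hypothetical sketch of the transitions in the issue description.
// None of these names come from the real RMNodeImpl or the patches.
final class DecommissionSketch {
    enum State { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

    enum Event { DECOMMISSIONING, RESOURCE_UPDATE, RECONNECTED, DECOMMISSIONED, RECOMMISSION }

    // Returns the next state; combinations the description does not cover
    // leave the state unchanged.
    static State next(State current, Event event, int runningApps) {
        switch (current) {
            case RUNNING:
                // Only the new "decommissioning" event is modeled here.
                return event == Event.DECOMMISSIONING ? State.DECOMMISSIONING : current;
            case DECOMMISSIONING:
                switch (event) {
                    case RESOURCE_UPDATE:
                        // Finish decommissioning once no apps are running on the NM.
                        return runningApps == 0 ? State.DECOMMISSIONED : current;
                    case RECONNECTED:    // assumption: reconnect is a separate event
                    case DECOMMISSIONED: // timeout from the CLI
                        return State.DECOMMISSIONED;
                    case RECOMMISSION:   // user cancels the decommission
                        return State.RUNNING;
                    default:
                        return current;
                }
            default:
                return current;
        }
    }
}
```

For example, next(State.DECOMMISSIONING, Event.RESOURCE_UPDATE, 2) keeps the node in DECOMMISSIONING, while the same call with runningApps equal to 0 moves it to DECOMMISSIONED.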

This message was sent by Atlassian JIRA