hadoop-yarn-issues mailing list archives

From "Devaraj K (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
Date Mon, 09 Feb 2015 17:30:36 GMT

    [ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312507#comment-14312507 ]

Devaraj K commented on YARN-41:

Thanks a lot [~djp] for your additional comments.

bq. your patch is actually working on decommission node, not "shutdown" (let's define call
yarn daemon stop or kill -9 on NodeManager as shutdown, just for get rid of any confusion),

The patch is not for decommissioning the NM. It handles 'yarn-daemon.sh stop nodemanager'
and 'kill nmPid' (but not 'kill -9 nmPid', i.e. an abrupt kill).

bq. so the patch here shouldn't affect the work on YARN-1336 (containers can still be running
after "shutdown" NM, which is different from decommission).
I have taken your comment above into account, i.e. excluding the unregister call to the
RM when NM recovery is enabled, so that running containers can continue executing. I will include
this in the next patch once we reach a conclusion on the comments.
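
As a rough illustration of the behaviour described above, here is a minimal sketch of an NM-side stop path that unregisters on a graceful stop but skips the call when recovery is enabled. The names `ResourceTrackerClient`, `StatusUpdaterSketch` and `recoveryEnabled` are hypothetical stand-ins for illustration only; they are not the classes or fields used in the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the NM -> RM protocol; not the real ResourceTracker API.
interface ResourceTrackerClient {
  void unRegisterNodeManager(String nodeId);
}

// Hypothetical model of the NM-side status updater's stop logic.
class StatusUpdaterSketch {
  private final String nodeId;
  private final boolean recoveryEnabled;
  private final ResourceTrackerClient rm;

  StatusUpdaterSketch(String nodeId, boolean recoveryEnabled,
      ResourceTrackerClient rm) {
    this.nodeId = nodeId;
    this.recoveryEnabled = recoveryEnabled;
    this.rm = rm;
  }

  // Runs on a graceful stop (e.g. a plain 'kill'); an abrupt 'kill -9'
  // gives the process no chance to run this at all.
  void serviceStop() {
    if (recoveryEnabled) {
      // Containers may keep running and be recovered after a restart,
      // so the registration with the RM is left in place.
      return;
    }
    rm.unRegisterNodeManager(nodeId);
  }
}

class StatusUpdaterDemo {
  public static void main(String[] args) {
    List<String> unregistered = new ArrayList<>();
    // The method reference records each node id passed to the RM stub.
    new StatusUpdaterSketch("nm1:1234", false, unregistered::add).serviceStop();
    new StatusUpdaterSketch("nm2:1234", true, unregistered::add).serviceStop();
    System.out.println("Unregistered nodes: " + unregistered);
  }
}
```

In this sketch only the node without recovery sends the unregister call; the node with recovery enabled stays registered until the RM's normal expiry handling applies.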

From what I am understanding, now the new flow in your current patch is: when user decommission
a Node, the RM heartbeat back to NM with a SHUTDOWN message, NM prepare service stop and send
a unRegister message to RM (via RPC call) again before it killing itself and RM (ResourceTrackerService)
try to do some cleanup work. 
IMO, there are several concerns with this approach:
1. Another round of RPC between (NM and RM) is unnecessary, RM could do the same thing (code
in unRegisterNodeManager()) during sending SHUTDOWN message back.
2. Some work is already being covered (like sending DECOMMISSION event to RMNode) in NodeListManager
when doing decommission (refresh) node operation. It seems new work in unRegisterNodeManager()
only be unregister in NMLivenessMonitor.
When a user decommissions a node, the RM sends the SHUTDOWN message to the NM as part of the heartbeat
response, and the NM then calls unRegisterNodeManager() on the RM during its NodeStatusUpdaterImpl.serviceStop().
In this case, ResourceTrackerService doesn't perform any action, because the node has already been
removed from this.rmContext.getRMNodes() as part of DeactivateNodeTransition for the DECOMMISSION
event. I haven't excluded this case from unRegisterNodeManager(), so that the NM keeps the complete
set of life cycle methods (registerNodeManager, nodeHeartbeat, unRegisterNodeManager).
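
The decommission case described above can be modelled roughly as follows. This is a simplified sketch, not the real ResourceTrackerService: `TrackerSketch`, its string node ids and the `decommission` helper are assumptions made for illustration. It only shows that an unregister call arriving after the node has left the active map becomes a no-op.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the RM-side bookkeeping.
class TrackerSketch {
  // Stands in for the map of active nodes kept in the RM context.
  private final Map<String, String> rmNodes = new ConcurrentHashMap<>();

  void registerNodeManager(String nodeId) {
    rmNodes.put(nodeId, "RUNNING");
  }

  // Models the DECOMMISSION handling removing the node from the active map.
  void decommission(String nodeId) {
    rmNodes.remove(nodeId);
  }

  // Returns true if the node was known and has been cleaned up,
  // false if it was not found and the request was ignored.
  boolean unRegisterNodeManager(String nodeId) {
    return rmNodes.remove(nodeId) != null;
  }

  boolean isActive(String nodeId) {
    return rmNodes.containsKey(nodeId);
  }
}

class TrackerDemo {
  public static void main(String[] args) {
    TrackerSketch rm = new TrackerSketch();
    rm.registerNodeManager("nm1:1234");
    rm.registerNodeManager("nm2:1234");

    // Graceful stop: the node is still known, so the RM cleans it up.
    System.out.println("graceful stop handled: "
        + rm.unRegisterNodeManager("nm1:1234"));

    // Decommission first, then the NM's unregister call arrives too late.
    rm.decommission("nm2:1234");
    System.out.println("unregister after decommission handled: "
        + rm.unRegisterNodeManager("nm2:1234"));
  }
}
```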

{code:title= ResourceTrackerService|borderStyle=solid}
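  public UnRegisterNodeManagerResponse unRegisterNodeManager(
      UnRegisterNodeManagerRequest request) throws YarnException, IOException {
    UnRegisterNodeManagerResponse response = recordFactory
        .newRecordInstance(UnRegisterNodeManagerResponse.class);
    NodeId nodeId = request.getNodeId();
    RMNode rmNode = this.rmContext.getRMNodes().get(nodeId);
    // The node may already have been removed, e.g. by DeactivateNodeTransition
    // after a DECOMMISSION event, in which case there is nothing to clean up.
    if (rmNode == null) {
      LOG.info("Node not found, ignoring the unregister from node id : "
          + nodeId);
      return response;
    }
    ...
  }
{code}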

> The RM should handle the graceful shutdown of the NM.
> -----------------------------------------------------
>                 Key: YARN-41
>                 URL: https://issues.apache.org/jira/browse/YARN-41
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, resourcemanager
>            Reporter: Ravi Teja Ch N V
>            Assignee: Devaraj K
>         Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch,
>                      YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41.patch
> Instead of waiting for the NM expiry, RM should remove and handle the NM, which is shutdown
