hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1343) NodeManagers additions/restarts are not reported as node updates in AllocateResponse responses to AMs
Date Wed, 30 Oct 2013 20:21:26 GMT

    [ https://issues.apache.org/jira/browse/YARN-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809565#comment-13809565 ]

Bikas Saha commented on YARN-1343:

It looks like in the reconnect-with-different-capability case we will end up sending two
NODE_USABLE events for the same node:
        rmNode.context.getRMNodes().put(newNode.getNodeID(), newNode);
            new RMNodeEvent(newNode.getNodeID(), RMNodeEventType.STARTED)); // <=== First instance, when this triggers the ADD_NODE_Transition
          new NodesListManagerEvent(
              NodesListManagerEventType.NODE_USABLE, rmNode)); // <=== Second instance

So we could probably move the second instance into the first if statement, where the
NodeAddedSchedulerEvent is also sent. That would handle the case of the same node coming back,
while the STARTED event in the else branch would cover the case of a different node with the
same node name coming back (the same as a new node being added).
if (rmNode.getTotalCapability().equals(newNode.getTotalCapability())
          && rmNode.getHttpPort() == newNode.getHttpPort()) {
        // Reset heartbeat ID since node just restarted.
        if (rmNode.getState() != NodeState.UNHEALTHY) {
          // Only add new node if old state is not UNHEALTHY
              new NodeAddedSchedulerEvent(rmNode));

I modified the test case in the patch to try a reconnect with a different capability, and the
above issue showed up.
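
To make the double-send concrete, here is a minimal, self-contained sketch that models the two branches of the reconnect handling and counts NODE_USABLE events. It is not the RMNodeImpl code: ReconnectSketch, Node, Event, reconnect and countUsable are hypothetical names, events are collected in a list instead of going through the dispatcher, and the UNHEALTHY check is left out. It assumes, as described above, that STARTED triggers the add-node transition, which itself reports the node as usable.

```java
import java.util.ArrayList;
import java.util.List;

class ReconnectSketch {
    // Hypothetical stand-ins for the events named in the comment; not YARN classes.
    enum Event { NODE_ADDED_SCHEDULER, STARTED, NODE_USABLE }

    // Only the two fields that the reconnect check compares.
    static final class Node {
        final int capabilityMb;
        final int httpPort;

        Node(int capabilityMb, int httpPort) {
            this.capabilityMb = capabilityMb;
            this.httpPort = httpPort;
        }
    }

    // Returns the events sent for one reconnect.
    // proposed == false models the patch as reviewed; true models the suggested move.
    static List<Event> reconnect(Node oldNode, Node newNode, boolean proposed) {
        List<Event> sent = new ArrayList<>();
        boolean sameNode = oldNode.capabilityMb == newNode.capabilityMb
                && oldNode.httpPort == newNode.httpPort;
        if (sameNode) {
            sent.add(Event.NODE_ADDED_SCHEDULER);
            if (proposed) {
                // Suggested placement: next to the NodeAddedSchedulerEvent.
                sent.add(Event.NODE_USABLE);
            }
        } else {
            sent.add(Event.STARTED);
            // Assumption from the comment: STARTED triggers the add-node
            // transition, which reports the node as usable (first instance).
            sent.add(Event.NODE_USABLE);
        }
        if (!proposed) {
            // Unconditional send at the end of the transition (second instance).
            sent.add(Event.NODE_USABLE);
        }
        return sent;
    }

    static int countUsable(List<Event> sent) {
        int n = 0;
        for (Event e : sent) {
            if (e == Event.NODE_USABLE) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        Node before = new Node(8192, 8042);
        Node resized = new Node(4096, 8042); // same node name, different capability
        System.out.println("as reviewed: " + reconnect(before, resized, false));
        System.out.println("proposed:    " + reconnect(before, resized, true));
    }
}
```

In this model the different-capability reconnect yields two NODE_USABLE events with the unconditional send and one with the suggested placement, while the same-capability reconnect yields one either way.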

> NodeManagers additions/restarts are not reported as node updates in AllocateResponse responses to AMs
> -----------------------------------------------------------------------------------------------------
>                 Key: YARN-1343
>                 URL: https://issues.apache.org/jira/browse/YARN-1343
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>            Priority: Critical
>             Fix For: 2.2.1
>         Attachments: YARN-1343.patch, YARN-1343.patch
> If a NodeManager joins the cluster or gets restarted, running AMs never receive the node update indicating the Node is running.
