hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1885) RM may not send the finished signal to some nodes where the application ran after RM restarts
Date Tue, 29 Apr 2014 20:05:19 GMT

    [ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984722#comment-13984722 ]

Jian He commented on YARN-1885:
-------------------------------

Thanks for the update!
- Some places exceed the 80-column limit, e.g. the RMAppImpl transitions.
- app.isAppFinalStateStored(): would it be better to use isAppInFinalState instead?
- Sleeping for a fixed amount of time is not deterministic, so the test may fail randomly. It's
better to send heartbeats in a while loop and exit the loop once the condition is met.
{code}
    // sleep for a while before do next heartbeat
    Thread.sleep(1000);
    NodeHeartbeatResponse response = nm1.nodeHeartbeat(true);
{code}
- timeout = 600000 is too long.
- Can these two transitions actually happen? Generally, we should not add events to states where
the transitions can never happen; that will hide bugs.
{code}
    .addTransition(RMAppState.NEW, RMAppState.NEW, RMAppEventType.NODE_ADDED,
        new NodeAddedTransition())
    .addTransition(RMAppState.NEW_SAVING, RMAppState.NEW_SAVING, RMAppEventType.NODE_ADDED,
        new NodeAddedTransition())
{code}
- These two loops may block the register RPC call for a while. I think we could send them as
the payload of RMNodeStartEvent and handle them in RMNodeAddTransition?
{code}
    // Handle container statuses reported by NM
    if (!request.getContainerStatuses().isEmpty()) {
      LOG.info("received container statuses on node manager register :"
          + request.getContainerStatuses());
      for (ContainerStatus containerStatus : request.getContainerStatuses()) {
        handleContainerStatus(containerStatus);
      }
    }
    
    // Handle running applications reported by NM
    if (null != request.getRunningApplications()) {
      for (ApplicationId appId : request.getRunningApplications()) {
        handleRunningAppOnNode(appId, request.getNodeId());
      }
    }
{code}
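
A minimal sketch of the loop-with-heartbeats idea from the comment above, in place of a single fixed sleep. The class and method names here (HeartbeatPolling, heartbeatUntil) are hypothetical and not part of the patch or of YARN; the heartbeat and the exit condition are passed in as callbacks so the helper stays self-contained.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper, not part of YARN or of the YARN-1885 patch.
class HeartbeatPolling {

    /**
     * Sends a heartbeat, checks the condition, and repeats until the
     * condition holds or the timeout elapses.
     *
     * @return true if the condition was met, false on timeout
     */
    static boolean heartbeatUntil(Runnable heartbeat, BooleanSupplier done,
            long timeoutMs, long intervalMs) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            heartbeat.run();
            if (done.getAsBoolean()) {
                return true;              // exit as soon as the condition is met
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;             // give up; the caller should fail the test
            }
            Thread.sleep(intervalMs);     // short pause between heartbeats
        }
    }
}
```

In a test like the one quoted above, the heartbeat callback would presumably wrap nm1.nodeHeartbeat(true) and the condition would inspect the latest response. The test then finishes as soon as the expected signal arrives, and a modest timeout bounds the worst case.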

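The last point above (carry the reported statuses as an event payload rather than looping inside the register call) could look roughly like the sketch below. Everything in it is a simplified stand-in: NodeStartedEvent, register, and drain are hypothetical names, plain strings replace ContainerStatus and ApplicationId, and a plain queue replaces the real event dispatcher.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical illustration only; these are not YARN classes.
class RegisterPayloadSketch {

    /** Event that carries what the node reported when it registered. */
    static final class NodeStartedEvent {
        final String nodeId;
        final List<String> containerStatuses;
        final List<String> runningApps;

        NodeStartedEvent(String nodeId, List<String> containerStatuses,
                List<String> runningApps) {
            this.nodeId = nodeId;
            // Copy the lists so the event does not alias the request's data.
            this.containerStatuses = new ArrayList<>(containerStatuses);
            this.runningApps = new ArrayList<>(runningApps);
        }
    }

    private final Queue<NodeStartedEvent> pending = new ArrayDeque<>();
    final List<String> handled = new ArrayList<>();

    /** RPC path: enqueue the payload and return without per-item work. */
    String register(String nodeId, List<String> statuses, List<String> apps) {
        pending.add(new NodeStartedEvent(nodeId, statuses, apps));
        return "registered:" + nodeId;
    }

    /** Dispatcher path: the per-container and per-app loops run here. */
    void drain() {
        NodeStartedEvent e;
        while ((e = pending.poll()) != null) {
            for (String s : e.containerStatuses) {
                handled.add("container " + s + " on " + e.nodeId);
            }
            for (String a : e.runningApps) {
                handled.add("app " + a + " on " + e.nodeId);
            }
        }
    }
}
```

The sketch is single-threaded to keep it deterministic; in a real service drain() would run on the dispatcher's own thread, so the register call returns after a constant amount of work regardless of how many containers or applications the node reports.
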
> RM may not send the finished signal to some nodes where the application ran after RM restarts
> ---------------------------------------------------------------------------------------------
>
>                 Key: YARN-1885
>                 URL: https://issues.apache.org/jira/browse/YARN-1885
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Wangda Tan
>         Attachments: YARN-1885.patch, YARN-1885.patch, YARN-1885.patch
>
>
> During our HA testing we have seen cases where YARN application logs are not available
> through the CLI, but I can look at the AM logs through the UI. The RM was also being restarted
> in the background while the application was running.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
