hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
Date Fri, 16 May 2014 10:54:07 GMT

    [ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999094#comment-13999094 ]

Jian He commented on YARN-2065:

I looked at the exception posted in SLIDER-34. The problem is that the AM can get new containers
from the RM, but cannot launch them on the NM because of the following method.
The token is generated with the previous container's attempt ID instead of the current attempt ID,
and the NM checks the attempt ID from the NMToken against the attempt ID from the container.
  public NMToken createAndGetNMToken(String applicationSubmitter,
      ApplicationAttemptId appAttemptId, Container container) {
    try {
      this.readLock.lock();
      HashSet<NodeId> nodeSet = this.appAttemptToNodeKeyMap.get(appAttemptId);
      NMToken nmToken = null;
      if (nodeSet != null) {
        if (!nodeSet.contains(container.getNodeId())) {
          LOG.info("Sending NMToken for nodeId : " + container.getNodeId()
              + " for container : " + container.getId());
          // The token is created with the attempt ID taken from the container,
          // not with the appAttemptId passed to this method.
          Token token =
              createNMToken(container.getId().getApplicationAttemptId(),
                container.getNodeId(), applicationSubmitter);
          nmToken = NMToken.newInstance(container.getNodeId(), token);
          nodeSet.add(container.getNodeId());
        }
      }
      return nmToken;
    } finally {
      this.readLock.unlock();
    }
  }
Changing this method will fix this problem. 
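
To make the mismatch concrete, here is a minimal sketch of the attempt-level check described above. It is an illustration only: AttemptCheckSketch, AttemptId and nmAllowsLaunch are hypothetical stand-ins, not the real YARN classes or the actual NM code path.

```java
import java.util.Objects;

// Hypothetical stand-ins for the YARN types; none of these are the real classes.
final class AttemptCheckSketch {

    // Identifies one attempt of one application.
    static final class AttemptId {
        final String appId;
        final int attemptNo;

        AttemptId(String appId, int attemptNo) {
            this.appId = appId;
            this.attemptNo = attemptNo;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof AttemptId)) {
                return false;
            }
            AttemptId other = (AttemptId) o;
            return appId.equals(other.appId) && attemptNo == other.attemptNo;
        }

        @Override
        public int hashCode() {
            return Objects.hash(appId, attemptNo);
        }
    }

    // Models the NM-side check as described in this comment: the attempt
    // recorded in the NMToken must equal the attempt of the container.
    static boolean nmAllowsLaunch(AttemptId tokenAttempt, AttemptId containerAttempt) {
        return tokenAttempt.equals(containerAttempt);
    }
}
```

Under these assumptions, a token stamped with attempt 1 is rejected for a container that belongs to attempt 2 of the same application, while a token stamped with attempt 2 is accepted.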

But another problem is that ContainerManagerImpl#authorizeGetAndStopContainerRequest also requires
the previous NMToken to talk to the previous container and the current NMToken to talk to the current
container. Luckily, it currently does not throw an exception but only logs error messages. We also
need to change the NM side to check against the applicationId rather than the attemptId.
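
A rough sketch of what an application-level check could look like, as opposed to an attempt-level one. The names AppLevelCheckSketch, AttemptId and nmAllowsRequest are hypothetical and only illustrate the idea; this is not the actual NM change.

```java
// Hypothetical stand-ins for the YARN types; none of these are the real classes.
final class AppLevelCheckSketch {

    // Identifies one attempt of one application.
    static final class AttemptId {
        final String appId;
        final int attemptNo;

        AttemptId(String appId, int attemptNo) {
            this.appId = appId;
            this.attemptNo = attemptNo;
        }
    }

    // Relaxed check: only the application part of the two IDs has to match,
    // so a token from any attempt of the same application is accepted.
    static boolean nmAllowsRequest(AttemptId tokenAttempt, AttemptId containerAttempt) {
        return tokenAttempt.appId.equals(containerAttempt.appId);
    }
}
```

With this relaxed comparison, a token from either attempt can be used with containers from either attempt of the same application, while a token from a different application is still rejected.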

> AM cannot create new containers after restart-NM token from previous attempt used
> ---------------------------------------------------------------------------------
>                 Key: YARN-2065
>                 URL: https://issues.apache.org/jira/browse/YARN-2065
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Steve Loughran
> Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new
> The Slider minicluster test {{TestKilledAM}} can replicate this reliably: it kills the
> AM, then kills a container while the AM is down, which triggers a reallocation of a container,
> leading to this failure.

This message was sent by Atlassian JIRA
