hadoop-yarn-issues mailing list archives

From "Jack Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3112) AM restart and keep containers from previous attempts, then new container launch failed
Date Fri, 30 Jan 2015 18:22:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299003#comment-14299003
] 

Jack Chen commented on YARN-3112:
---------------------------------

I have found the cause of this error. The newly launched appattempt takes over the containers
from previous attempts, so the node set kept for the attempt in NMTokenSecretManagerInRM.java
is already populated with the nodes of those containers. When the new appattempt then fetches
its allocated containers via pullNewlyAllocatedContainersAndNMTokens(), createAndGetNMToken()
returns null for any node that is already in the node set. As a result, no NMToken for that
node reaches the ContainerLauncher, so the new container fails to launch. My fix is to clear
the node set in pullNewlyAllocatedContainersAndNMTokens() before the container tokens and
NMTokens are created. Lines marked with "+" are my additions; the LOG.info line is only there
for debugging.

  public synchronized ContainersAndNMTokensAllocation
      pullNewlyAllocatedContainersAndNMTokens() {
    List<Container> returnContainerList =
        new ArrayList<Container>(newlyAllocatedContainers.size());
    List<NMToken> nmTokens = new ArrayList<NMToken>();
+   // clear the nodeset for NMTokens
+   rmContext.getNMTokenSecretManager().clearNodeSetForAttempt(getApplicationAttemptId());
    for (Iterator<RMContainer> i = newlyAllocatedContainers.iterator(); i
      .hasNext();) {
      RMContainer rmContainer = i.next();
      Container container = rmContainer.getContainer();
      try {
        // create container token and NMToken altogether.
        container.setContainerToken(rmContext.getContainerTokenSecretManager()
          .createContainerToken(container.getId(), container.getNodeId(),
            getUser(), container.getResource(), container.getPriority(),
            rmContainer.getCreationTime(), this.logAggregationContext));
        NMToken nmToken =
            rmContext.getNMTokenSecretManager().createAndGetNMToken(getUser(),
              getApplicationAttemptId(), container);
+       // check whether nmtoken is null
+       LOG.info("[hchen]NMToken for container " + container.getId() + " NMToken:" + nmToken);
        if (nmToken != null) {
          nmTokens.add(nmToken);
        }
      } catch (IllegalArgumentException e) {
        // DNS might be down, skip returning this container.
        LOG.error("Error trying to assign container token and NM token to" +
            " an allocated container " + container.getId(), e);
        continue;
      }
      returnContainerList.add(container);
      i.remove();
      rmContainer.handle(new RMContainerEvent(rmContainer.getContainerId(),
        RMContainerEventType.ACQUIRED));
    }
    return new ContainersAndNMTokensAllocation(returnContainerList, nmTokens);
  }
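
The failure mode described in this comment can be illustrated with a small standalone model.
This is only a sketch, not the Hadoop implementation: the class NMTokenNodeSetSketch, the
method recordNode(), and the use of plain strings for attempt ids, node ids and tokens are
all hypothetical. It assumes, as the comment describes, that a token is created only the
first time a node is seen for an attempt and that null is returned afterwards.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of a per-attempt node set; NOT the real NMTokenSecretManagerInRM. */
public class NMTokenNodeSetSketch {
    // attempt id -> nodes for which a token has already been handed out
    private final Map<String, Set<String>> nodesPerAttempt = new HashMap<>();

    /** Hypothetical helper: models containers transferred from a previous attempt. */
    public void recordNode(String attemptId, String nodeId) {
        nodesPerAttempt.computeIfAbsent(attemptId, k -> new HashSet<>()).add(nodeId);
    }

    /** Returns a token the first time a node is seen for the attempt, null afterwards. */
    public String createAndGetToken(String attemptId, String nodeId) {
        Set<String> nodes = nodesPerAttempt.computeIfAbsent(attemptId, k -> new HashSet<>());
        if (!nodes.add(nodeId)) {
            return null; // node already in the set: no new token is issued
        }
        return "token-for-" + nodeId;
    }

    /** Models the fix: forget which nodes already received a token for this attempt. */
    public void clearNodeSetForAttempt(String attemptId) {
        Set<String> nodes = nodesPerAttempt.get(attemptId);
        if (nodes != null) {
            nodes.clear();
        }
    }

    public static void main(String[] args) {
        NMTokenNodeSetSketch manager = new NMTokenNodeSetSketch();
        String attempt = "appattempt_02";
        String node = "ixk02:47625";

        // The new attempt inherits a container on this node from the previous attempt.
        manager.recordNode(attempt, node);

        // Without the fix, a new container on the same node gets no token.
        System.out.println("before clear: " + manager.createAndGetToken(attempt, node));

        // With the fix, the node set is cleared first, so a token is issued.
        manager.clearNodeSetForAttempt(attempt);
        System.out.println("after clear:  " + manager.createAndGetToken(attempt, node));
    }
}
```

In this model the first call returns null and the second returns a token, which mirrors the
behaviour seen with and without the clearNodeSetForAttempt() call in the patch above. Whether
clearing the set on every pull is the right trade-off in the real ResourceManager is a
separate question that the sketch does not address.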

> AM restart and keep containers from previous attempts, then new container launch failed
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-3112
>                 URL: https://issues.apache.org/jira/browse/YARN-3112
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications, resourcemanager
>    Affects Versions: 2.6.0
>         Environment: in real linux cluster
>            Reporter: Jack Chen
>
> This error is very similar to YARN-1795 and YARN-1839, but I have checked the fixes for
> those JIRAs and the patches are already included in my version. I think this error is caused
> by the NMTokens differing between the old and new appattempts. The new AM inherits the old
> tokens from the previous AM because of my configuration (keepContainers=true), so the tokens
> for new containers are replaced by the old ones in the NMTokenCache.
> 2015-01-29 10:04:49,603 ERROR [ContainerLauncher #0] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1422546145900_0001_02_000002 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for ixk02:47625
>     at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:256)
>     at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtocolProxy.java:246)
>     at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:132)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:401)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:367)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
