hadoop-yarn-issues mailing list archives

From "Yevhenii Semenov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3112) AM restart and keep containers from previous attempts, then new container launch failed
Date Wed, 14 Dec 2016 17:20:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15748904#comment-15748904 ]

Yevhenii Semenov commented on YARN-3112:

[~xtchenhui], thanks for your investigation and fix!

I hit a similar issue when I kill the AM process with {noformat}kill -9 process_id{noformat} and
the RM recovers it. I'm not sure I'm dealing with the same root cause, but your fix
helps me too. However, I would like to clarify one important point. According to *"Apache
Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2"*:

As for network optimization, NMTokens are not sent to the ApplicationMasters for each and
every allocated container, but only for the first time or if NMTokens have to be invalidated
due to the rollover of the underlying master key

If you clear the node set in _"pullNewlyAllocatedContainersAndNMTokens"_, then the RM generates new
tokens for every allocated container. In my view, the fix may cause a regression in this network
optimization. What do you think?
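
To make the concern concrete, here is a minimal sketch of the bookkeeping described above. It is a simplified model, not the actual ResourceManager code: the class {{NmTokenSketch}}, its methods, and the {{clearNodeSetAfterPull}} flag are hypothetical names introduced only for illustration. The model assumes the RM remembers which nodes an attempt already holds a token for and forgets them on master key rollover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of per-attempt NMToken bookkeeping; not the real RM classes.
class NmTokenSketch {
    // Nodes for which this application attempt has already been sent an NMToken.
    private final Set<String> nodesWithToken = new HashSet<>();
    private int tokensIssued = 0;

    // Returns the nodes that get a fresh NMToken for this batch of allocated containers.
    // clearNodeSetAfterPull models the behaviour discussed in this comment (an assumption).
    List<String> pullTokens(List<String> containerNodes, boolean clearNodeSetAfterPull) {
        List<String> issuedFor = new ArrayList<>();
        for (String node : containerNodes) {
            if (nodesWithToken.add(node)) { // true only the first time this node is seen
                issuedFor.add(node);
                tokensIssued++;
            }
        }
        if (clearNodeSetAfterPull) {
            nodesWithToken.clear(); // the next pull treats every node as new again
        }
        return issuedFor;
    }

    // A master key rollover invalidates tokens, so every node needs a new one afterwards.
    void rollMasterKey() {
        nodesWithToken.clear();
    }

    int tokensIssued() {
        return tokensIssued;
    }

    public static void main(String[] args) {
        List<String> node = Arrays.asList("ixk02:47625");

        NmTokenSketch optimized = new NmTokenSketch();
        optimized.pullTokens(node, false);
        optimized.pullTokens(node, false);
        System.out.println("without clearing: " + optimized.tokensIssued());

        NmTokenSketch cleared = new NmTokenSketch();
        cleared.pullTokens(node, true);
        cleared.pullTokens(node, true);
        System.out.println("with clearing: " + cleared.tokensIssued());
    }
}
```

In this model, two successive allocations on the same node cost one token without clearing and two tokens with clearing, which is the extra traffic I am worried about.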

I'm going to investigate the issue too. I will update the Jira if I find something interesting.

> AM restart and keep containers from previous attempts, then new container launch failed
> ---------------------------------------------------------------------------------------
>                 Key: YARN-3112
>                 URL: https://issues.apache.org/jira/browse/YARN-3112
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications, resourcemanager
>    Affects Versions: 2.6.0
>         Environment: in real linux cluster
>            Reporter: Jack Chen
> This error is very similar to YARN-1795 and YARN-1839, but I have checked the solutions for
> those JIRAs, and the patches are already included in my version. I think this error is caused by
> different NMTokens in the old and new app attempts. The new AM inherits the old tokens
> from the previous AM because of my configuration (keepContainers=true), so the tokens for new
> containers are replaced by the old ones in the NMTokenCache.
> {noformat}
> 2015-01-29 10:04:49,603 ERROR [ContainerLauncher #0] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1422546145900_0001_02_000002 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for ixk02:47625
>     at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProt
>     at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.<init>(ContainerManagementProtoc
>     at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:132)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:401)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
>     at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:367)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:722)
> {noformat}
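
The failure in the trace above can be pictured as a lookup in a per-node token cache that finds nothing for the target address. The following is a rough sketch of that idea only. The class {{NmTokenLookupSketch}}, its methods, and the use of plain strings for tokens are hypothetical, and the real client throws {{SecretManager$InvalidToken}} rather than the standard exception used here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an AM-side NMToken cache keyed by node address.
class NmTokenLookupSketch {
    private final Map<String, String> tokensByNode = new HashMap<>();

    // Called when the RM hands the AM a token for a node.
    void setToken(String nodeAddr, String token) {
        tokensByNode.put(nodeAddr, token);
    }

    // Called before opening a proxy to the NodeManager at nodeAddr.
    String tokenFor(String nodeAddr) {
        String token = tokensByNode.get(nodeAddr);
        if (token == null) {
            // Mirrors the message in the log; the exception type is a simplification.
            throw new IllegalStateException("No NMToken sent for " + nodeAddr);
        }
        return token;
    }

    public static void main(String[] args) {
        NmTokenLookupSketch cache = new NmTokenLookupSketch();
        try {
            cache.tokenFor("ixk02:47625"); // nothing cached for this node yet
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the RM believes the attempt already holds a token for a node and therefore does not send one, a lookup like this fails at container launch time.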
