hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3809) Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
Date Tue, 16 Jun 2015 06:48:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14587555#comment-14587555
] 

Jun Gong commented on YARN-3809:
--------------------------------

The stack trace is as follows:
{noformat}
2015-06-15 11:16:35,376 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Error cleaning master
org.apache.hadoop.net.ConnectTimeoutException: Call From docker-10-240-139-221/10.240.139.221 to docker-10-240-139-234:8041 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=docker-10-240-139-234/10.240.139.234:8041]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.GeneratedConstructorAccessor107.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
        at org.apache.hadoop.ipc.Client.call(Client.java:1414)
        at org.apache.hadoop.ipc.Client.call(Client.java:1363)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
        at com.sun.proxy.$Proxy36.stopContainers(Unknown Source)
        at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:138)
        at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:263)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=docker-10-240-139-234/10.240.139.234:8041]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
        at org.apache.hadoop.ipc.Client.call(Client.java:1381)
        ... 9 more
{noformat}

The connect timeout is 20 seconds, but the client retries 45 times (IPC_CLIENT_CONNECT_MAX_RETRIES_ON_SOCKET_TIMEOUTS_DEFAULT
= 45), so a single call can block for about 15 minutes (45 x 20 s = 900 s).
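
To make the arithmetic concrete, here is a minimal sketch of the worst-case blocking time for one call. It assumes the two values quoted above (a 20000 ms connect timeout and 45 attempts) and ignores any sleep between retries. The class and method names are hypothetical and are not part of Hadoop.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not Hadoop code.
class ConnectTimeoutBudget {
    // Values quoted in the log and comment above, used here as plain constants.
    static final int CONNECT_TIMEOUT_MILLIS = 20_000;
    static final int MAX_RETRIES_ON_SOCKET_TIMEOUTS = 45;

    /**
     * Worst-case time one call spends waiting on connect attempts,
     * treating the retry count as the number of attempts and ignoring
     * any pause between them.
     */
    static Duration worstCaseBlocking(int timeoutMillis, int attempts) {
        return Duration.ofMillis((long) timeoutMillis * attempts);
    }

    public static void main(String[] args) {
        Duration d = worstCaseBlocking(CONNECT_TIMEOUT_MILLIS, MAX_RETRIES_ON_SOCKET_TIMEOUTS);
        System.out.println(d.toMinutes() + " minutes");
    }
}
```

Lowering either factor shrinks the window proportionally; for example, 20 seconds with 3 attempts would bound the wait at 1 minute.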

> Failed to launch new attempts because ApplicationMasterLauncher's threads all hang
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-3809
>                 URL: https://issues.apache.org/jira/browse/YARN-3809
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>
> ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType
> events (LAUNCH and CLEANUP).
> In our cluster, there were many NMs with 10+ AMs running on each, and one of them shut down
> for some reason. After the RM found the NM LOST, it cleaned up the AMs running on it, so
> ApplicationMasterLauncher had to handle these 10+ CLEANUP events. ApplicationMasterLauncher's
> thread pool filled up, and all of its threads hung in
> containerMgrProxy.stopContainers(stopRequest) because the NM was down; the default RPC
> timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not
> handle new events such as LAUNCH, so new attempts failed to launch because of timeouts.
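
The starvation described in the issue can be reproduced in isolation with a fixed-size pool. This is only a sketch, not the ResourceManager code: the class name, the latch standing in for a NodeManager that never answers, and the 500 ms wait are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it only mimics the shape of the problem.
class LauncherPoolStarvation {
    static final int POOL_SIZE = 10; // pool size mentioned in the issue description

    /**
     * Submits blockedCleanups tasks that block (standing in for
     * stopContainers() against a dead NM), then one LAUNCH-like task.
     * Returns true if the LAUNCH-like task starts within waitMillis.
     */
    static boolean launchStartsWhileCleanupsBlock(int blockedCleanups, long waitMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
        CountDownLatch nodeManagerResponds = new CountDownLatch(1); // released only in finally
        CountDownLatch launchStarted = new CountDownLatch(1);
        try {
            for (int i = 0; i < blockedCleanups; i++) {
                pool.execute(() -> {
                    try {
                        nodeManagerResponds.await(); // simulated hung RPC
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            pool.execute(launchStarted::countDown); // simulated LAUNCH event
            return launchStarted.await(waitMillis, TimeUnit.MILLISECONDS);
        } finally {
            nodeManagerResponds.countDown(); // unblock the simulated cleanups
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("9 blocked cleanups:  launch started = "
                + launchStartsWhileCleanupsBlock(9, 500));
        System.out.println("10 blocked cleanups: launch started = "
                + launchStartsWhileCleanupsBlock(10, 500));
    }
}
```

With 9 blocked tasks one worker is still free, so the LAUNCH-like task runs; with 10 it sits in the queue until a blocked task returns, which is the behavior reported here.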



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
