hadoop-user mailing list archives

From Andrew Johnson <ajohn...@etsy.com>
Subject Re: ApplicationMaster Retrying Connection to Dead Node
Date Tue, 17 Mar 2015 15:05:02 GMT
I had tried applying the patch from
https://issues.apache.org/jira/browse/HADOOP-6221, as that seemed somewhat
relevant.  Unfortunately that did not fix my issue.

Does anyone have any other suggestions for how to resolve this?
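
To put rough numbers on the retry behavior (details in my quoted message
below): each IPC-level connect cycle against the dead node takes about
connect-timeout times retry-count before failing, and the count then appears
to restart because a higher-level retry policy re-invokes the call. A
back-of-the-envelope sketch (the 20 s timeout is from the
ConnectTimeoutException message; treating "maxRetries=20" as 20 attempts is
my reading of the logs):

```python
# Values observed in the AM logs / exception message.
CONNECT_TIMEOUT_S = 20   # "20000 millis timeout while waiting for channel..."
IPC_ATTEMPTS = 20        # ipc.client.connect.max.retries.on.timeouts

# Worst-case wall time for one IPC Client.call against a dead NodeManager.
per_call_s = CONNECT_TIMEOUT_S * IPC_ATTEMPTS
print(per_call_s)        # 400 seconds, i.e. roughly 6.7 minutes per call

# If an outer retry layer re-invokes the call, that delay multiplies; with
# hundreds of containers to clean up, hours of total delay seems plausible.
```

If that reading is right, lowering ipc.client.connect.max.retries.on.timeouts
in core-site.xml (and the yarn.client.nodemanager-connect.* settings in
yarn-site.xml) should shrink the per-call cost, though I haven't confirmed
which layer is driving the restart.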

On Sat, Mar 14, 2015 at 9:56 AM, Andrew Johnson <ajohnson@etsy.com> wrote:

> Hey everyone,
>
> I have encountered a troubling issue after a node in my YARN cluster died
> due to a hardware failure while several MR jobs were running.  These jobs
> took over four hours longer than expected to finish.  After investigating,
> I found that the ApplicationMaster for these jobs had spent those four
> hours retrying the connection to the dead node.  I see this repeated in
> the AM logs for that entire period:
>
> 2015-03-14 07:07:28,435 INFO [ContainerLauncher #293]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 16 time(s); maxRetries=20
> 2015-03-14 07:07:28,545 INFO [ContainerLauncher #235]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
> 2015-03-14 07:07:29,202 INFO [ContainerLauncher #261]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
> 2015-03-14 07:07:31,074 INFO [ContainerLauncher #283]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
> 2015-03-14 07:07:31,110 INFO [ContainerLauncher #278]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
> 2015-03-14 07:07:46,093 INFO [ContainerLauncher #167]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
> 2015-03-14 07:07:48,455 INFO [ContainerLauncher #293]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
> 2015-03-14 07:07:49,223 INFO [ContainerLauncher #261]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
> 2015-03-14 07:07:51,095 INFO [ContainerLauncher #283]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
> 2015-03-14 07:07:51,116 INFO [ContainerLauncher #278]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
> 2015-03-14 07:08:06,097 INFO [ContainerLauncher #167]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
> 2015-03-14 07:08:08,476 INFO [ContainerLauncher #293]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
> 2015-03-14 07:08:09,243 INFO [ContainerLauncher #261]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
> 2015-03-14 07:08:11,115 INFO [ContainerLauncher #283]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
> 2015-03-14 07:08:11,120 INFO [ContainerLauncher #278]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
> 2015-03-14 07:08:18,569 INFO [ContainerLauncher #235]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
> 2015-03-14 07:08:26,118 INFO [ContainerLauncher #167]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
> 2015-03-14 07:08:28,495 INFO [ContainerLauncher #293]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
> 2015-03-14 07:08:29,264 INFO [ContainerLauncher #261]
> org.apache.hadoop.ipc.Client: Retrying connect to server:
> dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
>
> Eventually the following exception appeared in the AM logs and the job
> completed successfully:
> 2015-03-14 07:23:09,239 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
> report from attempt_1423675803126_127109_m_001910_0: cleanup failed for
> container container_1423675803126_127109_01_004637 :
> org.apache.hadoop.net.ConnectTimeoutException: Call From
> am.node.host/am.node.ip to dead.node.host:47936 failed on socket timeout
> exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
> timeout while waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=dead.node.host/dead.node.ip:47936]; For more details see:
> http://wiki.apache.org/hadoop/SocketTimeout
> at sun.reflect.GeneratedConstructorAccessor60.newInstance(Unknown Source)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
> at org.apache.hadoop.ipc.Client.call(Client.java:1415)
> at org.apache.hadoop.ipc.Client.call(Client.java:1364)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy39.stopContainers(Unknown Source)
> at
> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
> at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy40.stopContainers(Unknown Source)
> at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206)
> at
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
> timeout while waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=dead.node.host/dead.node.ip:47936]
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
> at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
> at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> ... 15 more
>
> It looks to me like the tasks that had been running on the dead node were
> restarted elsewhere, and the AM was then attempting to clean up the
> original containers.  However, since the node was dead, the connection
> could never succeed.
>
> I have yarn.client.nodemanager-connect.max-wait-ms set to 300000 (5
> minutes) and ipc.client.connect.max.retries.on.timeouts set to 20.  I see
> the connection retried 20 times in the logs, but then the count starts
> over from 0.  I would also expect the AM to give up on connecting much
> sooner.  For instance, the ResourceManager marked the node as dead after
> 10 minutes, as expected; I'd like to see the AM do the same.
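>
> For reference, these are the relevant entries as I have them set (my
> understanding is that the first lives in yarn-site.xml and the second in
> core-site.xml):
>
>   <property>
>     <name>yarn.client.nodemanager-connect.max-wait-ms</name>
>     <value>300000</value>
>   </property>
>
>   <property>
>     <name>ipc.client.connect.max.retries.on.timeouts</name>
>     <value>20</value>
>   </property>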
>
> Has anyone encountered this behavior before?
>
> Thanks!
>
> --
> Andrew Johnson
>



-- 
Andrew Johnson
Software Engineer, Etsy
