hadoop-mapreduce-user mailing list archives

From Andrew Johnson <ajohn...@etsy.com>
Subject Re: ApplicationMaster Retrying Connection to Dead Node
Date Wed, 18 Mar 2015 13:57:14 GMT
I've tracked down the cause of the problem I was experiencing.

Two levels of retries were coming into play here.  The first is
controlled by the setting ipc.client.connect.max.retries.on.timeouts, which
I have set to 20.  This is used by org.apache.hadoop.ipc.Client when it is
attempting to connect to the dead node.  I observed about twenty seconds
between each of these retries, for a total of about 7 minutes spent
attempting to connect.
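
To put numbers on that first level, here is a minimal sketch of the
arithmetic (my assumption: each attempt times out after
ipc.client.connect.timeout, 20000 ms by default, which matches the "20000
millis timeout" in the exception quoted below):

  // Level 1: the IPC client retrying a single connection to the dead node.
  public class IpcClientRetryMath {
      public static void main(String[] args) {
          int maxRetriesOnTimeouts = 20;  // ipc.client.connect.max.retries.on.timeouts (my value)
          long connectTimeoutMs = 20000L; // ipc.client.connect.timeout (default; an assumption)
          long totalMs = maxRetriesOnTimeouts * connectTimeoutMs;
          System.out.println(totalMs + " ms"); // 400000 ms, i.e. roughly 7 minutes
      }
  }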

When that retry limit is reached, the IPC client throws a
ConnectTimeoutException.  This propagates upwards to the
RetryInvocationHandler, which is using a different retry policy, created by
the NMProxy class.  This retry policy is controlled by two properties:
 yarn.client.nodemanager-connect.max-wait-ms and
yarn.client.nodemanager-connect.retry-interval-ms.
I had these set to 300000 and 10000, respectively.  Both the names and the
code suggest that this would set an upper bound on the time spent
retrying.  When a ConnectTimeoutException is thrown, a
RetryUpToMaximumTimeWithFixedSleep policy is used.  However, there is not
actually a maximum time limit.  Instead, the value of
yarn.client.nodemanager-connect.max-wait-ms is divided by the value of
yarn.client.nodemanager-connect.retry-interval-ms to compute a fixed number
of retries, regardless of how long each retry actually takes.  In my case
this produced 30 total retries, with 10 seconds between each.  Since each
retry goes through the full ~7 minute IPC-level cycle described above, the
AM would spend around 3.5 hours total attempting to connect to the dead
node, which lines up well with the observed behavior.

I fixed this by changing yarn.client.nodemanager-connect.max-wait-ms to
20000, so there would only be two retries at the higher level.  This brings
the total time the AM spends attempting to connect to a dead node down to
around 15 minutes.
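
To make the worst case concrete, here is a small sketch of the arithmetic
for both the old and new settings (the ~7 minutes per attempt is the
level-one math from above; this is my paraphrase, not the actual Hadoop
code):

  // Level 2: the NMProxy policy turns max-wait-ms into a retry *count*.
  public class NmProxyRetryMath {
      static long worstCaseMs(long maxWaitMs, long retryIntervalMs) {
          // RetryUpToMaximumTimeWithFixedSleep fixes the number of retries up
          // front; no wall-clock deadline is checked while retrying.
          long maxRetries = maxWaitMs / retryIntervalMs;
          long perAttemptMs = 20L * 20000L; // ~7 minutes of IPC-level retries per attempt
          return maxRetries * (perAttemptMs + retryIntervalMs);
      }
      public static void main(String[] args) {
          System.out.println(worstCaseMs(300000L, 10000L) / 60000); // 205 min, ~3.5 hours
          System.out.println(worstCaseMs(20000L, 10000L) / 60000);  // 13 min, ~15 observed
      }
  }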

There is also a yarn.resourcemanager.connect.max-wait.ms property that
appears to behave the same way.  I've opened a JIRA to clarify the naming
and documentation of these configuration properties:
https://issues.apache.org/jira/browse/YARN-3364

On Tue, Mar 17, 2015 at 11:05 AM, Andrew Johnson <ajohnson@etsy.com> wrote:

> I had tried applying the patch from
> https://issues.apache.org/jira/browse/HADOOP-6221, as that seemed
> somewhat relevant.  Unfortunately that did not fix my issue.
>
> Does anyone have any other suggestions for how to resolve this?
>
> On Sat, Mar 14, 2015 at 9:56 AM, Andrew Johnson <ajohnson@etsy.com> wrote:
>
>> Hey everyone,
>>
>> I have encountered a troubling issue caused by a node in my YARN cluster
>> dying.  The node died due to a hardware issue while several MR jobs were
>> running on the cluster, and I noticed that these jobs took over four
>> hours longer than expected to finish.  After investigating, I found that
>> the ApplicationMaster for these jobs spent those four hours retrying the
>> connection to the node that had died.  I see this repeated in the AM
>> logs for that entire period:
>>
>> 2015-03-14 07:07:28,435 INFO [ContainerLauncher #293]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 16 time(s); maxRetries=20
>> 2015-03-14 07:07:28,545 INFO [ContainerLauncher #235]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:07:29,202 INFO [ContainerLauncher #261]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:07:31,074 INFO [ContainerLauncher #283]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:07:31,110 INFO [ContainerLauncher #278]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:07:46,093 INFO [ContainerLauncher #167]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
>> 2015-03-14 07:07:48,455 INFO [ContainerLauncher #293]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
>> 2015-03-14 07:07:49,223 INFO [ContainerLauncher #261]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:07:51,095 INFO [ContainerLauncher #283]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:07:51,116 INFO [ContainerLauncher #278]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:08:06,097 INFO [ContainerLauncher #167]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
>> 2015-03-14 07:08:08,476 INFO [ContainerLauncher #293]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
>> 2015-03-14 07:08:09,243 INFO [ContainerLauncher #261]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:08:11,115 INFO [ContainerLauncher #283]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
>> 2015-03-14 07:08:11,120 INFO [ContainerLauncher #278]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:08:18,569 INFO [ContainerLauncher #235]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:08:26,118 INFO [ContainerLauncher #167]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:08:28,495 INFO [ContainerLauncher #293]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:08:29,264 INFO [ContainerLauncher #261]
>> org.apache.hadoop.ipc.Client: Retrying connect to server:
>> dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
>>
>> Eventually the following exception appeared in the AM logs and the job
>> completed successfully:
>> 2015-03-14 07:23:09,239 INFO [AsyncDispatcher event handler]
>> org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
>> report from attempt_1423675803126_127109_m_001910_0: cleanup failed for
>> container container_1423675803126_127109_01_004637 :
>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>> am.node.host/am.node.ip to dead.node.host:47936 failed on socket timeout
>> exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>> timeout while waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending
>> remote=dead.node.host/dead.node.ip:47936]; For more details see:
>> http://wiki.apache.org/hadoop/SocketTimeout
>> at sun.reflect.GeneratedConstructorAccessor60.newInstance(Unknown Source)
>> at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1415)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1364)
>> at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>> at com.sun.proxy.$Proxy39.stopContainers(Unknown Source)
>> at
>> org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
>> at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>> at
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>> at com.sun.proxy.$Proxy40.stopContainers(Unknown Source)
>> at
>> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206)
>> at
>> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>> timeout while waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending
>> remote=dead.node.host/dead.node.ip:47936]
>> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
>> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>> at
>> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
>> at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1382)
>> ... 15 more
>>
>> It looks to me like the tasks that had been running on the dead node were
>> restarted, and the AM was attempting to clean up their old containers.
>> However, since the node was dead, it was never able to connect.
>>
>> I have yarn.client.nodemanager-connect.max-wait-ms set to 300000 (5
>> minutes) and ipc.client.connect.max.retries.on.timeouts set to 20.  I see
>> it retry the connection 20 times in the logs, but then it starts retrying
>> from 0 again.  I would also expect the AM to give up attempting to
>> connect much sooner.  For instance, the ResourceManager recognized the
>> node as dead after 10 minutes, as expected; I'd like to see the AM do the
>> same.
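>>
>> (For reference, my assumption is that the RM's 10 minutes comes from
>> yarn.nm.liveness-monitor.expiry-interval-ms, whose default I believe is
>> 600000 ms:
>>
>>   // My assumption for the RM's dead-node detection window (YARN default):
>>   long nmExpiryMs = 600000L; // yarn.nm.liveness-monitor.expiry-interval-ms
>>   // 600000 ms / 60000 ms per minute = 10 minutes before the NM is marked lost
>>
>> so the AM's hours of retrying dwarfs the RM's 10 minute window.)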
>>
>> Has anyone encountered this behavior before?
>>
>> Thanks!
>>
>> --
>> Andrew Johnson
>>
>
>
>
> --
> Andrew Johnson
> Software Engineer, Etsy
>



-- 
Andrew Johnson
Software Engineer, Etsy
