Subject: Re: ApplicationMaster Retrying Connection to Dead Node
From: Andrew Johnson <ajohnson@etsy.com>
To: user@hadoop.apache.org
Date: Wed, 18 Mar 2015 09:57:14 -0400

I've tracked down the cause of the problem I was experiencing.

There are two levels of retries that were coming into play here. The first is controlled by the setting ipc.client.connect.max.retries.on.timeouts. I have this set to 20. This is used by org.apache.hadoop.ipc.Client when it is attempting to connect to the dead node. I observed about twenty seconds between each of these retries, giving a total of about 7 minutes spent attempting to connect.
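To make that arithmetic concrete, here is a back-of-the-envelope sketch (plain Java I wrote for illustration, not the actual Hadoop client code). The ~20 seconds per attempt is an assumption taken from the 20000 ms connect timeout that shows up in the exception further down, so treat the numbers as approximate:

    public class IpcRetryEstimate {
        public static void main(String[] args) {
            // ipc.client.connect.max.retries.on.timeouts (I have this set to 20)
            int maxRetriesOnTimeouts = 20;
            // Assumed time per connection attempt: roughly the 20000 ms connect
            // timeout reported in the ConnectTimeoutException quoted below.
            long millisPerAttempt = 20_000L;

            long totalMillis = maxRetriesOnTimeouts * millisPerAttempt;
            System.out.printf("~%d seconds (~%.1f minutes) spent on one dead-node address%n",
                    totalMillis / 1000, totalMillis / 60000.0);
            // Prints roughly 400 seconds, i.e. the ~7 minutes I saw between the
            // first and last "Retrying connect to server" lines for a given thread.
        }
    }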
When that retry limit is reached, the IPC client throws a ConnectTimeoutException. This propagates up to the RetryInvocationHandler, which uses a different retry policy, created by the NMProxy class. That policy is controlled by two properties: yarn.client.nodemanager-connect.max-wait-ms and yarn.client.nodemanager-connect.retry-interval-ms. I had these set to 300000 and 10000, respectively. Both the names and the code suggest that this would set an upper bound on the time spent retrying, and when a ConnectTimeoutException is thrown, a RetryUpToMaximumTimeWithFixedSleep policy is used. However, there is not actually a maximum time limit. Instead, the value of yarn.client.nodemanager-connect.max-wait-ms is divided by the value of yarn.client.nodemanager-connect.retry-interval-ms to compute a total number of retries, regardless of how long those retries take. In my case this produced 30 total retries, with 10 seconds between each. At about 7 minutes per retry, the AM would spend around 3.5 hours in total attempting to connect to the dead node, which lines up well with the observed behavior. (A rough sketch of this arithmetic is in the postscript at the bottom of this message.)

I fixed this by changing yarn.client.nodemanager-connect.max-wait-ms to 20000, so there are only two retries at the higher level. This brings the total time the AM spends attempting to connect to a dead node down to around 15 minutes.

There is also a yarn.resourcemanager.connect.max-wait.ms property that appears to behave the same way. I've opened a JIRA to clarify the naming and documentation of these configuration properties: https://issues.apache.org/jira/browse/YARN-3364

On Tue, Mar 17, 2015 at 11:05 AM, Andrew Johnson wrote:

> I had tried applying the patch from
> https://issues.apache.org/jira/browse/HADOOP-6221, as that seemed
> somewhat relevant. Unfortunately that did not fix my issue.
>
> Does anyone have any other suggestions for how to resolve this?
>
> On Sat, Mar 14, 2015 at 9:56 AM, Andrew Johnson wrote:
>
>> Hey everyone,
>>
>> I have encountered a troubling issue caused by a node in my cluster
>> dying. I had a node die due to a hardware issue while several MR jobs
>> were running on the cluster, which is running YARN. I noticed that these
>> jobs took over four hours longer than expected to finish. After
>> investigating I found that the ApplicationMaster for these jobs had been
>> retrying the connection to the node that had died for those four hours.
>> I see this repeated in the AM logs for that entire period:
>>
>> 2015-03-14 07:07:28,435 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 16 time(s); maxRetries=20
>> 2015-03-14 07:07:28,545 INFO [ContainerLauncher #235] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:07:29,202 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:07:31,074 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:07:31,110 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:07:46,093 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
>> 2015-03-14 07:07:48,455 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
>> 2015-03-14 07:07:49,223 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:07:51,095 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:07:51,116 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:08:06,097 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
>> 2015-03-14 07:08:08,476 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
>> 2015-03-14 07:08:09,243 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:08:11,115 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
>> 2015-03-14 07:08:11,120 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:08:18,569 INFO [ContainerLauncher #235] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:08:26,118 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:08:28,495 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:08:29,264 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
>>
>> Eventually the following exception appeared in the AM logs and the job
>> completed successfully:
>>
>> 2015-03-14 07:23:09,239 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1423675803126_127109_m_001910_0: cleanup failed for container container_1423675803126_127109_01_004637 : org.apache.hadoop.net.ConnectTimeoutException: Call From am.node.host/am.node.ip to dead.node.host:47936 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dead.node.host/dead.node.ip:47936]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
>>         at sun.reflect.GeneratedConstructorAccessor60.newInstance(Unknown Source)
>>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1415)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1364)
>>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>>         at com.sun.proxy.$Proxy39.stopContainers(Unknown Source)
>>         at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
>>         at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:606)
>>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>         at com.sun.proxy.$Proxy40.stopContainers(Unknown Source)
>>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206)
>>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dead.node.host/dead.node.ip:47936]
>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
>>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
>>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
>>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1382)
>>         ... 15 more
>>
>> It looks to me like the tasks that had been running on the dead node were
>> restarted, and the AM was attempting to clean up those tasks. However,
>> since the node was dead it would not be able to connect.
>>
>> I have yarn.client.nodemanager-connect.max-wait-ms set to 300000 (5
>> minutes) and ipc.client.connect.max.retries.on.timeouts set to 20. I see
>> it retry the connection 20 times in the logs, but then it starts retrying
>> from 0 again. Also, I would expect the AM to give up the attempt to
>> connect much sooner. For instance, the ResourceManager recognized the
>> node as dead after 10 minutes as expected. I'd like to see the AM do the
>> same.
>>
>> Has anyone encountered this behavior before?
>>
>> Thanks!
>>
>> --
>> Andrew Johnson
>
>
> --
> Andrew Johnson
> Software Engineer, Etsy

--
Andrew Johnson
Software Engineer, Etsy
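P.S. As mentioned above, here is a rough sketch of the retry-count arithmetic at the NMProxy level. This is my own simplified illustration of the behavior I observed, not the actual NMProxy or RetryPolicies code, so treat it only as an approximation:

    public class NodeManagerConnectRetryEstimate {
        public static void main(String[] args) {
            // yarn.client.nodemanager-connect.retry-interval-ms (I use 10000)
            long retryIntervalMs = 10_000L;
            // Rough cost of each attempt: the ~7 minutes the IPC layer spends
            // on its own 20 retries before giving up (see the top of this mail).
            long minutesPerAttempt = 7;

            // My original max-wait value and the value I changed it to.
            for (long maxWaitMs : new long[] {300_000L, 20_000L}) {
                // As far as I can tell, the "max wait" is simply divided by the
                // interval to get a retry count; there is no wall-clock cap.
                long retries = maxWaitMs / retryIntervalMs;
                System.out.printf("max-wait-ms=%d -> %d retries -> about %d minutes against a dead node%n",
                        maxWaitMs, retries, retries * minutesPerAttempt);
            }
            // 300000 -> 30 retries -> ~210 minutes (the ~3.5 hours I observed)
            // 20000  ->  2 retries -> ~14 minutes (roughly the 15 minutes after my change)
        }
    }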