Subject: Re: ApplicationMaster Retrying Connection to Dead Node
From: Andrew Johnson <ajohnson@etsy.com>
To: user@hadoop.apache.org
Date: Wed, 18 Mar 2015 09:57:14 -0400

I've tracked down the cause of the problem I was experiencing.

There are two levels of retries that were coming into play here. The first is controlled by the setting ipc.client.connect.max.retries.on.timeouts. I have this set to 20. This is used by org.apache.hadoop.ipc.Client when it is attempting to connect to the dead node. I observed about twenty seconds between each of these retries, giving a total of about 7 minutes spent attempting to connect.
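To make that arithmetic concrete, here is a back-of-the-envelope sketch (plain Java I wrote for illustration, not the actual Hadoop client code). The ~20 seconds per attempt is an assumption taken from the 20000 ms connect timeout that shows up in the exception further down, so treat the numbers as approximate:

    public class IpcRetryEstimate {
        public static void main(String[] args) {
            // ipc.client.connect.max.retries.on.timeouts (I have this set to 20)
            int maxRetriesOnTimeouts = 20;
            // Assumed time per connection attempt: roughly the 20000 ms connect
            // timeout reported in the ConnectTimeoutException quoted below.
            long millisPerAttempt = 20_000L;

            long totalMillis = maxRetriesOnTimeouts * millisPerAttempt;
            System.out.printf("~%d seconds (~%.1f minutes) spent on one dead-node address%n",
                    totalMillis / 1000, totalMillis / 60000.0);
            // Prints roughly 400 seconds, i.e. the ~7 minutes I saw between the
            // first and last "Retrying connect to server" lines for a given thread.
        }
    }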
When that retry limit is reached, the IPC client throws a ConnectTimeoutException. This propagates up to the RetryInvocationHandler, which uses a different retry policy, created by the NMProxy class. That policy is controlled by two properties: yarn.client.nodemanager-connect.max-wait-ms and yarn.client.nodemanager-connect.retry-interval-ms. I had these set to 300000 and 10000, respectively. Both the names and the code suggest that this would set an upper bound on the time spent retrying, and when a ConnectTimeoutException is thrown, a RetryUpToMaximumTimeWithFixedSleep policy is used. However, there is not actually a maximum time limit. Instead, the value of yarn.client.nodemanager-connect.max-wait-ms is divided by the value of yarn.client.nodemanager-connect.retry-interval-ms to compute a total number of retries, regardless of how long those retries take. In my case this produced 30 total retries, with 10 seconds between each. At about 7 minutes per retry, the AM would spend around 3.5 hours in total attempting to connect to the dead node, which lines up well with the observed behavior. (A rough sketch of this arithmetic is in the postscript at the bottom of this message.)

I fixed this by changing yarn.client.nodemanager-connect.max-wait-ms to 20000, so there are only two retries at the higher level. This brings the total time the AM spends attempting to connect to a dead node down to around 15 minutes.

There is also a yarn.resourcemanager.connect.max-wait.ms property that appears to behave the same way. I've opened a JIRA to clarify the naming and documentation of these configuration properties: https://issues.apache.org/jira/browse/YARN-3364

On Tue, Mar 17, 2015 at 11:05 AM, Andrew Johnson wrote:

> I had tried applying the patch from
> https://issues.apache.org/jira/browse/HADOOP-6221, as that seemed
> somewhat relevant. Unfortunately that did not fix my issue.
>
> Does anyone have any other suggestions for how to resolve this?
>
> On Sat, Mar 14, 2015 at 9:56 AM, Andrew Johnson wrote:
>
>> Hey everyone,
>>
>> I have encountered a troubling issue caused by a node in my cluster
>> dying. I had a node die due to a hardware issue while several MR jobs
>> were running on the cluster, which is running YARN. I noticed that these
>> jobs took over four hours longer than expected to finish. After
>> investigating I found that the ApplicationMaster for these jobs had been
>> retrying the connection to the node that had died for those four hours.
>> I see this repeated in the AM logs for that entire period:
>>
>> 2015-03-14 07:07:28,435 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 16 time(s); maxRetries=20
>> 2015-03-14 07:07:28,545 INFO [ContainerLauncher #235] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:07:29,202 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:07:31,074 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:07:31,110 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:07:46,093 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
>> 2015-03-14 07:07:48,455 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
>> 2015-03-14 07:07:49,223 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:07:51,095 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:07:51,116 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
>> 2015-03-14 07:08:06,097 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
>> 2015-03-14 07:08:08,476 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
>> 2015-03-14 07:08:09,243 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:08:11,115 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
>> 2015-03-14 07:08:11,120 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
>> 2015-03-14 07:08:18,569 INFO [ContainerLauncher #235] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
>> 2015-03-14 07:08:26,118 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:08:28,495 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
>> 2015-03-14 07:08:29,264 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
>>
>> Eventually the following exception appeared in the AM logs and the job
>> completed successfully:
>>
>> 2015-03-14 07:23:09,239 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1423675803126_127109_m_001910_0: cleanup failed for container container_1423675803126_127109_01_004637 : org.apache.hadoop.net.ConnectTimeoutException: Call From am.node.host/am.node.ip to dead.node.host:47936 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dead.node.host/dead.node.ip:47936]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
>>         at sun.reflect.GeneratedConstructorAccessor60.newInstance(Unknown Source)
>>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1415)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1364)
>>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>>         at com.sun.proxy.$Proxy39.stopContainers(Unknown Source)
>>         at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
>>         at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:606)
>>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>         at com.sun.proxy.$Proxy40.stopContainers(Unknown Source)
>>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206)
>>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dead.node.host/dead.node.ip:47936]
>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
>>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
>>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
>>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
>>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:1382)
>>         ... 15 more
>>
>> It looks to me like the tasks that had been running on the dead node were
>> restarted, and the AM was attempting to clean up those tasks. However,
>> since the node was dead it would not be able to connect.
>>
>> I have yarn.client.nodemanager-connect.max-wait-ms set to 300000 (5
>> minutes) and ipc.client.connect.max.retries.on.timeouts set to 20. I see
>> it retry the connection 20 times in the logs, but then it starts retrying
>> from 0 again. Also, I would expect the AM to give up the attempt to
>> connect much sooner. For instance, the ResourceManager recognized the
>> node as dead after 10 minutes as expected. I'd like to see the AM do the
>> same.
>>
>> Has anyone encountered this behavior before?
>>
>> Thanks!
>>
>> --
>> Andrew Johnson
>
>
> --
> Andrew Johnson
> Software Engineer, Etsy

--
Andrew Johnson
Software Engineer, Etsy
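P.S. As mentioned above, here is a rough sketch of the retry-count arithmetic at the NMProxy level. This is my own simplified illustration of the behavior I observed, not the actual NMProxy or RetryPolicies code, so treat it only as an approximation:

    public class NodeManagerConnectRetryEstimate {
        public static void main(String[] args) {
            // yarn.client.nodemanager-connect.retry-interval-ms (I use 10000)
            long retryIntervalMs = 10_000L;
            // Rough cost of each attempt: the ~7 minutes the IPC layer spends
            // on its own 20 retries before giving up (see the top of this mail).
            long minutesPerAttempt = 7;

            // My original max-wait value and the value I changed it to.
            for (long maxWaitMs : new long[] {300_000L, 20_000L}) {
                // As far as I can tell, the "max wait" is simply divided by the
                // interval to get a retry count; there is no wall-clock cap.
                long retries = maxWaitMs / retryIntervalMs;
                System.out.printf("max-wait-ms=%d -> %d retries -> about %d minutes against a dead node%n",
                        maxWaitMs, retries, retries * minutesPerAttempt);
            }
            // 300000 -> 30 retries -> ~210 minutes (the ~3.5 hours I observed)
            // 20000  ->  2 retries -> ~14 minutes (roughly the 15 minutes after my change)
        }
    }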