Subject: ConnectionException in container, happens only sometimes
From: Andrei
To: user@hadoop.apache.org
Date: Wed, 10 Jul 2013 15:02:00 +0300

Hi,

I'm running a CDH4.3 installation of Hadoop with the following simple setup:

master-host: runs the NameNode, ResourceManager and JobHistoryServer
slave-1-host and slave-2-host: run the DataNodes and NodeManagers

When I run a simple MapReduce job (either through the streaming API or the Pi example from the distribution), I see on the client that some tasks fail:

13/07/10 14:40:10 INFO mapreduce.Job:  map 60% reduce 0%
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000003_0, Status : FAILED
13/07/10 14:40:14 INFO mapreduce.Job: Task Id : attempt_1373454026937_0005_m_000005_0, Status : FAILED
...
13/07/10 14:40:23 INFO mapreduce.Job:  map 60% reduce 20%
...
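(For reference, by "simple MapReduce job" I mean nothing more elaborate than a driver along the following lines. This is only a sketch for illustration -- TrivialJob, LineMapper and SumReducer are made-up names, not the actual streaming or Pi code.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TrivialJob {

    // Emits each input line with a count of 1.
    public static class LineMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, ONE);
        }
    }

    // Sums the counts per line.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "trivial-job");
        job.setJarByClass(TrivialJob.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}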
Every time a different set of tasks/attempts fails. In some cases the number of failed attempts becomes critical and the whole job fails; in other cases the job finishes successfully. I can't see any pattern, but I noticed the following.

Let's say the ApplicationMaster runs on _slave-1-host_. In this case, on _slave-2-host_ there will be a corresponding syslog with the following contents:

...
2013-07-10 11:06:10,986 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:11,989 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
...
2013-07-10 11:06:20,013 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: slave-2-host/127.0.0.1:11812. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2013-07-10 11:06:20,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From slave-2-host/127.0.0.1 to slave-2-host:11812 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:782)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:729)
        at org.apache.hadoop.ipc.Client.call(Client.java:1229)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:225)
        at com.sun.proxy.$Proxy6.getTask(Unknown Source)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:131)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:708)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:528)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:492)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:499)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:593)
        at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:241)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1278)
        at org.apache.hadoop.ipc.Client.call(Client.java:1196)
        ... 3 more

Notice several things:

1. This exception always happens on a different host than the one the ApplicationMaster runs on.
2. It always tries to connect to localhost, not to another host in the cluster.
3. The port number (11812 in this case) is different every time.

My questions are:

1. I assume it is the task (container) that tries to establish this connection, but what is it trying to connect to? (See the P.S. below for a small connect probe sketch.)
2. Why does this error happen, and how can I fix it?

Any suggestions are welcome.

Thanks,
Andrei
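P.S. In case it helps with question 1: below is a tiny stand-alone probe that mimics what the failed container's syslog shows -- resolve the hostname, then retry the connect with a fixed 1-second sleep, up to 10 times, like RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS). It is only a sketch, not Hadoop code; ConnectProbe is a made-up name, and the host/port arguments are placeholders for whatever shows up in a failed attempt's log (e.g. slave-2-host 11812).

import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectProbe {
    public static void main(String[] args) throws Exception {
        String host = args[0];                // e.g. slave-2-host
        int port = Integer.parseInt(args[1]); // e.g. 11812 (changes every attempt)

        // Check what the hostname resolves to; the container log shows
        // slave-2-host/127.0.0.1, i.e. it resolves to localhost.
        InetAddress addr = InetAddress.getByName(host);
        System.out.println(host + " resolves to " + addr.getHostAddress());

        // Retry the connect the same way the IPC client does:
        // up to 10 attempts with a fixed 1-second sleep between them.
        for (int attempt = 0; attempt < 10; attempt++) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(addr, port), 1000);
                System.out.println("Connected on attempt " + attempt);
                return;
            } catch (IOException e) {
                System.out.println("Attempt " + attempt + " failed: " + e.getMessage());
                Thread.sleep(1000);
            }
        }
        System.out.println("Gave up after 10 attempts, same as the container.");
    }
}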