From: Tarandeep Singh
To: core-user@hadoop.apache.org
Date: Mon, 30 Jun 2008 14:07:57 -0700
Subject: Re: Too many fetch failures AND Shuffle error

I am getting this error as well. As Sayali mentioned in his mail, I updated
the /etc/hosts file with the slave machines' IP addresses, but I am still
getting this error.

Amar, which URL were you talking about in your mail -
"There will be a URL associated with a map that the reducer tries to fetch
(check the reducer logs for this url)"

Please tell me where I should look for it... I will try to access it
manually to see if this error is due to a firewall.

Thanks,
Taran
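The URL in question is served by the tasktracker's embedded web server on the
node that ran the map. A minimal sketch of the manual check, assuming the
0.16-era default HTTP port 50060 and the mapOutput servlet path; the host
name and task id below are placeholders - the exact URL should be copied from
the failing reducer's log instead.

import java.net.HttpURLConnection;
import java.net.URL;

// Probe a tasktracker's map-output URL from the reducer's machine.
// A timeout or "connection refused" here points at a firewall or a wrong
// /etc/hosts entry rather than at the job itself.
public class FetchCheck {
    public static void main(String[] args) {
        // Placeholder host and task id - copy the exact URL from the reducer log.
        String url = "http://slave-host:50060/mapOutput"
                   + "?map=task_200806201106_0001_m_000003_0&reduce=0";
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            System.out.println(url + " -> HTTP " + conn.getResponseCode());
        } catch (Exception e) {
            System.out.println(url + " -> FAILED: " + e);
        }
    }
}

If the request succeeds from the map's own node but times out from the
reducer's node, that matches the restricted-access explanation below.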
On Thu, Jun 19, 2008 at 11:43 PM, Amar Kamat wrote:
> Yeah. With 2 nodes the reducers will go up to 16% because the reducers are
> able to fetch maps from the same machine (locally) but fail to copy them
> from the remote machine. A common reason in such cases is *restricted
> machine access* (a firewall, etc.). The web server on a machine/node hosts
> the map outputs, which the reducers on the other machine are not able to
> access. There will be a URL associated with a map that the reducer tries to
> fetch (check the reducer logs for this URL). Just try accessing it manually
> from the reducer's machine/node. Most likely this experiment will also
> fail. Let us know if this is not the case.
> Amar
>
> Sayali Kulkarni wrote:
>
>>> Can you post the reducer logs. How many nodes are there in the cluster?
>>
>> There are 6 nodes in the cluster - 1 master and 5 slaves.
>> I tried to reduce the number of nodes, and found that the problem is
>> solved only if there is a single node in the cluster. So I can deduce
>> that the problem lies somewhere in the configuration.
>>
>> Configuration file:
>>
>> <configuration>
>>
>>   <property>
>>     <name>hadoop.tmp.dir</name>
>>     <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
>>     <description>A base for other temporary directories.</description>
>>   </property>
>>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://10.105.41.25:54310</value>
>>     <description>The name of the default file system. A URI whose
>>     scheme and authority determine the FileSystem implementation. The
>>     uri's scheme determines the config property (fs.SCHEME.impl) naming
>>     the FileSystem implementation class. The uri's authority is used to
>>     determine the host, port, etc. for a filesystem.</description>
>>   </property>
>>
>>   <property>
>>     <name>mapred.job.tracker</name>
>>     <value>10.105.41.25:54311</value>
>>     <description>The host and port that the MapReduce job tracker runs
>>     at. If "local", then jobs are run in-process as a single map
>>     and reduce task.</description>
>>   </property>
>>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>2</value>
>>     <description>Default block replication. The actual number of
>>     replications can be specified when the file is created. The default
>>     is used if replication is not specified at create time.</description>
>>   </property>
>>
>>   <property>
>>     <name>mapred.child.java.opts</name>
>>     <value>-Xmx1048M</value>
>>   </property>
>>
>>   <property>
>>     <name>mapred.local.dir</name>
>>     <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
>>   </property>
>>
>>   <property>
>>     <name>mapred.map.tasks</name>
>>     <value>53</value>
>>     <description>The default number of map tasks per job. Typically set
>>     to a prime several times greater than the number of available hosts.
>>     Ignored when mapred.job.tracker is "local".</description>
>>   </property>
>>
>>   <property>
>>     <name>mapred.reduce.tasks</name>
>>     <value>7</value>
>>     <description>The default number of reduce tasks per job. Typically set
>>     to a prime close to the number of available hosts. Ignored when
>>     mapred.job.tracker is "local".</description>
>>   </property>
>>
>> </configuration>
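Note that the shuffle does not go through the jobtracker address in the
configuration above: each reducer pulls map output directly from the other
slaves' tasktracker web servers, so every node must be able to resolve and
connect to every other node. A minimal sketch of such a pairwise check, to be
run on each node; the slave names are hypothetical and the tasktracker HTTP
port 50060 is an assumed default.

import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

// From one node, check that every other slave's host name resolves
// (via /etc/hosts or DNS) and that its tasktracker web port accepts
// connections. Repeat the same check from each node in the cluster.
public class SlaveReachability {
    public static void main(String[] args) {
        // Hypothetical slave names - substitute the real entries from conf/slaves.
        String[] slaves = { "slave1", "slave2", "slave3", "slave4", "slave5" };
        int httpPort = 50060; // assumed tasktracker HTTP port
        for (String host : slaves) {
            try {
                InetAddress addr = InetAddress.getByName(host);
                Socket s = new Socket();
                s.connect(new InetSocketAddress(addr, httpPort), 5000);
                s.close();
                System.out.println(host + " -> " + addr.getHostAddress() + " : reachable");
            } catch (Exception e) {
                System.out.println(host + " : FAILED (" + e + ")");
            }
        }
    }
}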
>> ============
>> This is the output that I get when running the tasks with 2 nodes in the
>> cluster:
>>
>> 08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to process : 1
>> 08/06/20 11:07:45 INFO mapred.JobClient: Running job: job_200806201106_0001
>> 08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
>> 08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
>> 08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
>> 08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
>> 08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
>> 08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
>> 08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
>> 08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
>> 08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
>> 08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
>> 08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
>> 08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
>> 08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
>> 08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
>> 08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
>> 08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
>> 08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
>> 08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
>> 08/06/20 11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/06/20 11:08:30 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:08:33 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/06/20 11:08:35 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:08:38 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/06/20 11:09:54 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:09:54 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000002_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:09:56 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000011_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:09:57 INFO mapred.JobClient:  map 95% reduce 9%
>> 08/06/20 11:09:59 INFO mapred.JobClient:  map 100% reduce 9%
>> 08/06/20 11:10:04 INFO mapred.JobClient:  map 100% reduce 10%
>> 08/06/20 11:10:07 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:10:09 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:12 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:10:14 INFO mapred.JobClient:  map 100% reduce 15%
>> 08/06/20 11:10:17 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:10:24 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000000_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:10:29 INFO mapred.JobClient:  map 100% reduce 11%
>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000001_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000003_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:10:32 INFO mapred.JobClient:  map 100% reduce 12%
>> 08/06/20 11:10:37 INFO mapred.JobClient:  map 100% reduce 13%
>> 08/06/20 11:10:42 INFO mapred.JobClient:  map 100% reduce 14%
>> 08/06/20 11:10:47 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:10:52 INFO mapred.JobClient:  map 95% reduce 16%
>> 08/06/20 11:10:52 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000020_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:10:54 INFO mapred.JobClient:  map 100% reduce 16%
>> 08/06/20 11:11:02 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000017_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:09 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:24 INFO mapred.JobClient:  map 95% reduce 17%
>> 08/06/20 11:11:24 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000007_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:27 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:32 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000012_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:34 INFO mapred.JobClient:  map 95% reduce 17%
>> 08/06/20 11:11:34 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:39 INFO mapred.JobClient:  map 91% reduce 18%
>> 08/06/20 11:11:39 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000002_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:41 INFO mapred.JobClient:  map 95% reduce 18%
>> 08/06/20 11:11:42 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:11:42 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000006_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:11:44 INFO mapred.JobClient:  map 100% reduce 17%
>> 08/06/20 11:11:44 INFO mapred.JobClient: Task Id : task_200806201106_0001_r_000003_1, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/06/20 11:11:51 INFO mapred.JobClient:  map 100% reduce 18%
>> 08/06/20 11:11:54 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:11:59 INFO mapred.JobClient:  map 95% reduce 19%
>> 08/06/20 11:11:59 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000010_0, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:12:02 INFO mapred.JobClient:  map 100% reduce 19%
>> 08/06/20 11:12:07 INFO mapred.JobClient:  map 100% reduce 20%
>> 08/06/20 11:12:08 INFO mapred.JobClient:  map 100% reduce 33%
>> 08/06/20 11:12:09 INFO mapred.JobClient:  map 100% reduce 47%
>> 08/06/20 11:12:11 INFO mapred.JobClient:  map 100% reduce 60%
>> 08/06/20 11:12:16 INFO mapred.JobClient:  map 100% reduce 62%
>> 08/06/20 11:12:24 INFO mapred.JobClient:  map 100% reduce 63%
>> 08/06/20 11:12:26 INFO mapred.JobClient:  map 100% reduce 64%
>> 08/06/20 11:12:31 INFO mapred.JobClient:  map 100% reduce 65%
>> 08/06/20 11:12:31 INFO mapred.JobClient: Task Id : task_200806201106_0001_m_000019_1, Status : FAILED
>> Too many fetch-failures
>> 08/06/20 11:12:36 INFO mapred.JobClient:  map 100% reduce 66%
>> 08/06/20 11:12:38 INFO mapred.JobClient:  map 100% reduce 67%
>> 08/06/20 11:12:39 INFO mapred.JobClient:  map 100% reduce 80%
>>
>> ===============
>>
>>> Are you seeing this for all the maps and reducers?
>>
>> Yes, this happens on all the maps and reducers.
>> I tried to keep just 2 nodes in the cluster, but the problem still exists.
>>
>>> Are the reducers progressing at all?
>>
>> The reducers continue to execute up to a certain point, but after that
>> they just do not proceed at all. They just stop at an average of 16%.
>>
>>> Are all the maps that the reducer is failing to fetch from a remote machine?
>>
>> Yes.
>>
>>> Are all the failed maps/reducers from the same machine?
>>
>> All the maps and reducers are failing anyway.
>>
>> Thanks for the help in advance,
>>
>> Regards,
>> Sayali
>>
>> ---------------------------------
>> Sent from Yahoo! Mail.
>> A Smarter Email.