hadoop-common-user mailing list archives

From Amar Kamat <ama...@yahoo-inc.com>
Subject Re: Too many fetch failures AND Shuffle error
Date Tue, 01 Jul 2008 05:15:15 GMT
Tarandeep Singh wrote:
> I am getting this error as well.
> As Sayali mentioned in his mail, I updated the /etc/hosts file with the
> slave machines' IP addresses, but I am still getting this error.
>
> Amar, which is the URL that you were talking about in your mail -
> "There will be a URL associated with a map that the reducer tries to fetch
> (check the reducer logs for this URL)"
>
> Please tell me where I should look for it... I will try to access it
> manually to see whether this error is due to a firewall.
>   
One thing you can do is check whether all the maps whose fetches failed ran 
on a remote host. Look at the web UI to find out where each map task 
finished, and look at the reduce task logs to find out which map fetches 
failed.
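As a quick sketch (the paths and the task id here are assumptions for a 
0.16-style setup, so adjust them to your installation), you can grep the 
reduce attempt's own log on the node that ran it:

  # hypothetical locations - substitute your log dir and reduce attempt id
  cd $HADOOP_HOME/logs/userlogs/task_200806201106_0001_r_000002_0
  grep -i fetch syslog

The lines mentioning fetches should tell you which map outputs, and on which 
hosts, the copy failed for.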

I am not sure if the reduce task logs contain the URL itself. You can 
construct it yourself:
port = tasktracker.http.port (this is set through the conf)
tthost = the destination tasktracker's hostname, i.e. the tasktracker from 
which the map output needs to be fetched
jobid = the complete job id ("job_....")
mapid = the task attempt id ("attempt_..."; on 0.16 these look like 
"task_..._m_...") of the attempt that successfully completed the map
reduce-partition-id = the partition number of the reduce task; 
task_..._r_$i_$j has reduce-partition-id int-value($i)

url = 
http://'$tthost':'$port'/mapOutput?job='$jobid'&map='$mapid'&reduce='$reduce-partition-id'
where each '$var' is what you have to substitute.
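Putting those pieces together on the reducer's node might look something like 
this sketch (the host, port and ids here are illustrative guesses - 50060 is 
only the usual default tasktracker HTTP port, and the ids are copied from the 
job output quoted below):

  # all values below are assumptions; substitute your own
  tthost=slave1                          # tasktracker hosting the map output
  port=50060                             # whatever tasktracker.http.port is
  jobid=job_200806201106_0001
  mapid=task_200806201106_0001_m_000011_0
  reduce=2                               # the $i in task_..._r_000002_0

  wget -O /dev/null "http://$tthost:$port/mapOutput?job=$jobid&map=$mapid&reduce=$reduce"

If this hangs or the connection is refused when run from the reducer's 
machine, a firewall blocking that port is the likely culprit.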
Amar
> Thanks,
> Taran
>
> On Thu, Jun 19, 2008 at 11:43 PM, Amar Kamat <amarrk@yahoo-inc.com> wrote:
>
>   
>> Yeah. With 2 nodes the reducers will go up to 16% because the reducers are
>> able to fetch map outputs from the same machine (locally) but fail to copy
>> them from the remote machine. A common reason in such cases is *restricted
>> machine access* (a firewall etc). The web-server on a machine/node hosts map
>> outputs which the reducers on the other machine are not able to access. There
>> will be a URL associated with a map that the reducer tries to fetch (check the
>> reducer logs for this URL). Just try accessing it manually from the
>> reducer's machine/node. Most likely this experiment will also fail. Let us
>> know if this is not the case.
>> Amar
>>
>> Sayali Kulkarni wrote:
>>
>>     
>>> Can you post the reducer logs. How many nodes are there in the cluster?
>>>       
>>>>         
>>> There are 6 nodes in the cluster - 1 master and 5 slaves
>>> I tried reducing the number of nodes, and found that the problem goes away
>>> only if there is a single node in the cluster. So I can deduce that the
>>> problem lies somewhere in the configuration.
>>>
>>> Configuration file:
>>> <?xml version="1.0"?>
>>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>>
>>> <!-- Put site-specific property overrides in this file. -->
>>>
>>> <configuration>
>>>
>>> <property>
>>>  <name>hadoop.tmp.dir</name>
>>>  <value>/extra/HADOOP/hadoop-0.16.3/tmp/dir/hadoop-${user.name}</value>
>>>  <description>A base for other temporary directories.</description>
>>> </property>
>>>
>>> <property>
>>>  <name>fs.default.name</name>
>>>  <value>hdfs://10.105.41.25:54310</value>
>>>  <description>The name of the default file system.  A URI whose
>>>  scheme and authority determine the FileSystem implementation.  The
>>>  uri's scheme determines the config property (fs.SCHEME.impl) naming
>>>  the FileSystem implementation class.  The uri's authority is used to
>>>  determine the host, port, etc. for a filesystem.</description>
>>> </property>
>>>
>>> <property>
>>>  <name>mapred.job.tracker</name>
>>>  <value>10.105.41.25:54311</value>
>>>  <description>The host and port that the MapReduce job tracker runs
>>>  at.  If "local", then jobs are run in-process as a single map
>>>  and reduce task.
>>>  </description>
>>> </property>
>>>
>>> <property>
>>>  <name>dfs.replication</name>
>>>  <value>2</value>
>>>  <description>Default block replication.
>>>  The actual number of replications can be specified when the file is
>>> created.
>>>  The default is used if replication is not specified in create time.
>>>  </description>
>>> </property>
>>>
>>>
>>> <property>
>>>  <name>mapred.child.java.opts</name>
>>>  <value>-Xmx1048M</value>
>>> </property>
>>>
>>> <property>
>>>        <name>mapred.local.dir</name>
>>>        <value>/extra/HADOOP/hadoop-0.16.3/tmp/mapred</value>
>>> </property>
>>>
>>> <property>
>>>  <name>mapred.map.tasks</name>
>>>  <value>53</value>
>>>  <description>The default number of map tasks per job.  Typically set
>>>  to a prime several times greater than number of available hosts.
>>>  Ignored when mapred.job.tracker is "local".
>>>  </description>
>>> </property>
>>>
>>> <property>
>>>  <name>mapred.reduce.tasks</name>
>>>  <value>7</value>
>>>  <description>The default number of reduce tasks per job.  Typically set
>>>  to a prime close to the number of available hosts.  Ignored when
>>>  mapred.job.tracker is "local".
>>>  </description>
>>> </property>
>>>
>>> </configuration>
>>>
>>>
>>> ============
>>> This is the output that I get when running the tasks with 2 nodes in the
>>> cluster:
>>>
>>> 08/06/20 11:07:45 INFO mapred.FileInputFormat: Total input paths to
>>> process : 1
>>> 08/06/20 11:07:45 INFO mapred.JobClient: Running job:
>>> job_200806201106_0001
>>> 08/06/20 11:07:46 INFO mapred.JobClient:  map 0% reduce 0%
>>> 08/06/20 11:07:53 INFO mapred.JobClient:  map 8% reduce 0%
>>> 08/06/20 11:07:55 INFO mapred.JobClient:  map 17% reduce 0%
>>> 08/06/20 11:07:57 INFO mapred.JobClient:  map 26% reduce 0%
>>> 08/06/20 11:08:00 INFO mapred.JobClient:  map 34% reduce 0%
>>> 08/06/20 11:08:01 INFO mapred.JobClient:  map 43% reduce 0%
>>> 08/06/20 11:08:04 INFO mapred.JobClient:  map 47% reduce 0%
>>> 08/06/20 11:08:05 INFO mapred.JobClient:  map 52% reduce 0%
>>> 08/06/20 11:08:08 INFO mapred.JobClient:  map 60% reduce 0%
>>> 08/06/20 11:08:09 INFO mapred.JobClient:  map 69% reduce 0%
>>> 08/06/20 11:08:10 INFO mapred.JobClient:  map 73% reduce 0%
>>> 08/06/20 11:08:12 INFO mapred.JobClient:  map 78% reduce 0%
>>> 08/06/20 11:08:13 INFO mapred.JobClient:  map 82% reduce 0%
>>> 08/06/20 11:08:15 INFO mapred.JobClient:  map 91% reduce 1%
>>> 08/06/20 11:08:16 INFO mapred.JobClient:  map 95% reduce 1%
>>> 08/06/20 11:08:18 INFO mapred.JobClient:  map 99% reduce 3%
>>> 08/06/20 11:08:23 INFO mapred.JobClient:  map 100% reduce 3%
>>> 08/06/20 11:08:25 INFO mapred.JobClient:  map 100% reduce 7%
>>> 08/06/20 11:08:28 INFO mapred.JobClient:  map 100% reduce 10%
>>> 08/06/20 11:08:30 INFO mapred.JobClient:  map 100% reduce 11%
>>> 08/06/20 11:08:33 INFO mapred.JobClient:  map 100% reduce 12%
>>> 08/06/20 11:08:35 INFO mapred.JobClient:  map 100% reduce 14%
>>> 08/06/20 11:08:38 INFO mapred.JobClient:  map 100% reduce 15%
>>> 08/06/20 11:09:54 INFO mapred.JobClient:  map 100% reduce 13%
>>> 08/06/20 11:09:54 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_r_000002_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/06/20 11:09:56 INFO mapred.JobClient:  map 100% reduce 9%
>>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_r_000003_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/06/20 11:09:56 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000011_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:09:57 INFO mapred.JobClient:  map 95% reduce 9%
>>> 08/06/20 11:09:59 INFO mapred.JobClient:  map 100% reduce 9%
>>> 08/06/20 11:10:04 INFO mapred.JobClient:  map 100% reduce 10%
>>> 08/06/20 11:10:07 INFO mapred.JobClient:  map 100% reduce 11%
>>> 08/06/20 11:10:09 INFO mapred.JobClient:  map 100% reduce 13%
>>> 08/06/20 11:10:12 INFO mapred.JobClient:  map 100% reduce 14%
>>> 08/06/20 11:10:14 INFO mapred.JobClient:  map 100% reduce 15%
>>> 08/06/20 11:10:17 INFO mapred.JobClient:  map 100% reduce 16%
>>> 08/06/20 11:10:24 INFO mapred.JobClient:  map 100% reduce 13%
>>> 08/06/20 11:10:24 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_r_000000_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/06/20 11:10:29 INFO mapred.JobClient:  map 100% reduce 11%
>>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_r_000001_0, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/06/20 11:10:29 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000003_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:10:32 INFO mapred.JobClient:  map 100% reduce 12%
>>> 08/06/20 11:10:37 INFO mapred.JobClient:  map 100% reduce 13%
>>> 08/06/20 11:10:42 INFO mapred.JobClient:  map 100% reduce 14%
>>> 08/06/20 11:10:47 INFO mapred.JobClient:  map 100% reduce 16%
>>> 08/06/20 11:10:52 INFO mapred.JobClient:  map 95% reduce 16%
>>> 08/06/20 11:10:52 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000020_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:10:54 INFO mapred.JobClient:  map 100% reduce 16%
>>> 08/06/20 11:11:02 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000017_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:09 INFO mapred.JobClient:  map 100% reduce 17%
>>> 08/06/20 11:11:24 INFO mapred.JobClient:  map 95% reduce 17%
>>> 08/06/20 11:11:24 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000007_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:27 INFO mapred.JobClient:  map 100% reduce 17%
>>> 08/06/20 11:11:32 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000012_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:34 INFO mapred.JobClient:  map 95% reduce 17%
>>> 08/06/20 11:11:34 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000019_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:39 INFO mapred.JobClient:  map 91% reduce 18%
>>> 08/06/20 11:11:39 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000002_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:41 INFO mapred.JobClient:  map 95% reduce 18%
>>> 08/06/20 11:11:42 INFO mapred.JobClient:  map 100% reduce 19%
>>> 08/06/20 11:11:42 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000006_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:11:44 INFO mapred.JobClient:  map 100% reduce 17%
>>> 08/06/20 11:11:44 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_r_000003_1, Status : FAILED
>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>> 08/06/20 11:11:51 INFO mapred.JobClient:  map 100% reduce 18%
>>> 08/06/20 11:11:54 INFO mapred.JobClient:  map 100% reduce 19%
>>> 08/06/20 11:11:59 INFO mapred.JobClient:  map 95% reduce 19%
>>> 08/06/20 11:11:59 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000010_0, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:12:02 INFO mapred.JobClient:  map 100% reduce 19%
>>> 08/06/20 11:12:07 INFO mapred.JobClient:  map 100% reduce 20%
>>> 08/06/20 11:12:08 INFO mapred.JobClient:  map 100% reduce 33%
>>> 08/06/20 11:12:09 INFO mapred.JobClient:  map 100% reduce 47%
>>> 08/06/20 11:12:11 INFO mapred.JobClient:  map 100% reduce 60%
>>> 08/06/20 11:12:16 INFO mapred.JobClient:  map 100% reduce 62%
>>> 08/06/20 11:12:24 INFO mapred.JobClient:  map 100% reduce 63%
>>> 08/06/20 11:12:26 INFO mapred.JobClient:  map 100% reduce 64%
>>> 08/06/20 11:12:31 INFO mapred.JobClient:  map 100% reduce 65%
>>> 08/06/20 11:12:31 INFO mapred.JobClient: Task Id :
>>> task_200806201106_0001_m_000019_1, Status : FAILED
>>> Too many fetch-failures
>>> 08/06/20 11:12:36 INFO mapred.JobClient:  map 100% reduce 66%
>>> 08/06/20 11:12:38 INFO mapred.JobClient:  map 100% reduce 67%
>>> 08/06/20 11:12:39 INFO mapred.JobClient:  map 100% reduce 80%
>>>
>>> ===============
>>>
>>>
>>>
>>>> Are you seeing this for all the maps and reducers?
>>>>
>>> Yes, this happens for all the maps and reducers. I tried keeping just 2
>>> nodes in the cluster, but the problem still exists.
>>>
>>>
>>>
>>>> Are the reducers progressing at all?
>>>>
>>> The reducers continue to execute up to a certain point, but after that they
>>> just do not proceed at all. They stall at an average of 16%.
>>>
>>>
>>>> Are all the maps that the reducer fails to fetch on a remote machine?
>>>>
>>> Yes.
>>>
>>>
>>>
>>>> Are all the failed maps/reducers from the same machine?
>>>>
>>> All the maps and reducers are failing anyway.
>>> Thanks in advance for the help,
>>>
>>> Regards,
>>> Sayali
>>>
>>>
>>>
>>>       
>>     
>
>   

