hadoop-mapreduce-user mailing list archives

From rajesh putta <rajesh.p...@gmail.com>
Subject Re: Too many fetch-failures
Date Tue, 19 Jul 2011 04:54:43 GMT
Hi,

If this is the problem you are hitting, I think the solution below will help
you. I am sharing the experience I had with the same issue earlier.

Issue:

While running a job, the maps complete properly. However, when the reduce
phase begins, it progresses up to some percentage, but then, while copying
the map outputs from the other machines during the 'shuffle' phase, it
throws an exception saying there was a shuffle error because the connection
was refused.

In Brief

The reduce tasks have to collect the output of the map tasks before they can
sort that output and run your reduce class. This is called the FETCH.

The JobTracker passes the hostnames of the machines that ran the map tasks
to the reduce task. These hostnames must resolve to the correct IP address
of the machine that ran the map task, and the reduce task must be able to
connect to that machine on a port (by default the TaskTracker's HTTP port)
to request the data stream.
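As a quick sanity check of that requirement, you can run something like the
sketch below from a reduce node. This is just an illustration, not part of
Hadoop: the hostnames you pass in are your own, and the port defaults to
50060, the TaskTracker HTTP port that appears in the log URLs further down
in this thread.

```python
import socket

def check_fetch_endpoints(hosts, port=50060, timeout=2.0):
    """For each host, return (resolved IP or None, whether a TCP
    connection to the given port succeeded)."""
    results = {}
    for host in hosts:
        try:
            ip = socket.gethostbyname(host)
        except socket.gaierror:
            # Hostname does not resolve at all -- the reducer's fetch
            # would fail before it could even open a connection.
            results[host] = (None, False)
            continue
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                reachable = True
        except OSError:
            # Resolves, but the port is refused or unreachable -- the
            # same symptom as the "connection was refused" shuffle error.
            reachable = False
        results[host] = (ip, reachable)
    return results
```

A host that resolves to None, or resolves but is not reachable on the
TaskTracker port, is a likely culprit for the fetch failures.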

From Log File:

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
    at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.checkReducerHealth(ShuffleScheduler.java:253)
    at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.copyFailed(ShuffleScheduler.java:187)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:227)
    at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)

Fix Or Solution:

Add an entry for every hostname in the cluster to the /etc/hosts file on
each node, so that each hostname resolves to the correct IP address.
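For example, assuming a hypothetical three-node cluster (the hostnames and
addresses below are placeholders, not taken from this thread), every node's
/etc/hosts would carry the same entries:

```
192.168.1.10   master
192.168.1.11   slave1
192.168.1.12   slave2
```

One common pitfall is a distribution-installed line that maps the machine's
own hostname to 127.0.0.1; that makes map output URLs resolve to localhost
on remote reducers, so check that each hostname maps to the node's real
cluster-facing address.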
Thanks & Regards
Rajesh Putta
M Tech CSE
IIIT-H

On Tue, Jul 19, 2011 at 4:30 AM, Arun C Murthy <acm@hortonworks.com> wrote:

>
> On Jul 18, 2011, at 3:02 PM, Geoffry Roberts wrote:
>
> > All,
> >
> > I am getting the following errors during my MR jobs (see below).
> Ultimately the jobs finish well enough, but these errors do slow things
> down.  I've done some reading and I understand that this is all caused by
> failures in my network.  Is there a way of determining which node(s) in my
> cluster are causing the problem?
> >
>
> The TT running on 'localhost' ran attempt_201107180916_0030_m_000003_0
> whose output couldn't be fetched. Take a look at the TT logs and see what
> you find.
>
> Arun
>
>
>
> > Thanks
> >
> > 11/07/18 14:53:06 INFO mapreduce.Job:  map 99% reduce 28%
> > 11/07/18 14:53:10 INFO mapreduce.Job:  map 100% reduce 28%
> > 11/07/18 14:53:15 INFO mapreduce.Job: Task Id :
> attempt_201107180916_0030_m_000003_0, Status : FAILED
> > Too many fetch-failures
> > 11/07/18 14:53:15 WARN mapreduce.Job: Error reading task
> outputhttp://localhost:50060/tasklog?plaintext=true&attemptid=attempt_201107180916_0030_m_000003_0&filter=stdout
> > 11/07/18 14:53:15 WARN mapreduce.Job: Error reading task
> outputhttp://localhost:50060/tasklog?plaintext=true&attemptid=attempt_201107180916_0030_m_000003_0&filter=stderr
> > 11/07/18 14:53:17 INFO mapreduce.Job:  map 100% reduce 29%
> > 11/07/18 14:53:19 INFO mapreduce.Job:  map 96% reduce 29%
> > 11/07/18 14:53:25 INFO mapreduce.Job:  map 98% reduce 29%
> >
> >
> > --
> > Geoffry Roberts
> >
>
>
