hadoop-mapreduce-user mailing list archives

From Hemanth Yamijala <yhema...@gmail.com>
Subject Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Date Fri, 25 Jun 2010 04:40:55 GMT

> I've been getting the following error when trying to run a very simple
> MapReduce job.
> Map finishes without problem, but error occurs as soon as it enters
> Reduce phase.
> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> attempt_201006241812_0001_r_000000_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> I am running a 5 node cluster and I believe I have all my settings correct:
> * ulimit -n 32768
> * DNS/RDNS configured properly
> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> * mapred-site.xml : http://pastebin.com/JraVQZcW
> The program is very simple - just counts a unique string in a log file.
> See here: http://pastebin.com/5uRG3SFL
> When I run, the job fails and I get the following output.
> http://pastebin.com/AhW6StEb
> However, runs fine when I do *not* use substring() on the value (see
> map function in code above).
> This runs fine and completes successfully:
>            String str = val.toString();
> This causes error and fails:
>            String str = val.toString().substring(0,10);
> Please let me know if you need any further information.
> It would be greatly appreciated if anyone could shed some light on this problem.

It is interesting that switching to a substring makes a difference.
Assuming the behaviour is consistent and not a red herring, can you
compare the counters for the two jobs in the JobTracker web UI (map
records, bytes, etc.) and see whether there is a noticeable
difference? Also, are the two programs being run against the exact
same input data?
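
To get a feel for what the counters might show, here is a minimal
sketch in plain Java (no Hadoop classes) that mimics the two map
variants on a few made-up log lines and reports records, key bytes and
distinct keys for each. The class name KeyStats, the helper safePrefix
and the sample lines are all hypothetical and only for illustration;
they are not taken from your job. One thing I am sure of is that
substring(0,10) throws StringIndexOutOfBoundsException on a string
shorter than 10 characters, so the sketch also counts such lines.
Whether that has anything to do with the fetch failures is only a
guess.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeyStats {

    // Hypothetical helper: first n characters, or the whole string if it
    // is shorter, so that short lines do not throw.
    static String safePrefix(String s, int n) {
        return s.length() <= n ? s : s.substring(0, n);
    }

    // Returns {records, total key bytes (UTF-8), distinct keys}.
    // A negative prefixLen means "use the whole line as the key".
    static long[] stats(List<String> lines, int prefixLen) {
        long bytes = 0;
        Set<String> distinct = new HashSet<>();
        for (String line : lines) {
            String key = prefixLen < 0 ? line : safePrefix(line, prefixLen);
            bytes += key.getBytes(StandardCharsets.UTF_8).length;
            distinct.add(key);
        }
        return new long[] { lines.size(), bytes, distinct.size() };
    }

    // Lines on which an unguarded substring(0, n) would throw.
    static long countShorterThan(List<String> lines, int n) {
        long count = 0;
        for (String line : lines) {
            if (line.length() < n) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample input, standing in for the real log file.
        List<String> sample = Arrays.asList(
                "2010-06-24 18:41:00 INFO start",
                "2010-06-24 18:41:05 INFO stop",
                "short");

        long[] full = stats(sample, -1);
        long[] prefix = stats(sample, 10);

        System.out.println("full line: records=" + full[0]
                + " keyBytes=" + full[1] + " distinctKeys=" + full[2]);
        System.out.println("prefix(10): records=" + prefix[0]
                + " keyBytes=" + prefix[1] + " distinctKeys=" + prefix[2]);
        System.out.println("lines shorter than 10 chars: "
                + countShorterThan(sample, 10));
    }
}
```

If the real counters differ between the two jobs in a similar way
(same record count, but far fewer bytes and distinct keys with the
substring), that at least tells us the two jobs are seeing the same
input.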

Since the cluster is small, you could also look at the tasktracker
logs on the machines where the maps ran and check whether any failures
are logged around the time the reduce attempts start failing.
