hadoop-hdfs-user mailing list archives

From bmdevelopment <bmdevelopm...@gmail.com>
Subject Fwd: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Date Mon, 28 Jun 2010 13:09:22 GMT
Hi, sorry for the cross-post, but I am trying to see if anyone else
has had this issue before.

---------- Forwarded message ----------
From: bmdevelopment <bmdevelopment@gmail.com>
Date: Fri, Jun 25, 2010 at 10:56 AM
Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
To: mapreduce-user@hadoop.apache.org

Thanks so much for the reply.
See inline.

On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yhemanth@gmail.com> wrote:
> Hi,
>> I've been getting the following error when trying to run a very simple
>> MapReduce job.
>> Map finishes without problem, but error occurs as soon as it enters
>> Reduce phase.
>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> I am running a 5 node cluster and I believe I have all my settings correct:
>> * ulimit -n 32768
>> * DNS/RDNS configured properly
>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>> The program is very simple - just counts a unique string in a log file.
>> See here: http://pastebin.com/5uRG3SFL
>> When I run, the job fails and I get the following output.
>> http://pastebin.com/AhW6StEb
>> However, runs fine when I do *not* use substring() on the value (see
>> map function in code above).
>> This runs fine and completes successfully:
>>            String str = val.toString();
>> This causes error and fails:
>>            String str = val.toString().substring(0,10);
>> Please let me know if you need any further information.
>> It would be greatly appreciated if anyone could shed some light on this problem.
> It catches attention that changing the code to use a substring is
> causing a difference. Assuming it is consistent and not a red herring,

Yes, this has been consistent over the last week. I was running 0.20.1
first and then upgraded to 0.20.2, but the results have been exactly
the same.

> can you look at the counters for the two jobs using the JobTracker web
> UI - things like map records, bytes etc and see if there is a
> noticeable difference ?

Ok, so here is the first job using write.set(value.toString()); having
*no* errors:

And here is the second job using
write.set(value.toString().substring(0, 10)); that fails:

And here is yet another job, where I used a longer, and therefore unique,
string via write.set(value.toString().substring(0, 20)); This makes every
line unique, similar to the first job.
It still fails.

>Also, are the two programs being run against
> the exact same input data ?

Yes, exactly the same input: a single csv file with 23K lines.
Using a shorter string leads to more identical keys and therefore more
combining/reducing, but going by the above, the job seems to fail whether
the substring/key is entirely unique (23000 combine output records) or
mostly the same (9 combine output records).
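
To make the key-count comparison above easier to reproduce outside Hadoop, here is a minimal sketch of how the prefix length changes the number of distinct keys. It is not the job code from the pastebin: the class name PrefixKeyCheck, the helpers safePrefix and distinctKeys, and the sample lines are all made up for illustration. The length guard is an assumption on my side as well; it only matters if some input lines are shorter than the requested prefix, in which case a plain substring(0, n) would throw.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the posted job.
class PrefixKeyCheck {

    // Returns the first n characters of s, or all of s when it is shorter than n.
    // A plain s.substring(0, n) throws StringIndexOutOfBoundsException in that case.
    static String safePrefix(String s, int n) {
        return s.substring(0, Math.min(n, s.length()));
    }

    // Counts how many distinct keys a map step would emit for a given prefix length.
    static int distinctKeys(List<String> lines, int n) {
        Set<String> keys = new HashSet<>();
        for (String line : lines) {
            keys.add(safePrefix(line, n));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the real CSV input.
        List<String> lines = Arrays.asList(
            "2010-06-24,host1,GET",
            "2010-06-24,host2,GET",
            "2010-06-25,host1,POST",
            "short");
        System.out.println("prefix 10: " + distinctKeys(lines, 10) + " distinct keys");
        System.out.println("prefix 20: " + distinctKeys(lines, 20) + " distinct keys");
    }
}
```

Running the same check over the real input with prefix lengths 10 and 20 should line up with the combine output record counters, which would at least confirm that the key distribution is what I think it is.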

> Also, since the cluster size is small, you could also look at the
> tasktracker logs on the machines where the maps have run to see if
> there are any failures when the reduce attempts start failing.

Here is the TT log from the last failed job. I do not see anything
besides the shuffle failure, but there
may be something I am overlooking or simply do not understand.

Thanks again!

> Thanks
> Hemanth
