hadoop-mapreduce-user mailing list archives

From Hemanth Yamijala <yhema...@gmail.com>
Subject Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Date Tue, 06 Jul 2010 08:34:55 GMT

Sorry, I couldn't take a close look at the logs until now.
Unfortunately, I could not see any significant difference between the
success and failure cases. Can you please check that basic hostname to
IP address mapping is in place on every node (especially if you have
static resolution of hostnames set up)? A web search suggests this is
the most likely cause for users who have hit this problem. Also, do the
disks have enough free space? It would also be great if you could
upload your Hadoop configuration information.
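
For the hostname and disk checks, something like the rough Java sketch
below could be run on each node. It is only an illustration, not part of
Hadoop: the class name NodeSanityCheck and its methods are made up here,
and the directory and host names are whatever you pass on the command
line. It resolves each name to an address, resolves that address back to
a name, and prints the usable space of one local directory (for example
one of your mapred.local.dir entries, where map outputs are kept).

```java
import java.io.File;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Hadoop.
class NodeSanityCheck {

    // Compare two host names, ignoring case and a trailing dot.
    static boolean namesMatch(String expected, String actual) {
        return normalize(expected).equals(normalize(actual));
    }

    private static String normalize(String name) {
        String n = name.trim().toLowerCase(Locale.ROOT);
        return n.endsWith(".") ? n.substring(0, n.length() - 1) : n;
    }

    // Forward-resolve the name, then reverse-resolve the resulting address.
    static String describeHost(String host) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            // getCanonicalHostName() falls back to the textual IP address
            // when no reverse mapping can be found.
            String reverse = addr.getCanonicalHostName();
            String verdict = namesMatch(host, reverse) ? "OK" : "MISMATCH";
            return host + " -> " + addr.getHostAddress()
                    + " -> " + reverse + " [" + verdict + "]";
        } catch (UnknownHostException e) {
            return host + " -> UNRESOLVED";
        }
    }

    // Usable space, in megabytes, on the partition holding the directory.
    static long usableMegabytes(File dir) {
        return dir.getUsableSpace() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // args[0]: a local directory to check; remaining args: node host names.
        if (args.length < 1) {
            System.err.println("usage: NodeSanityCheck <local-dir> [host ...]");
            return;
        }
        File dir = new File(args[0]);
        System.out.println(dir + ": " + usableMegabytes(dir) + " MB usable");
        for (int i = 1; i < args.length; i++) {
            System.out.println(describeHost(args[i]));
        }
    }
}
```

A MISMATCH is not necessarily an error (for instance, a short name that
reverse-resolves to a fully qualified one), but UNRESOLVED, or different
answers on different nodes, would be worth looking into.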

I do think it is very likely that configuration is the actual problem,
because the job does work in one case.
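
One configuration-level thing that may be worth ruling out: reducers
fetch map output over HTTP from the tasktrackers, so each node has to be
able to reach every other node on the tasktracker HTTP port. The sketch
below is a hypothetical helper (the class FetchPortCheck is not part of
Hadoop) that simply tries to open a TCP connection. It assumes port
50060, which I believe is the 0.20 default for
mapred.task.tracker.http.address; adjust it if your mapred-site.xml
overrides that.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration only; not part of Hadoop.
class FetchPortCheck {

    // True if a TCP connection to host:port can be opened within the timeout.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException | IllegalArgumentException e) {
            // Covers unresolvable hosts, refused connections, timeouts,
            // and an out-of-range port number.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed default tasktracker HTTP port; change if overridden.
        int port = 50060;
        for (String host : args) {
            boolean ok = canConnect(host, port, 3000);
            System.out.println(host + ":" + port + " -> "
                    + (ok ? "reachable" : "NOT reachable"));
        }
    }
}
```

Running it from each node with the names of all the other nodes as
arguments would show whether a firewall or a wrong address mapping is
blocking the fetches.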


On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <bmdevelopment@gmail.com> wrote:
> Hello,
> I have still had no luck with this over the past week.
> I even get the exact same problem on a completely different 5 node cluster.
> Is it worth opening a new issue in JIRA for this?
> Thanks
> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <bmdevelopment@gmail.com> wrote:
>> Hello,
>> Thanks so much for the reply.
>> See inline.
>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yhemanth@gmail.com> wrote:
>>> Hi,
>>>> I've been getting the following error when trying to run a very simple
>>>> MapReduce job.
>>>> Map finishes without a problem, but the error occurs as soon as it enters
>>>> the Reduce phase.
>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>> I am running a 5 node cluster and I believe I have all my settings correct:
>>>> * ulimit -n 32768
>>>> * DNS/RDNS configured properly
>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>> The program is very simple - just counts a unique string in a log file.
>>>> See here: http://pastebin.com/5uRG3SFL
>>>> When I run, the job fails and I get the following output.
>>>> http://pastebin.com/AhW6StEb
>>>> However, it runs fine when I do *not* use substring() on the value (see
>>>> map function in code above).
>>>> This runs fine and completes successfully:
>>>>            String str = val.toString();
>>>> This causes the error and fails:
>>>>            String str = val.toString().substring(0,10);
>>>> Please let me know if you need any further information.
>>>> It would be greatly appreciated if anyone could shed some light on this problem.
>>> It is striking that changing the code to use a substring makes a
>>> difference. Assuming it is consistent and not a red herring,
>> Yes, this has been consistent over the last week. I was running 0.20.1
>> first and then
>> upgraded to 0.20.2, but the results have been exactly the same.
>>> can you look at the counters for the two jobs using the JobTracker web
>>> UI - things like map records, bytes etc and see if there is a
>>> noticeable difference ?
>> Ok, so here is the first job using write.set(value.toString()); having
>> *no* errors:
>> http://pastebin.com/xvy0iGwL
>> And here is the second job using
>> write.set(value.toString().substring(0, 10)); that fails:
>> http://pastebin.com/uGw6yNqv
>> And here is yet another where I used a longer, and therefore unique, string
>> via write.set(value.toString().substring(0, 20)); This makes every line
>> unique, similar to the first job.
>> Still fails.
>> http://pastebin.com/GdQ1rp8i
>>> Also, are the two programs being run against
>>> the exact same input data ?
>> Yes, exactly the same input: a single csv file with 23K lines.
>> Using a shorter string leads to more identical keys and therefore more
>> combining/reducing, but going
>> by the above it seems to fail whether the substring/key is entirely
>> unique (23000 combine output records) or
>> mostly the same (9 combine output records).
>>> Also, since the cluster size is small, you could also look at the
>>> tasktracker logs on the machines where the maps have run to see if
>>> there are any failures when the reduce attempts start failing.
>> Here is the TT log from the last failed job. I do not see anything
>> besides the shuffle failure, but there
>> may be something I am overlooking or simply do not understand.
>> http://pastebin.com/DKFTyGXg
>> Thanks again!
>>> Thanks
>>> Hemanth
