hadoop-common-user mailing list archives

From: Deepak Diwakar <ddeepa...@gmail.com>
Subject: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Date: Tue, 27 Jul 2010 19:31:28 GMT
Hey friends,

I got stuck setting up an HDFS cluster and am getting this error while
running the simple wordcount example (I did the same thing two years back
without any problem).

I am currently testing hadoop-0.20.1 with 2 nodes, following the
instructions from
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29

I checked the firewall settings and /etc/hosts, and there is no issue
there. The master and slave are also accessible from each other in both
directions.
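
To double-check the DNS side, here is a minimal sketch (the hostnames
"master" and "slave" are placeholders for your own) that verifies forward
and reverse lookups agree on every node; reducers fetch map output over
HTTP from resolved hostnames, so a mismatch there is a classic cause of
this shuffle error:

    // HostCheck.java - verify that forward and reverse DNS lookups agree
    // for each cluster node.
    import java.net.InetAddress;

    public class HostCheck {
        public static void main(String[] args) throws Exception {
            String[] hosts = (args.length > 0) ? args : new String[] { "master", "slave" };
            for (String host : hosts) {
                InetAddress addr = InetAddress.getByName(host);   // forward lookup
                String reverse = addr.getCanonicalHostName();     // reverse lookup
                System.out.println(host + " -> " + addr.getHostAddress()
                        + " -> " + reverse);
                // The reverse name should map back to the host you started with.
            }
        }
    }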

Also, the input size is very small (~3 MB), so there shouldn't be any
issue with ulimit (it is set to 4096, by the way).

I would be really thankful if anyone can guide me to resolve this.

Thanks & regards,
- Deepak Diwakar,

On 28 June 2010 18:39, bmdevelopment <bmdevelopment@gmail.com> wrote:

> Hi, sorry for the cross-post, but I am just trying to see if anyone else
> has had this issue before.
> Thanks
>
>
> ---------- Forwarded message ----------
> From: bmdevelopment <bmdevelopment@gmail.com>
> Date: Fri, Jun 25, 2010 at 10:56 AM
> Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES;
> bailing-out.
> To: mapreduce-user@hadoop.apache.org
>
>
> Hello,
> Thanks so much for the reply.
> See inline.
>
> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <yhemanth@gmail.com>
> wrote:
> > Hi,
> >
> >> I've been getting the following error when trying to run a very simple
> >> MapReduce job. The map finishes without problem, but the error occurs
> >> as soon as the job enters the reduce phase.
> >>
> >> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> >> attempt_201006241812_0001_r_000000_0, Status : FAILED
> >> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> >>
> >> I am running a 5-node cluster and I believe I have all my settings
> >> correct:
> >>
> >> * ulimit -n 32768
> >> * DNS/RDNS configured properly
> >> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> >> * mapred-site.xml : http://pastebin.com/JraVQZcW
> >>
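> >> One shuffle-related knob that may also be worth checking alongside
> >> those files (assuming the 0.20 defaults) is the number of HTTP threads
> >> each tasktracker uses to serve map output to reducers; a sketch of the
> >> corresponding mapred-site.xml entry:
> >>
> >>     <property>
> >>       <!-- defaults to 40; too few threads can starve reduce-side
> >>            shuffle fetches on a busy tasktracker -->
> >>       <name>tasktracker.http.threads</name>
> >>       <value>80</value>
> >>     </property>
> >>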
> >> The program is very simple - just counts a unique string in a log file.
> >> See here: http://pastebin.com/5uRG3SFL
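> >>
> >> In case that pastebin link goes stale, the mapper is shaped roughly
> >> like this (a minimal sketch of the description above, not the exact
> >> pasted code; the class name is illustrative):
> >>
> >>     // Emit a key derived from each input line and count occurrences
> >>     // with a standard sum combiner/reducer.
> >>     import java.io.IOException;
> >>     import org.apache.hadoop.io.IntWritable;
> >>     import org.apache.hadoop.io.LongWritable;
> >>     import org.apache.hadoop.io.Text;
> >>     import org.apache.hadoop.mapreduce.Mapper;
> >>
> >>     public class LineCountMapper
> >>             extends Mapper<LongWritable, Text, Text, IntWritable> {
> >>         private static final IntWritable ONE = new IntWritable(1);
> >>         private final Text write = new Text();
> >>
> >>         @Override
> >>         protected void map(LongWritable key, Text value, Context context)
> >>                 throws IOException, InterruptedException {
> >>             write.set(value.toString());   // failing variant: .substring(0, 10)
> >>             context.write(write, ONE);
> >>         }
> >>     }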
> >>
> >> When I run, the job fails and I get the following output.
> >> http://pastebin.com/AhW6StEb
> >>
> >> However, it runs fine when I do *not* use substring() on the value (see
> >> the map function in the code above).
> >>
> >> This runs fine and completes successfully:
> >>            String str = val.toString();
> >>
> >> This causes error and fails:
> >>            String str = val.toString().substring(0,10);
> >>
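> >> A detail that may be worth ruling out: substring(0, 10) throws
> >> StringIndexOutOfBoundsException on any line shorter than ten
> >> characters. A guarded sketch of the same call:
> >>
> >>     // Truncate to at most 10 characters without risking an exception
> >>     // on short lines.
> >>     String s = val.toString();
> >>     String str = s.length() >= 10 ? s.substring(0, 10) : s;
> >>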
> >> Please let me know if you need any further information.
> >> It would be greatly appreciated if anyone could shed some light on this
> >> problem.
> >
> > It is striking that changing the code to use a substring is
> > causing a difference. Assuming it is consistent and not a red herring,
>
> Yes, this has been consistent over the last week. I was running 0.20.1
> first and then upgraded to 0.20.2, but the results have been exactly
> the same.
>
> > can you look at the counters for the two jobs using the JobTracker web
> > UI - things like map records, bytes, etc. - and see if there is a
> > noticeable difference?
>
> Ok, so here is the first job, using write.set(value.toString()); it has
> *no* errors:
> http://pastebin.com/xvy0iGwL
>
> And here is the second job using
> write.set(value.toString().substring(0, 10)); that fails:
> http://pastebin.com/uGw6yNqv
>
> And here is yet another, where I used a longer and therefore unique
> string via write.set(value.toString().substring(0, 20)); this makes
> every line unique, similar to the first job.
> It still fails.
> http://pastebin.com/GdQ1rp8i
>
> > Also, are the two programs being run against
> > the exact same input data?
>
> Yes, exactly the same input: a single CSV file with 23K lines.
> Using a shorter string leads to more identical keys and therefore more
> combining/reducing, but going by the above it seems to fail whether the
> substring/key is entirely unique (23000 combine output records) or
> mostly the same (9 combine output records).
>
> >
> > Also, since the cluster size is small, you could also look at the
> > tasktracker logs on the machines where the maps have run to see if
> > there are any failures when the reduce attempts start failing.
>
> Here is the TT log from the last failed job. I do not see anything
> besides the shuffle failure, but there
> may be something I am overlooking or simply do not understand.
> http://pastebin.com/DKFTyGXg
>
> Thanks again!
>
> >
> > Thanks
> > Hemanth
> >
>
