Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
From: Hemanth Yamijala
To: mapreduce-user@hadoop.apache.org
Date: Fri, 25 Jun 2010 10:10:55 +0530

Hi,

> I've been getting the following error when trying to run a very simple
> MapReduce job. Map finishes without problem, but the error occurs as soon
> as it enters the Reduce phase.
>
> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
> attempt_201006241812_0001_r_000000_0, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>
> I am running a 5 node cluster and I believe I have all my settings correct:
>
> * ulimit -n 32768
> * DNS/RDNS configured properly
> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
> * mapred-site.xml : http://pastebin.com/JraVQZcW
>
> The program is very simple - just counts a unique string in a log file.
> See here: http://pastebin.com/5uRG3SFL
>
> When I run it, the job fails and I get the following output:
> http://pastebin.com/AhW6StEb
>
> However, it runs fine when I do *not* use substring() on the value (see
> the map function in the code above).
>
> This runs fine and completes successfully:
>             String str = val.toString();
>
> This causes the error and fails:
>             String str = val.toString().substring(0,10);
>
> Please let me know if you need any further information.
> It would be greatly appreciated if anyone could shed some light on this problem.

It stands out that changing the code to use a substring makes a
difference. Assuming that is consistent and not a red herring, can you
look at the counters for the two jobs in the JobTracker web UI - things
like map output records, map output bytes etc. - and see if there is a
noticeable difference? Also, are the two programs being run against
exactly the same input data?

Also, since the cluster is small, you could look at the tasktracker logs
on the machines where the maps ran to see whether there are any failures
around the time the reduce attempts start failing.

Thanks
Hemanth
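
For concreteness, here is a minimal sketch of the kind of mapper described
above - one that counts lines beginning with a fixed string - written against
the standard new-API Mapper. The class name, the target string and the length
guard are illustrative assumptions; this is not the code behind the pastebin
links.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch only - not the code from the pastebin links.
public class PrefixCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Hypothetical string being counted in the log file.
  private static final String TARGET = "2010-06-24";
  private static final IntWritable ONE = new IntWritable(1);
  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable key, Text val, Context context)
      throws IOException, InterruptedException {
    String str = val.toString();
    // Guard the substring call: lines shorter than 10 characters would
    // otherwise throw StringIndexOutOfBoundsException in the map.
    if (str.length() >= 10) {
      str = str.substring(0, 10);
    }
    if (str.equals(TARGET)) {
      outKey.set(str);
      context.write(outKey, ONE);
    }
  }
}

Comparing the "Map output records" and "Map output bytes" counters of the two
jobs, as suggested above, would show whether the substring version is feeding
noticeably different data into the shuffle.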