From: Vinod Kumar Vavilapalli
Subject: Re: Error: Too Many Fetch Failures
Date: Tue, 19 Jun 2012 10:38:33 -0700
To: common-user@hadoop.apache.org

Replies/more questions inline.

> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet and each having solely a single hard disk. I am getting the following error repeatably for the TeraSort benchmark. TeraGen runs without error, but TeraSort runs predictably until this error pops up between 64% and 70% completion.
> This doesn't occur for every execution of the benchmark; about one out of four times that I run the benchmark, it does run to completion (TeraValidate included).

How many containers are you running per node?

> Error at the CLI:
> "12/06/10 11:17:50 INFO mapreduce.Job:  map 100% reduce 64%
> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id : attempt_1339331790635_0002_m_004337_0, Status : FAILED
> Container killed by the ApplicationMaster.
>
> Too Many fetch failures.Failing the attempt

Clearly maps are getting killed because of fetch failures. Can you look at the logs of the NodeManager where this particular map task ran? Those may show why reducers are not able to fetch map outputs. Because you have only one disk per node, it is possible that some of these nodes have bad or nonfunctional disks, and that is what is causing the fetch failures.

If that is the case, you can either take those nodes offline or bump up mapreduce.reduce.shuffle.maxfetchfailures to tolerate the failures; the default is 10 (see the sketch below). There are some other tweaks I can suggest if you can find more details in your logs.

HTH,
+Vinod
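For reference, a minimal sketch of the tweak suggested above as a cluster-wide mapred-site.xml entry, assuming the property name and default of 10 quoted in the reply; the value 30 is only an illustrative bump, not a recommended setting:

    <!-- mapred-site.xml: tolerate more shuffle fetch failures per map
         before the map attempt is declared failed (default is 10). -->
    <property>
      <name>mapreduce.reduce.shuffle.maxfetchfailures</name>
      <value>30</value>
    </property>

It should also be possible to pass the same setting per job through Hadoop's generic options (e.g. -Dmapreduce.reduce.shuffle.maxfetchfailures=30 on the TeraSort command line), since the example jobs are run through ToolRunner.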