hadoop-common-user mailing list archives

From Vinod Kumar Vavilapalli <vino...@hortonworks.com>
Subject Re: Error: Too Many Fetch Failures
Date Tue, 19 Jun 2012 17:38:33 GMT

Replies/more questions inline.

> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet and each
> having only a single hard disk. I am getting the following error repeatedly for the
> TeraSort benchmark. TeraGen runs without error, but TeraSort runs predictably until
> this error pops up between 64% and 70% completion. This doesn't occur on every
> execution of the benchmark; about one out of four runs completes successfully
> (TeraValidate included).

How many containers are you running per node?
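In case it helps while you check: with 0.23-era YARN, the number of concurrent containers
per node is roughly the NodeManager's total container memory divided by the per-task
container size. A minimal sketch of the relevant settings (property names as in the stock
yarn-default/mapred-default; double-check them against your 0.23 build, and treat the
values below as examples only, not recommendations):

    <!-- yarn-site.xml: total memory the NodeManager offers to containers -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>8192</value>
    </property>

    <!-- mapred-site.xml: per-task container sizes -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>1024</value>
    </property>

With these example values you would get about 8192 / 1024 = 8 containers per node.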

> Error at the CLI:
> "12/06/10 11:17:50 INFO mapreduce.Job:  map 100% reduce 64%
> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id : attempt_1339331790635_0002_m_004337_0, Status : FAILED
> Container killed by the ApplicationMaster.
> Too Many fetch failures.Failing the attempt

Clearly, maps are getting killed because of fetch failures. Can you look at the logs of the
NodeManager where this particular map task ran? Those may contain clues as to why reducers
are unable to fetch the map outputs. Since you have only one disk per node, it is possible
that some nodes have bad or failing disks, which would cause fetch failures.
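While you dig, a rough sketch of how one might scan the NodeManager logs for shuffle and
disk trouble (the log directory varies by install; /var/log/hadoop is just an assumed
default here, substitute your own $YARN_LOG_DIR):

    # Assumed log location -- adjust to where your NodeManager writes its logs.
    # Look for shuffle/fetch errors on the suspect node:
    grep -iE "shuffle|fetch" /var/log/hadoop/*nodemanager*.log | grep -i error
    # And for disk-health complaints:
    grep -i "disk" /var/log/hadoop/*nodemanager*.log | grep -iE "fail|bad|error"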

If that is the case, you can either take these nodes offline or bump up
mapreduce.reduce.shuffle.maxfetchfailures to tolerate these failures (the default is 10).
There are some other tweaks I can suggest if you can find more details in your logs.
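If you go the tolerance route, the override would look something like this in
mapred-site.xml (30 is an arbitrary example value, not a tuned recommendation):

    <property>
      <name>mapreduce.reduce.shuffle.maxfetchfailures</name>
      <value>30</value>
    </property>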
