hadoop-common-user mailing list archives

From "Ellis H. Wilson III" <el...@cse.psu.edu>
Subject Re: Error: Too Many Fetch Failures
Date Tue, 19 Jun 2012 19:18:41 GMT
On 06/19/12 14:11, Minh Duc Nguyen wrote:
> Take a look at slide 25:
> http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
>
> It describes a similar error so hopefully this will help you.

I appreciate your prompt response, Minh, but as you'll notice at the 
end of my original email, I mentioned that I had already seen this 
slide deck and tried two of its solutions, to no avail.  I should also 
note that I added every node to /etc/hosts on each machine, so that if 
this were a DNS issue, that would handle it.
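
For example, every node carries entries of this shape (the addresses 
and hostnames here are illustrative, not my real ones):

    192.168.1.101   node01.cluster.local   node01
    192.168.1.102   node02.cluster.local   node02
    ...and so on for all 50 machines.
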
The only other proposed solution suggested upgrading Jetty, but I 
wasn't sure (sorry for the naiveté) how one can even tell which 
version of Jetty is in use.  Any ideas?  Or is this no longer an issue 
with Hadoop 2.0?
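
For what it's worth, the closest I've gotten is listing the Jetty jars 
that Hadoop ships, though I don't know whether the jar filename 
reliably reflects what is actually serving the shuffle (this assumes a 
stock tarball layout):

    # the bundled Jetty version is encoded in the jar filenames
    find $HADOOP_HOME -name 'jetty-*.jar'
    # e.g. .../share/hadoop/common/lib/jetty-6.1.26.jar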

Best,

ellis


> On Tue, Jun 19, 2012 at 10:27 AM, Ellis H. Wilson III <ellis@cse.psu.edu> wrote:
>> Hi all,
>>
>> This is my first email to the list, so feel free to be candid in your
>> complaints if I'm doing something canonically uncouth in my requests for
>> assistance.
>>
>> I'm using Hadoop 0.23 on 50 machines, each connected via gigabit
>> Ethernet and each with only a single hard disk.  I am getting the
>> following error repeatedly with the TeraSort benchmark.  TeraGen runs
>> without error, but TeraSort proceeds predictably until the error pops
>> up between 64% and 70% completion.  It doesn't happen on every
>> execution: about one run in four completes successfully (TeraValidate
>> included).
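>>
>> For reference, I'm driving the benchmark roughly like this (the row
>> count and HDFS paths here are illustrative, not my exact values):
>>
>>     # 10 billion 100-byte rows, i.e. roughly 1 TB of input
>>     EXAMPLES=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
>>     hadoop jar $EXAMPLES teragen 10000000000 /user/ellis/tera-in
>>     hadoop jar $EXAMPLES terasort /user/ellis/tera-in /user/ellis/tera-out
>>     hadoop jar $EXAMPLES teravalidate /user/ellis/tera-out /user/ellis/tera-report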
>>
>> Error at the CLI:
>> "12/06/10 11:17:50 INFO mapreduce.Job:  map 100% reduce 64%
>> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id :
>> attempt_1339331790635_0002_m_004337_0, Status : FAILED
>> Container killed by the ApplicationMaster.
>>
>> Too Many fetch failures.Failing the attempt
>> 12/06/10 11:21:45 WARN mapreduce.Job: Error reading task output Read timed
>> out
>> 12/06/10 11:23:06 WARN mapreduce.Job: Error reading task output Read timed
>> out
>> 12/06/10 11:23:07 INFO mapreduce.Job: Task Id :
>> attempt_1339331790635_0002_m_004613_0, Status : FAILED"
>>
>> I am still warming up to YARN, so I'm not yet deft at gathering all
>> the logfiles I need.  But from closer inspection of the logs I could
>> find, and of the machines themselves, the problem appears related to
>> a large number of sockets being open concurrently.  At some point no
>> further connections can be made from the requesting reduce to the map
>> that holds the desired data, leading the reducer to believe there is
>> some error in getting that data.  These errors continue to be spewed
>> about once every 3 minutes for about 45 minutes, until at last the
>> job dies completely.
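>>
>> In case it's useful, this is roughly how I've been watching the
>> sockets on a worker while the reducers fetch (plain Linux tools,
>> nothing Hadoop-specific):
>>
>>     # count established TCP connections during the shuffle
>>     netstat -tan | awk '$6 == "ESTABLISHED"' | wc -l
>>     # per-process file-descriptor limit; each open socket consumes an fd
>>     ulimit -n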
>>
>> I have attached my -site.xml files to give a better picture of my
>> configuration, and any and all suggestions or requests for more info
>> are welcome.  Things I have tried already, per the deck at
>> http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
>> (both set in mapred-site.xml, as sketched below):
>>
>> mapred.reduce.slowstart.completed.maps = 0.80 (seems to help, but it
>> hurts performance since I'm the only user on the cluster, and it
>> doesn't cure the problem -- it just raises the chance of completion
>> from about 1/4 to 1/3 at best)
>>
>> tasktracker.http.threads = 80 (the default is 40, I think, and I've
>> tried this and even much higher values to no avail)
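>>
>> For concreteness, both settings live in my mapred-site.xml, e.g.:
>>
>>     <property>
>>       <name>mapred.reduce.slowstart.completed.maps</name>
>>       <value>0.80</value>
>>     </property>
>>     <property>
>>       <name>tasktracker.http.threads</name>
>>       <value>80</value>
>>     </property>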
>>
>> Best, and Thanks in Advance,
>>
>> ellis
>

