hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Hung job
Date Mon, 13 Mar 2006 22:43:44 GMT
stack wrote:
> In synopsis the problem goes as follows:
> 
> If a reduce cannot pick up map outputs -- for example, the output has 
> been moved aside because of a ChecksumException (See below stack trace) 
> -- then the job gets stuck with the reduce task trying and failing every 
> ten seconds or so to pick up the non-existent map output part.
> 
> Somehow the reduce needs to give up and the jobtracker needs to rerun 
> the map just as it would if the tasktracker had died completely.

Perhaps the TaskTracker should exit when it encounters errors reading
map output.  That way the jobtracker will re-schedule the map, and the
reduce task will wait until that map is re-done.

I've attached a patch.  The TaskTracker will restart, but with a new id, 
so all of its tasks will be considered lost.  This will unfortunately 
lose other map tasks done by this tasktracker, but at least things will 
keep going.
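
For illustration only (this is not the attached patch), the intended
behavior might be sketched roughly as below.  Every class and method
name here is hypothetical and none is taken from the Hadoop source.  The
sketch also assumes the jobtracker is told directly that a tracker id
has gone away, where a real jobtracker would presumably notice that the
old id had stopped reporting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; these are not Hadoop classes.
class LostTrackerSketch {

    /** Toy stand-in for the jobtracker's record of completed map tasks. */
    static class ToyJobTracker {
        // tracker id -> map tasks whose output lives on that tracker
        private final Map<String, List<String>> completedMaps = new HashMap<>();
        // map tasks waiting to be run again
        private final List<String> pendingMaps = new ArrayList<>();

        void mapCompleted(String trackerId, String mapTask) {
            completedMaps.computeIfAbsent(trackerId, k -> new ArrayList<>()).add(mapTask);
        }

        /** Treat every map output held by this tracker id as lost and queue it for re-run. */
        void trackerLost(String trackerId) {
            List<String> lost = completedMaps.remove(trackerId);
            if (lost != null) {
                pendingMaps.addAll(lost);
            }
        }

        List<String> pending() {
            return new ArrayList<>(pendingMaps);
        }
    }

    /** Toy tasktracker that abandons its id when it cannot read a map output. */
    static class ToyTaskTracker {
        private final String host;
        private final ToyJobTracker jobTracker;
        private int generation = 0;

        ToyTaskTracker(String host, ToyJobTracker jobTracker) {
            this.host = host;
            this.jobTracker = jobTracker;
        }

        String id() {
            return host + "_" + generation;
        }

        /** Simulates "exit and restart with a new id" after a read error. */
        void onMapOutputReadError() {
            String oldId = id();
            generation++;                  // the restarted tracker gets a new id
            jobTracker.trackerLost(oldId); // simplification: reported immediately
        }
    }

    /** Two maps finish on one tracker, then a read error makes both pending again. */
    static List<String> demo() {
        ToyJobTracker jt = new ToyJobTracker();
        ToyTaskTracker tt = new ToyTaskTracker("host1", jt);
        jt.mapCompleted(tt.id(), "map_0001");
        jt.mapCompleted(tt.id(), "map_0002");
        tt.onMapOutputReadError();
        return jt.pending();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that both map tasks are queued again, even though only one output
may have been unreadable.  That is the cost mentioned above: the other
map tasks done by the tasktracker are lost along with the bad one.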

Does this look right to you?

Doug
