hadoop-common-user mailing list archives

From Amar Kamat <ama...@yahoo-inc.com>
Subject Re: Reduce task attempt retry strategy
Date Tue, 07 Apr 2009 04:35:16 GMT
Stefan Will wrote:
> Hi,
> I had a flaky machine the other day that was still accepting jobs and
> sending heartbeats, but caused all reduce task attempts to fail. This in
> turn caused the whole job to fail because the same reduce task was retried 3
> times on that particular machine.
What is your cluster size? If a task fails on a machine, it is retried
on some other machine (depending on the number of good machines left in
the cluster). After a certain number of failures, the machine is
blacklisted (again depending on the number of machines left in the
cluster). Three different reducers might be scheduled on that machine,
but that should not lead to job failure. Can you explain in detail what
exactly happened? You can find out where the attempts were scheduled
from the jobtracker's log.
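
To make the retry behavior described above concrete, here is a minimal Java sketch. It is a toy model and not the jobtracker's actual code: the class RetrySketch, its method names, and the threshold of 4 failures are all assumptions made for illustration. It shows a failed task being steered to a different machine, a fallback to an already-tried machine when no untried one is left (the small-cluster case), and a machine being avoided entirely once it has accumulated enough failures.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of "retry elsewhere, then blacklist". Not Hadoop code. */
class RetrySketch {
    // Hypothetical threshold: task failures on one machine before it is avoided.
    static final int MAX_FAILURES_PER_TRACKER = 4;

    // Number of failed attempts seen on each machine.
    private final Map<String, Integer> failuresPerTracker = new HashMap<>();
    // For each task, the machines on which it has already failed.
    private final Map<String, Set<String>> failedOn = new HashMap<>();

    /** Record that an attempt of the given task failed on the given machine. */
    void recordFailure(String task, String tracker) {
        failuresPerTracker.merge(tracker, 1, Integer::sum);
        failedOn.computeIfAbsent(task, k -> new HashSet<>()).add(tracker);
    }

    /** A machine is blacklisted once it reaches the failure threshold. */
    boolean isBlacklisted(String tracker) {
        return failuresPerTracker.getOrDefault(tracker, 0) >= MAX_FAILURES_PER_TRACKER;
    }

    /**
     * Choose a machine for the next attempt of a task.
     * Prefers a non-blacklisted machine the task has not failed on yet.
     * If the task has failed on every usable machine, falls back to the
     * first non-blacklisted one. Returns null if every machine is blacklisted.
     */
    String pickTracker(String task, List<String> trackers) {
        Set<String> avoid = failedOn.getOrDefault(task, Collections.emptySet());
        String fallback = null;
        for (String tracker : trackers) {
            if (isBlacklisted(tracker)) {
                continue;
            }
            if (!avoid.contains(tracker)) {
                return tracker;
            }
            if (fallback == null) {
                fallback = tracker;
            }
        }
        return fallback;
    }
}
```

In this model, a reducer that fails on machine A is next offered machine B, and A only comes back into play for that reducer if every other usable machine has also failed it. The sketch leaves out how the real blacklisting decision depends on the number of machines left in the cluster.
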
> Perhaps I'm confusing this with the block placement strategy in hdfs, but I
> always thought that the framework would retry jobs on a different machine if
> retries on the original machine keep failing. E.g. I would have expected to
> retry once or twice on the same machine, but then switch to a different one
> to minimize the likelihood of getting stuck on a bad machine.
> What is the expected behavior in 0.19.1 (which I'm running)? Any plans for
> improving on this in the future?
> Thanks,
> Stefan
