hadoop-common-user mailing list archives

From "Billy Pearson" <sa...@pearsonwholesale.com>
Subject Re: Reduce task attempt retry strategy
Date Tue, 07 Apr 2009 05:41:58 GMT
I've seen the same thing happening on the 0.19 branch.

When a task fails on the reduce end, it always retries on the same node until
the job is killed for too many failed attempts on that one reduce task.
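
For reference, a minimal sketch of the job-level knobs that seem to govern this in
the old mapred API (parameter names and defaults are from the 0.19-era configuration,
so worth double-checking): lowering the per-tracker failure threshold below the
per-task attempt limit should get a bad TaskTracker blacklisted for the job before
one reduce task burns all of its attempts on it. Class name and paths below are
made up for illustration.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical example job; the two tuning calls are the point here.
public class RetryTuningExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(RetryTuningExample.class);
    conf.setJobName("retry-tuning-example");

    // mapred.reduce.max.attempts: how many times one reduce task may be
    // attempted before the whole job is failed (default believed to be 4).
    conf.setMaxReduceAttempts(4);

    // mapred.max.tracker.failures: task failures on a single TaskTracker
    // before that tracker is blacklisted for this job. Setting it below the
    // attempt limit should push later attempts onto other nodes.
    conf.setMaxTaskFailuresPerTracker(2);

    // Assumed input/output paths; mapper and reducer left at the identity defaults.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}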

I am running a cluster of 7 nodes.

Billy


"Stefan Will" <stefan.will@gmx.net> wrote in message 
news:C5FF7F91.18C09%stefan.will@gmx.net...
Hi,

I had a flaky machine the other day that was still accepting jobs and
sending heartbeats, but caused all reduce task attempts to fail. This in
turn caused the whole job to fail because the same reduce task was retried 3
times on that particular machine.

Perhaps I'm confusing this with the block placement strategy in HDFS, but I
always thought that the framework would retry tasks on a different machine if
attempts on the original machine keep failing. E.g., I would have expected it to
retry once or twice on the same machine, but then switch to a different one
to minimize the likelihood of getting stuck on a bad machine.

What is the expected behavior in 0.19.1 (which I'm running)? Any plans for
improving on this in the future?

Thanks,
Stefan


