hadoop-common-user mailing list archives

From Stefan Will <stefan.w...@gmx.net>
Subject Re: Reduce task attempt retry strategy
Date Mon, 13 Apr 2009 17:17:59 GMT
Jothi, thanks for the explanation. One question though: why shouldn't
timed-out tasks be retried on a different machine? As you pointed out, the
timeout could very well have been due to the machine having problems. To me a
timeout is just like any other kind of failure.

-- Stefan


> From: Jothi Padmanabhan <jothipn@yahoo-inc.com>
> Reply-To: <core-user@hadoop.apache.org>
> Date: Mon, 13 Apr 2009 19:00:38 +0530
> To: <core-user@hadoop.apache.org>
> Subject: Re: Reduce task attempt retry strategy
> 
> Currently, only failed tasks are attempted on a node other than the one
> where they failed. For killed tasks, there is no such policy for retries.
> 
> "failed to report status" usually indicates that the task did not report
> sufficient progress. However, it is possible that the task itself was not
> progressing fast enough because the machine where it ran had problems.
> 
> 
> On 4/8/09 12:33 AM, "Stefan Will" <stefan.will@gmx.net> wrote:
> 
>> My cluster has 27 nodes with a total reduce task capacity of 54. The job had
>> 31 reducers. I actually had a task today that showed the behavior you're
>> describing: 3 tries on one machine, and then the 4th on a different one.
>> 
>> As for the particular job I was talking about before:
>> 
>> Here are the stats for the job:
>> 
>> Kind     Total  Successful  Failed  Killed  Start Time           Finish Time
>> Setup        1           1       0       0  4-Apr-2009 00:30:16  4-Apr-2009 00:30:33 (17sec)
>> Map         64          49      12       3  4-Apr-2009 00:30:33  4-Apr-2009 01:11:15 (40mins, 41sec)
>> Reduce      34          30       4       0  4-Apr-2009 00:30:44  4-Apr-2009 04:31:36 (4hrs, 52sec)
>> Cleanup      4           0       4       0  4-Apr-2009 04:31:36  4-Apr-2009 06:32:00 (2hrs, 24sec)
>> 
>> (Total = successful + failed + killed tasks)
>> 
>> 
>> Not sure what to look for in the jobtracker log. All it shows for that
>> particular failed task is that it assigned it to the same machine 4 times
>> and then eventually failed. Perhaps something to note is that the 4 failures
>> were all due to timeouts:
>> 
>> "Task attempt_200904031942_0002_r_000013_3 failed to report status for 1802
>> seconds. Killing!"
>> 
>> Also, looking at the logs, there was a map task too that was retried on that
>> particular box 4 times without going to a different one. Perhaps it had
>> something to do with the way this machine failed: The jobtracker still
>> considered it live, while all actual tasks assigned to it timed out.
>> 
>> -- Stefan
>> 
>> 
>> 
>>> From: Amar Kamat <amarrk@yahoo-inc.com>
>>> Reply-To: <core-user@hadoop.apache.org>
>>> Date: Tue, 07 Apr 2009 10:05:16 +0530
>>> To: <core-user@hadoop.apache.org>
>>> Subject: Re: Reduce task attempt retry strategy
>>> 
>>> Stefan Will wrote:
>>>> Hi,
>>>> 
>>>> I had a flaky machine the other day that was still accepting jobs and
>>>> sending heartbeats, but caused all reduce task attempts to fail. This in
>>>> turn caused the whole job to fail because the same reduce task was retried
>>>> 3 times on that particular machine.
>>>>   
>>> What is your cluster size? If a task fails on a machine then it is
>>> retried on some other machine (based on the number of good machines left
>>> in the cluster). After a certain number of failures, the machine will be
>>> blacklisted (again based on the number of machines left in the cluster). 3
>>> different reducers might be scheduled on that machine but that should
>>> not lead to job failure. Can you explain in detail what exactly
>>> happened? Find out where the attempts got scheduled from the
>>> jobtracker's log.
>>> Amar
>>>> Perhaps I'm confusing this with the block placement strategy in hdfs, but I
>>>> always thought that the framework would retry jobs on a different machine if
>>>> retries on the original machine keep failing. E.g. I would have expected to
>>>> retry once or twice on the same machine, but then switch to a different one
>>>> to minimize the likelihood of getting stuck on a bad machine.
>>>> 
>>>> What is the expected behavior in 0.19.1 (which I'm running)? Any plans for
>>>> improving on this in the future?
>>>> 
>>>> Thanks,
>>>> Stefan
>>>> 
>>>>   
>> 
>> 
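
The retry behavior discussed in this thread (failed attempts avoid the node
they failed on, while killed attempts such as timeouts carry no such
restriction) can be illustrated with a small toy model. This is only a sketch
under stated assumptions, not Hadoop's scheduler code: the names TaskHistory,
eligible_nodes and simulate are hypothetical, node selection is simplified to
"first eligible node in list order", and the fallback to all nodes when none
are left is an assumption. The avoid_killed flag models the alternative
Stefan asks about, where timeouts are treated like any other failure.

```python
from dataclasses import dataclass, field


@dataclass
class TaskHistory:
    """Per-task record of where attempts ended badly (hypothetical structure)."""
    failed_on: set = field(default_factory=set)
    killed_on: set = field(default_factory=set)

    def record(self, node, outcome):
        if outcome == "failed":
            self.failed_on.add(node)
        elif outcome == "killed":
            self.killed_on.add(node)
        else:
            raise ValueError(f"unknown outcome: {outcome}")


def eligible_nodes(nodes, history, avoid_killed=False):
    """Nodes a retry may be placed on.

    As described in the thread, only nodes where the task *failed* are
    avoided. With avoid_killed=True, nodes where it was *killed* (for
    example, on timeout) are avoided as well.
    """
    avoid = set(history.failed_on)
    if avoid_killed:
        avoid |= history.killed_on
    candidates = [n for n in nodes if n not in avoid]
    # Assumption: if every node is excluded, fall back to the full list.
    return candidates or list(nodes)


def simulate(nodes, flaky, max_attempts=4, avoid_killed=False):
    """Return the nodes tried, assuming every attempt on `flaky` times out."""
    history = TaskHistory()
    placements = []
    for _ in range(max_attempts):
        # Simplification: always take the first eligible node.
        node = eligible_nodes(nodes, history, avoid_killed)[0]
        placements.append(node)
        if node != flaky:
            break  # attempt succeeded on a healthy node
        history.record(node, "killed")
    return placements


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3"]
    print(simulate(nodes, flaky="n1"))                     # policy as described
    print(simulate(nodes, flaky="n1", avoid_killed=True))  # timeouts avoided too
```

In this model the first call keeps placing the task on n1 until the four
attempts are used up, which matches the symptom reported above; the second
call moves to n2 after the first timeout.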


