hadoop-common-user mailing list archives

From Jothi Padmanabhan <joth...@yahoo-inc.com>
Subject Re: Reduce task attempt retry strategy
Date Tue, 14 Apr 2009 03:41:54 GMT
Usually, a task is killed when:
1. The user explicitly kills the task
2. The framework kills the task because it did not progress enough
3. The task was speculatively executed and another attempt completed first

Hence the reason for killing more often than not has nothing to do with
the health of the node where the task was running, but rather with the task
(user code) itself. It is very difficult to distinguish the case where
progress was not reported because the user code was faulty from the case
where progress was not reported because the node was slow.
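To illustrate the timeout decision described above, here is a minimal, self-contained sketch of a tracker-side watchdog that kills a task once it has been silent longer than the timeout. `TaskWatchdog`, `reportProgress`, and `shouldKill` are invented names for illustration, not actual Hadoop classes; the real knob in 0.19 is the `mapred.task.timeout` property (milliseconds), and user code resets the clock by calling `Reporter.progress()`.

```java
// Illustrative sketch only -- not Hadoop source code.
public class TaskWatchdog {
    private final long timeoutMs;        // analogous to mapred.task.timeout
    private volatile long lastProgress;  // updated on each status report

    public TaskWatchdog(long timeoutMs) {
        this.timeoutMs = timeoutMs;
        this.lastProgress = System.currentTimeMillis();
    }

    // Called whenever the task reports status (e.g. via Reporter.progress()).
    public void reportProgress() {
        lastProgress = System.currentTimeMillis();
    }

    // True when the task has been silent longer than the timeout.
    public boolean shouldKill(long nowMs) {
        return nowMs - lastProgress > timeoutMs;
    }

    public static void main(String[] args) {
        // 1800 s, matching the "failed to report status for 1802 seconds"
        // message quoted later in this thread.
        TaskWatchdog w = new TaskWatchdog(1_800_000L);
        long now = System.currentTimeMillis();
        System.out.println(w.shouldKill(now + 1_000));     // false: still fresh
        System.out.println(w.shouldKill(now + 1_900_000)); // true: silent too long
    }
}
```

Note that the watchdog only sees "no progress reported"; it cannot tell whether the silence came from stuck user code or from a sick node, which is exactly the ambiguity discussed above.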

Jothi

On 4/13/09 10:47 PM, "Stefan Will" <stefan.will@gmx.net> wrote:

> Jothi, thanks for the explanation. One question though: why shouldn't
> timed-out tasks be retried on a different machine? As you pointed out, it
> could very well have been due to the machine having problems. To me, a
> timeout is just like any other kind of failure.
> 
> -- Stefan
> 
> 
>> From: Jothi Padmanabhan <jothipn@yahoo-inc.com>
>> Reply-To: <core-user@hadoop.apache.org>
>> Date: Mon, 13 Apr 2009 19:00:38 +0530
>> To: <core-user@hadoop.apache.org>
>> Subject: Re: Reduce task attempt retry strategy
>> 
>> Currently, only failed tasks are retried on a node other than the one
>> where they failed. For killed tasks, there is no such retry policy.
>> 
>> "failed to report status" usually indicates that the task did not report
>> sufficient progress. However, it is possible that the task itself was not
>> progressing fast enough because the machine where it ran had problems.
>> 
>> 
>> On 4/8/09 12:33 AM, "Stefan Will" <stefan.will@gmx.net> wrote:
>> 
>>> My cluster has 27 nodes with a total reduce task capacity of 54. The job had
>>> 31 reducers. I actually had a task today that showed the behavior you're
>>> describing: 3 tries on one machine, and then the 4th on a different one.
>>> 
>>> As for the particular job I was talking about before:
>>> 
>>> Here are the stats for the job:
>>> 
>>> Kind      Total (succ+failed+killed)   Successful   Failed   Killed   Start Time            Finish Time
>>> Setup      1                            1            0        0       4-Apr-2009 00:30:16   4-Apr-2009 00:30:33 (17sec)
>>> Map       64                           49           12        3       4-Apr-2009 00:30:33   4-Apr-2009 01:11:15 (40mins, 41sec)
>>> Reduce    34                           30            4        0       4-Apr-2009 00:30:44   4-Apr-2009 04:31:36 (4hrs, 52sec)
>>> Cleanup    4                            0            4        0       4-Apr-2009 04:31:36   4-Apr-2009 06:32:00 (2hrs, 24sec)
>>> 
>>> 
>>> Not sure what to look for in the jobtracker log. All it shows for that
>>> particular failed task is that it assigned it to the same machine 4 times
>>> and then eventually failed. Perhaps something to note is that the 4 failures
>>> were all due to timeouts:
>>> 
>>> "Task attempt_200904031942_0002_r_000013_3 failed to report status for 1802
>>> seconds. Killing!"
>>> 
>>> Also, looking at the logs, there was a map task too that was retried on
>>> that particular box 4 times without going to a different one. Perhaps it
>>> had something to do with the way this machine failed: the jobtracker still
>>> considered it live, while all actual tasks assigned to it timed out.
>>> 
>>> -- Stefan
>>> 
>>> 
>>> 
>>>> From: Amar Kamat <amarrk@yahoo-inc.com>
>>>> Reply-To: <core-user@hadoop.apache.org>
>>>> Date: Tue, 07 Apr 2009 10:05:16 +0530
>>>> To: <core-user@hadoop.apache.org>
>>>> Subject: Re: Reduce task attempt retry strategy
>>>> 
>>>> Stefan Will wrote:
>>>>> Hi,
>>>>> 
>>>>> I had a flaky machine the other day that was still accepting jobs and
>>>>> sending heartbeats, but caused all reduce task attempts to fail. This in
>>>>> turn caused the whole job to fail because the same reduce task was
>>>>> retried 3 times on that particular machine.
>>>>>   
>>>> What is your cluster size? If a task fails on a machine, then it is
>>>> retried on some other machine (based on the number of good machines left
>>>> in the cluster). After a certain number of failures, the machine will be
>>>> blacklisted (again based on the number of machines left in the cluster).
>>>> 3 different reducers might be scheduled on that machine, but that should
>>>> not lead to job failure. Can you explain in detail what exactly
>>>> happened? Find out where the attempts got scheduled from the
>>>> jobtracker's log.
>>>> Amar
>>>>> Perhaps I'm confusing this with the block placement strategy in HDFS,
>>>>> but I always thought that the framework would retry jobs on a different
>>>>> machine if retries on the original machine keep failing. E.g. I would
>>>>> have expected it to retry once or twice on the same machine, but then
>>>>> switch to a different one to minimize the likelihood of getting stuck
>>>>> on a bad machine.
>>>>> 
>>>>> What is the expected behavior in 0.19.1 (which I'm running)? Any plans
>>>>> for improving on this in the future?
>>>>> 
>>>>> Thanks,
>>>>> Stefan
>>>>> 
>>>>>   
>>> 
>>> 
> 
> 

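The retry-and-blacklist policy sketched in the quoted replies above (a failed attempt prefers a node it has not failed on; a node accumulating too many failures gets blacklisted) can be illustrated roughly as follows. `RetryScheduler`, `recordFailure`, and `pickNode` are invented names for a sketch, not Hadoop's actual scheduler classes, and per the thread this applies to failed attempts only, not killed ones.

```java
import java.util.*;

// Illustrative sketch only -- not Hadoop source code.
public class RetryScheduler {
    private final Map<String, Integer> failuresPerNode = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();
    private final int maxNodeFailures;

    public RetryScheduler(int maxNodeFailures) {
        this.maxNodeFailures = maxNodeFailures;
    }

    // Count a task failure against a node; blacklist it past the threshold.
    public void recordFailure(String node) {
        int n = failuresPerNode.merge(node, 1, Integer::sum);
        if (n >= maxNodeFailures) {
            blacklisted.add(node);
        }
    }

    // Prefer a healthy node the attempt has not failed on yet; on a small
    // cluster, fall back to any non-blacklisted node; null if none remain.
    public String pickNode(List<String> cluster, Set<String> failedOn) {
        for (String node : cluster) {
            if (!blacklisted.contains(node) && !failedOn.contains(node)) {
                return node;
            }
        }
        for (String node : cluster) {
            if (!blacklisted.contains(node)) {
                return node;
            }
        }
        return null;
    }
}
```

On Stefan's flaky-but-heartbeating machine, every attempt fails by timeout; since timed-out attempts are killed rather than failed in 0.19, nothing in this policy ever steers the task away from that node, which matches the behavior he observed.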
