hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HADOOP-142) failed tasks should be rescheduled on different hosts after other jobs
Date Tue, 18 Apr 2006 22:09:24 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-142?page=all ]
Doug Cutting resolved HADOOP-142:

    Resolution: Fixed

I just committed this.  Thanks, Owen!

> failed tasks should be rescheduled on different hosts after other jobs
> ----------------------------------------------------------------------
>          Key: HADOOP-142
>          URL: http://issues.apache.org/jira/browse/HADOOP-142
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.1.1
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.2
>  Attachments: no-repeat-failures.patch
> Currently when tasks fail, they are usually rerun immediately on the same host. This causes problems in a couple of ways:
>   1. The task is more likely to fail again on the same host.
>   2. If there is cleanup code (such as clearing pendingCreates), it does not always run immediately, leading to cascading failures.
> For a first pass, I propose that when a task fails, we start the scan for new tasks to launch at the following task of the same type (within that job). So if maps[99] fails, when we are looking to assign new map tasks from this job, we scan in the order maps[100]...maps[N], maps[0]...maps[99].
> A more involved change would avoid running a task on nodes where it has failed before. This is a little tricky, because you don't want to prevent re-execution of tasks on one-node clusters, and the job tracker needs to schedule one task tracker at a time.

This message is automatically generated by JIRA.
