hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-142) failed tasks should be rescheduled on different hosts after other jobs
Date Mon, 17 Apr 2006 22:26:17 GMT
failed tasks should be rescheduled on different hosts after other jobs
----------------------------------------------------------------------

         Key: HADOOP-142
         URL: http://issues.apache.org/jira/browse/HADOOP-142
     Project: Hadoop
        Type: Improvement

  Components: mapred  
    Versions: 0.1.1    
    Reporter: Owen O'Malley
 Assigned to: Owen O'Malley 
     Fix For: 0.2


Currently, when tasks fail, they are usually rerun immediately on the same host. This causes
problems in a couple of ways:
  1. The task is more likely to fail again on the same host.
  2. If there is cleanup code (such as clearing pendingCreates), it does not always run immediately,
leading to cascading failures.

For a first pass, I propose that when a task fails, we start the scan for new tasks to launch
at the next task of the same type (within that job). So if maps[99] fails, when we are
looking to assign new map tasks from this job, we scan in the order maps[100]...maps[N], maps[0]...maps[99].
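
As a rough illustration of the scan order described above, here is a minimal sketch in
Python. It is not Hadoop code: the function name scan_order and its parameters num_tasks
and last_failed are hypothetical, and the real job tracker may track this state differently.

```python
def scan_order(num_tasks, last_failed=None):
    """Return task indices in the order a scheduler would scan them.

    Hypothetical helper: scanning starts at the task after the most
    recently failed one and wraps around, so the failed task itself
    is considered last.
    """
    if num_tasks <= 0:
        return []
    if last_failed is None:
        # No failure recorded yet: scan from the beginning.
        return list(range(num_tasks))
    start = (last_failed + 1) % num_tasks
    return [(start + i) % num_tasks for i in range(num_tasks)]
```

With five map tasks and a failure at index 2, the scan visits 3, 4, 0, 1 and only then 2,
so other pending tasks of the job get a chance to run before the failed one is retried.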

A more involved change would avoid running tasks on nodes where they have failed before. This
is a little tricky, because you don't want to prevent re-execution of tasks on single-node clusters,
and the job tracker needs to schedule one task tracker at a time.
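
One possible shape for that check, sketched in Python under stated assumptions: the job
tracker is assumed to know the set of hosts a task has failed on and the set of hosts
currently in the cluster. The function can_run_on and its parameters are hypothetical and
only illustrate the single-node fallback; this is not the proposed implementation.

```python
def can_run_on(failed_hosts, host, cluster_hosts):
    """Decide whether a task may be assigned to the requesting host.

    failed_hosts:  hosts where this task has already failed (assumed tracked)
    host:          the task tracker host asking for work right now
    cluster_hosts: all hosts currently known to the job tracker (assumed)
    """
    if host not in failed_hosts:
        return True
    # The task has failed on every known host (always the case on a
    # single-node cluster after one failure): allow a retry rather than
    # leaving the task unschedulable.
    return all(h in failed_hosts for h in cluster_hosts)
```

Because the decision only looks at the one requesting host, it fits a scheduler that
handles one task tracker at a time.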


