hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
Date Tue, 05 Jun 2007 15:28:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501591 ]

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

Some early thoughts...

Bottom line: we don't want the reducer, and hence the job, to get stuck forever.

The main issue is that when a reducer is stuck in shuffle it's hard to say accurately whether
the fault lies with the map (Jetty acting weird), with the reduce, or with both. Having said that,
it's pertinent to keep in mind that _normally_ maps are cheaper to re-execute.

Given the above I'd like to propose something along these lines:

a) The reduce maintains a per-map count of fetch failures.

b) Given sufficient fetch failures per map (say 3 or 4), the reducer complains to the
JobTracker via a new RPC:
{code:title=JobTracker.java}
public synchronized void notifyFailedFetch(String reduceTaskId, String mapTaskId) {
  // Bump the per-map count of failed-fetch notifications; once it crosses
  // a threshold, fail the map and re-schedule it elsewhere (see (c) below).
  // ...
}
{code}

c) The JobTracker maintains a per-map count of failed-fetch notifications, and given a sufficient
number of them (say 2 or 3?) from *any* reducer (even multiple times from the same reducer), it fails
the map and re-schedules it elsewhere.
  
  This handles two cases: (i) faulty maps are re-executed, and (ii) the corner case where only the last
reducer is stuck on a given map, and hence the map has to be re-executed anyway.

d) To counter the case of faulty reduces, we could implement a scheme where the reducer kills
itself once it has notified the JobTracker of more than, say, 5 unique faulty fetches. This will
ensure that a faulty reducer does not result in the JobTracker spawning maps willy-nilly...
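The JobTracker-side bookkeeping in (c) could be sketched roughly as below. This is only an illustrative standalone sketch, not actual Hadoop code: the class name {{FetchFailureTracker}} and the threshold constant are made up, and the real implementation would live inside the JobTracker and hook into its scheduling logic.

{code:title=FetchFailureTracker.java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of (c): count failed-fetch notifications per map,
// and signal when a map should be declared lost and re-scheduled.
public class FetchFailureTracker {
  // Illustrative threshold: fail the map after this many notifications
  // (from any reducer, including repeats from the same reducer).
  private static final int MAX_FETCH_FAILURES_PER_MAP = 3;

  // Per-map count of failed-fetch notifications.
  private final Map<String, Integer> fetchFailures = new HashMap<String, Integer>();

  // Called when a reducer reports repeated fetch failures for a map.
  // Returns true if the map should be failed and re-scheduled elsewhere.
  public synchronized boolean notifyFailedFetch(String reduceTaskId, String mapTaskId) {
    int count = fetchFailures.getOrDefault(mapTaskId, 0) + 1;
    fetchFailures.put(mapTaskId, count);
    if (count >= MAX_FETCH_FAILURES_PER_MAP) {
      // Reset the count so the re-executed map starts with a clean slate.
      fetchFailures.remove(mapTaskId);
      return true;
    }
    return false;
  }
}
{code}

Note the count is keyed only by map id, so complaints from different reducers about the same map accumulate, which is exactly the "from *any* reducer" behaviour in (c).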

Thoughts?

> JobTracker should collect statistics of failed map output fetches, and take decisions
to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep track (with feedback from reducers) of how many times a
fetch for a particular map output failed. If this exceeds a certain threshold, then that map
should be declared lost and re-executed elsewhere. Based on the number of such
complaints from reducers, the JobTracker can blacklist the TaskTracker. This will make the
framework more reliable - it will take care of (faulty) TaskTrackers that persistently fail
to serve up map outputs without raising or handling exceptions properly (e.g., when
the problem occurs inside the Jetty server).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

