hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
Date Thu, 07 Jun 2007 03:28:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501870 ]

Arun C Murthy edited comment on HADOOP-1158 at 6/6/07 8:27 PM:
---------------------------------------------------------------

bq.b) Given sufficient fetch-failures per-map (say 3 or 4), the reducer then complains to
the JobTracker via a new rpc:

I take that back. I propose we augment {{TaskStatus}} itself to let the JobTracker know about
the failed fetches, i.e. the map taskids.

We could just add a new RPC to {{TaskUmbilicalProtocol}} for the reduce-task to let the
TaskTracker know about the failed fetch:
{code:title=TaskUmbilical.java}
void fetchError(String taskId, String failedFetchMapTaskId);
{code}
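To make the reduce-side bookkeeping concrete, here is a rough sketch of how the reduce-task could count per-map fetch failures and decide when to invoke the proposed {{fetchError}} call. The class name, method names, and the threshold of 3 are illustrative only, not existing Hadoop code:

{code:title=FetchFailureTracker.java}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: per-map fetch-failure bookkeeping on the
// reduce side. Not an existing Hadoop class.
public class FetchFailureTracker {
    // "say 3 or 4" per the discussion above; value is an assumption
    private static final int MAX_FETCH_FAILURES = 3;

    private final Map<String, Integer> failuresPerMap =
        new HashMap<String, Integer>();

    /**
     * Record one failed fetch for the given map task.
     * @return true once this map's failure count reaches the threshold,
     *         at which point the reducer would report it via fetchError.
     */
    public boolean recordFailure(String mapTaskId) {
        Integer previous = failuresPerMap.get(mapTaskId);
        int count = (previous == null) ? 1 : previous.intValue() + 1;
        failuresPerMap.put(mapTaskId, count);
        return count >= MAX_FETCH_FAILURES;
    }
}
{code}

When {{recordFailure}} returns true, the reduce-task would invoke the {{fetchError(taskId, failedFetchMapTaskId)}} umbilical call for that map.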

Even better, though a tad more involved, would be to rework 
{code:title=TaskUmbilical.java}
  void progress(String taskid, float progress, String state, 
                            TaskStatus.Phase phase, Counters counters)
   throws IOException, InterruptedException;
{code}
as
{code:title=TaskUmbilical.java}
  void progress(String taskid, TaskStatus taskStatus)
   throws IOException, InterruptedException;
{code}

This simplifies the flow: the child-vm itself computes its {{TaskStatus}} (which will be
augmented to contain the failed-fetch map ids) and sends it along to the {{TaskTracker}},
which just forwards it to the {{JobTracker}}, thereby relieving the TaskTracker of some of
its responsibilities vis-a-vis computing the {{TaskStatus}}. Clearly this could be linked to
the reporting re-design at HADOOP-1462 ...
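For illustration, here is a stripped-down sketch of what an augmented {{TaskStatus}} might carry. The real {{org.apache.hadoop.mapred.TaskStatus}} has many more fields, and the accessor names below are made up for this sketch:

{code:title=TaskStatus.java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch of a TaskStatus augmented with failed-fetch map
// ids, as proposed above. Field and method names are assumptions.
public class TaskStatus {
    private final String taskId;
    private float progress;
    // New: map taskids whose output this reducer repeatedly failed to fetch
    private final List<String> failedFetchMapIds = new ArrayList<String>();

    public TaskStatus(String taskId) {
        this.taskId = taskId;
    }

    public String getTaskId() { return taskId; }

    public void setProgress(float progress) { this.progress = progress; }
    public float getProgress() { return progress; }

    // Computed in the child-vm; the TaskTracker forwards it unchanged
    // to the JobTracker, which can then decide to re-execute maps or
    // blacklist the faulty TaskTracker.
    public void addFailedFetchMapId(String mapTaskId) {
        failedFetchMapIds.add(mapTaskId);
    }

    public List<String> getFailedFetchMapIds() {
        return Collections.unmodifiableList(failedFetchMapIds);
    }
}
{code}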

Thoughts?


 was:
bq.b) Given sufficient fetch-failures per-map (say 3 or 4), the reducer then complains to
the JobTracker via a new rpc:

I take that back. I propose we augment TaskStatus itself to let the JobTracker know about
the failed fetches, i.e. the map taskids; we could just add a new RPC to TaskUmbilicalProtocol
for the reduce-task to let the TaskTracker know about the failed fetch.

> JobTracker should collect statistics of failed map output fetches, and take decisions
to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep track (with feedback from Reducers) of how many times a fetch
for a particular map output failed. If this exceeds a certain threshold, then that map should
be declared lost and re-executed elsewhere. Based on the number of such complaints from
Reducers, the JobTracker can blacklist the TaskTracker. This will make the framework more
reliable: it will take care of (faulty) TaskTrackers that consistently fail to serve up map
outputs without exceptions being properly raised/handled, e.g., when the problem happens in
the Jetty server.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

