hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1158) JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
Date Wed, 06 Jun 2007 09:12:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501859 ]

Arun C Murthy commented on HADOOP-1158:
---------------------------------------

bq. The reduce should kill itself when it fails to fetch the map output from even the new
location, i.e., the 5 unique faulty fetches should each have at least 1 retrial (i.e., we
don't kill a reduce too early).

Though it makes sense in the long term, I'd vote we keep it simple for now: implementing
this would entail more complex code and more state to maintain. Five notifications anyway
mean that the reducer has seen 20 fetch attempts fail, across 5 different maps. I'd say
that is, for now, sufficient reason to kill the reducer (see the sketch below).
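
For concreteness, here is a minimal sketch of that reducer-side bookkeeping. The class and
method names, the notification placeholder, and the 4-attempts-per-map retry limit are
illustrative assumptions, not the actual ReduceTask code:

{code:java}
// Hypothetical sketch: track failed fetches per map, notify the JobTracker
// once a map's output is given up on, and kill the reducer after 5 such maps.
import java.util.HashMap;
import java.util.Map;

public class FetchFailureTracker {
  private static final int MAX_FETCH_ATTEMPTS_PER_MAP = 4; // assumed retry limit
  private static final int MAX_FAILED_UNIQUE_MAPS = 5;     // the 5 notifications above

  // failed fetch attempts so far, keyed by map task id
  private final Map<String, Integer> failuresPerMap = new HashMap<String, Integer>();
  private int mapsGivenUpOn = 0;

  /** Record one failed fetch; returns true if the reducer should kill itself. */
  public synchronized boolean fetchFailed(String mapTaskId) {
    Integer prev = failuresPerMap.get(mapTaskId);
    int failures = (prev == null) ? 1 : prev.intValue() + 1;
    failuresPerMap.put(mapTaskId, failures);
    if (failures == MAX_FETCH_ATTEMPTS_PER_MAP) {
      mapsGivenUpOn++;
      notifyJobTracker(mapTaskId); // tell the JobTracker this map's output is unfetchable
    }
    // 5 unique maps * 4 attempts each = the 20 failed fetches mentioned above
    return mapsGivenUpOn >= MAX_FAILED_UNIQUE_MAPS;
  }

  private void notifyJobTracker(String mapTaskId) {
    // placeholder: the real code would report this over the task umbilical/RPC
  }
}
{code}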

bq. Also, does it make sense to have the logic behind killing/reexecuting reduces in the
JobTracker? Two reasons:
bq. 1) since the JobTracker knows very well how many times a reduce complained, and for
which maps it complained, etc.

If the reducer kills itself, the JobTracker need not maintain information about *which*
reduces failed to fetch *which* maps; it can make do with a per-taskid count of failed
fetches (for the maps, as notified by reducers), which again leads to simpler code for a
first shot (sketched below).
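
A matching sketch of that flat, per-taskid bookkeeping on the JobTracker side; again, the
names and the re-execution threshold are assumptions for illustration:

{code:java}
// Hypothetical sketch: count failed-fetch notifications per map task id,
// with no record of which reducers complained.
import java.util.HashMap;
import java.util.Map;

public class MapOutputFailureCounts {
  private static final int FAILURES_BEFORE_REEXECUTION = 3; // assumed threshold

  // failed-fetch notifications per map task id
  private final Map<String, Integer> failedFetchesPerMap = new HashMap<String, Integer>();

  /** Called when any reducer reports it could not fetch this map's output. */
  public synchronized void noteFailedFetch(String mapTaskId) {
    Integer prev = failedFetchesPerMap.get(mapTaskId);
    int count = (prev == null) ? 1 : prev.intValue() + 1;
    failedFetchesPerMap.put(mapTaskId, count);
    if (count >= FAILURES_BEFORE_REEXECUTION) {
      failedFetchesPerMap.remove(mapTaskId);
      reexecuteMap(mapTaskId); // declare the map output lost, reschedule it elsewhere
    }
  }

  private void reexecuteMap(String mapTaskId) {
    // placeholder: the real JobTracker would mark the attempt failed and reschedule
  }
}
{code}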

bq. 2) consistent behavior - the JobTracker handles the reexecution of maps and it might
handle the reexecution of reduces as well.

I agree with the general sentiment, but given that this leads to more complex code, and
that the reducer already knows it has failed to fetch from 5 different maps, it doesn't
make sense for it to wait for the JobTracker to fail the task. Also, there is an existing
precedent for this behaviour in TaskTracker.fsError (the task is marked as 'failed' by the
TaskTracker itself on an FSError).

Thoughts?

> JobTracker should collect statistics of failed map output fetches, and take decisions
to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1158
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.2
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>
> The JobTracker should keep track (with feedback from Reducers) of how many times a
fetch for a particular map output failed. If this exceeds a certain threshold, then that map
should be declared lost and should be reexecuted elsewhere. Based on the number of such
complaints from Reducers, the JobTracker can blacklist the TaskTracker. This will make the
framework reliable - it will take care of (faulty) TaskTrackers that sometimes fail
to serve up map outputs (for which exceptions are not properly raised/handled, e.g., if
the exception/problem happens in the Jetty server).
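
For illustration, a hedged sketch of the blacklisting idea described above; the names and
the complaint threshold are assumptions, not the actual JobTracker API:

{code:java}
// Hypothetical sketch: count reducers' complaints per TaskTracker and stop
// scheduling on a tracker once it crosses an assumed threshold.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TaskTrackerBlacklist {
  private static final int COMPLAINTS_BEFORE_BLACKLIST = 4; // assumed threshold

  private final Map<String, Integer> complaintsPerTracker = new HashMap<String, Integer>();
  private final Set<String> blacklisted = new HashSet<String>();

  /** Called when a reducer complains about map outputs served by this tracker. */
  public synchronized void noteComplaint(String trackerName) {
    Integer prev = complaintsPerTracker.get(trackerName);
    int count = (prev == null) ? 1 : prev.intValue() + 1;
    complaintsPerTracker.put(trackerName, count);
    if (count >= COMPLAINTS_BEFORE_BLACKLIST) {
      blacklisted.add(trackerName); // schedule no new tasks here
    }
  }

  /** The scheduler would consult this before assigning tasks to a tracker. */
  public synchronized boolean isBlacklisted(String trackerName) {
    return blacklisted.contains(trackerName);
  }
}
{code}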

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

