hadoop-common-dev mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5985) A single slow (but not dead) map TaskTracker impedes MapReduce progress
Date Tue, 16 Jun 2009 22:10:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720376#action_12720376 ]

Aaron Kimball commented on HADOOP-5985:


Good point regarding where reducers pull from. Since multiple waves of reducers are supported,
it sounds reasonable to me.

So maybe the algorithm changes to this: we modify speculative execution in the following ways:

* If a map task is launched multiple times via spec. ex., all copies that succeed are eligible
to serve reducers concurrently, not just one such copy.
* The completion of a map task's processing does not cause speculative copies to be killed;
they also run to completion.
* Mapper TaskTrackers report back to the JT (during their heartbeat) the number of reduce
shards served / available. If any set of mappers is falling "too far behind" the other mappers
(e.g., most mappers have served 900/1000 shards, but a couple have only served 50/1000), then
we launch additional copies of those mapper tasks on other nodes (the second sketch below
illustrates this check).
* Reducers are advised of additional alternate locations from which to pull a particular map
shard. If multiple sources are available, a reducer randomly chooses one; if that source is too
slow, it tries a different copy (see the sketch right after this list).
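
Roughly what the reducer-side selection in the last bullet could look like (the Fetcher
interface, the method names, and the per-source deadline below are placeholders of mine, not
existing Hadoop code):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MapOutputSourceSelector {

  /** Minimal stand-in for whatever actually copies a map output shard. */
  public interface Fetcher {
    /** Returns true if the shard was copied before the deadline. */
    boolean fetch(String host, int shardId, long deadlineMillis);
  }

  /**
   * Try the known copies of a shard in random order; each source gets a
   * bounded amount of time before the reducer moves on to the next copy.
   */
  public static boolean fetchShard(List<String> hosts, int shardId, Fetcher fetcher) {
    List<String> candidates = new ArrayList<String>(hosts);
    Collections.shuffle(candidates);                  // random initial choice
    final long perSourceDeadlineMillis = 60 * 1000L;  // placeholder threshold

    for (String host : candidates) {
      // If this source is too slow (or fails), fall through to another copy.
      if (fetcher.fetch(host, shardId, perSourceDeadlineMillis)) {
        return true;
      }
    }
    return false;   // every known copy was slow/unavailable; report to the JT
  }
}
{code}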

This gets rid of the need for reducers to "vote" and influence the JT's behavior directly.
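
And a minimal sketch of the JT-side check from the heartbeat bullet, again with made-up class
and field names rather than real Hadoop APIs; the "half of the median" threshold is just a
placeholder for whatever policy we settle on:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class ShardServingLagCheck {

  /** Per-mapper serving progress as reported in a TaskTracker heartbeat. */
  public static class ServedCount {
    final int served;     // reduce shards already fetched from this mapper
    final int available;  // total reduce shards this mapper produced

    public ServedCount(int served, int available) {
      this.served = served;
      this.available = available;
    }

    double fraction() {
      return available == 0 ? 1.0 : (double) served / available;
    }
  }

  /**
   * Returns the map task attempts whose serving progress is "too far behind"
   * the median (e.g. most mappers at 900/1000 served, a couple at 50/1000).
   * The JT would launch an additional speculative copy of each returned task.
   */
  public static List<String> findLaggingMappers(Map<String, ServedCount> reports) {
    List<String> lagging = new ArrayList<String>();
    if (reports.isEmpty()) {
      return lagging;
    }

    // Compute the median served fraction across all reporting mappers.
    double[] fractions = new double[reports.size()];
    int i = 0;
    for (ServedCount c : reports.values()) {
      fractions[i++] = c.fraction();
    }
    Arrays.sort(fractions);
    double median = fractions[fractions.length / 2];

    // Placeholder policy: flag any mapper serving below half the median rate.
    for (Map.Entry<String, ServedCount> e : reports.entrySet()) {
      if (e.getValue().fraction() < 0.5 * median) {
        lagging.add(e.getKey());
      }
    }
    return lagging;
  }
}
{code}

The exact definition of "too far behind" obviously needs tuning, but the only inputs are the
served/available counts already proposed above.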

> A single slow (but not dead) map TaskTracker impedes MapReduce progress
> -----------------------------------------------------------------------
>                 Key: HADOOP-5985
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5985
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.18.3
>            Reporter: Aaron Kimball
> We see cases where there may be a large number of mapper nodes running many tasks (e.g.,
a thousand). The reducers will pull 980 of the map task intermediate files down, but will
be unable to retrieve the final intermediate shards from the last node. The TaskTracker on
that node returns data to reducers either slowly or not at all, but its heartbeat messages
make it back to the JobTracker -- so the JobTracker doesn't mark the tasks as failed. Manually
stopping the offending TaskTracker migrates the tasks to other nodes, where the shuffling
process finishes very quickly. Left on its own, the job can take hours to unjam itself.
> We need a mechanism for reducers to provide feedback to the JobTracker that one of the
mapper nodes should be regarded as lost.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
