hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2175) Blacklisted hosts may not be able to serve map outputs
Date Wed, 26 Mar 2008 20:59:24 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582425#action_12582425 ]

Devaraj Das commented on HADOOP-2175:

I agree with Sameer. We should probably step back and look at the model of killing a map based
on fetch failure notifications. Today, we kill maps based on fetch failure notifications
on a per-map basis, and we wait for a majority of the reducers to tell the JobTracker that
fetches are failing for a particular map.
With the random ordering of map output fetches and the backoff per failed fetch, this might
take a long time per map. This is what you observed, Runping, IMO.
Instead, we should probably include the name of the tracker on which the map ran in the logic
for killing a map: if we get too many fetch failure notifications for maps that ran on a
particular tracker, which we will detect much faster, we should probably kill the maps that ran
on that tracker and for which we are seeing fetch failure notifications. That will take care of
the case where only the Jetty server is faulty (the tracker is not blacklisted, as it could, and
probably still can, execute tasks).
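
To make the proposal concrete, here is a rough sketch of per-tracker accounting. It is not taken from any of the attached patches: the class name, the method name, and the fixed threshold are all hypothetical, and the real JobTracker logic would also need to handle reducer identity, timeouts, and task attempts.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: aggregate fetch failure notifications by the tracker
// that ran the map, instead of waiting for a per-map majority of reducers.
class TrackerFetchFailureSketch {
    // Assumed tunable: total notifications per tracker before we act.
    private final int trackerThreshold;

    // trackerName -> (mapId -> number of fetch failure notifications)
    private final Map<String, Map<String, Integer>> failuresByTracker = new HashMap<>();

    TrackerFetchFailureSketch(int trackerThreshold) {
        this.trackerThreshold = trackerThreshold;
    }

    // Records one notification. Returns the maps to kill and re-run, or an
    // empty set while the tracker is still below the threshold.
    Set<String> recordFetchFailure(String trackerName, String mapId) {
        Map<String, Integer> perMap =
            failuresByTracker.computeIfAbsent(trackerName, k -> new HashMap<>());
        perMap.merge(mapId, 1, Integer::sum);

        int total = 0;
        for (int count : perMap.values()) {
            total += count;
        }
        if (total < trackerThreshold) {
            return Collections.emptySet();
        }

        // Kill only the maps on this tracker for which failures were reported.
        Set<String> toKill = new TreeSet<>(perMap.keySet());
        failuresByTracker.remove(trackerName);
        return toKill;
    }
}
```

Because notifications for different maps on the same tracker add up, a tracker with a faulty Jetty server would cross the threshold well before any single map collects a majority of reducer complaints.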

> Blacklisted hosts may not be able to serve map outputs
> ------------------------------------------------------
>                 Key: HADOOP-2175
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2175
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>             Fix For: 0.18.0
>         Attachments: HADOOP-2175-v1.1.patch, HADOOP-2175-v1.patch, HADOOP-2175-v2.patch,
> After a node fails 4 mappers (tasks), it is added to the blacklist, so it will no longer
> accept tasks.
> But it will continue to serve the map outputs of any mappers that ran successfully there.
> However, the node may not be able to serve the map outputs either.
> This will cause the reducers to mark the corresponding map outputs as from slow hosts,
> but continue to try to get the map outputs from that node.
> This may lead to waiting forever.

