Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <350942909.1206565164634.JavaMail.jira@brutus>
Date: Wed, 26 Mar 2008 13:59:24 -0700 (PDT)
From: "Devaraj Das (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-2175) Blacklisted hosts may not be able
 to serve map outputs
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582425#action_12582425 ] 

Devaraj Das commented on HADOOP-2175:
-------------------------------------

I agree with Sameer. We should probably step back and look at the model of killing a map based on fetch failure notifications. Today, we kill maps on a per-map basis: we wait for a majority of the reducers to tell the JobTracker that the fetch is failing for a particular map.
With the random ordering of map output fetches and the backoff per failed fetch, this might take a long time per map. This is what you observed, Runping, IMO.
Instead, we should probably include the name of the tracker on which the map ran in the logic for killing a map. If we get too many fetch failure notifications for maps that ran on a particular tracker, which we will detect much faster, we should probably kill the maps that ran on that tracker and for which we are seeing fetch failure notifications. That will take care of the case where only Jetty is faulty (the tracker is not blacklisted because it could, and probably still can, execute tasks).
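
To make the comparison concrete, here is a rough sketch of the two rules side by side. It is only an illustration, not code from the JobTracker or from any of the attached patches: the class FetchFailureTracker, the method recordFailure, and the trackerThreshold parameter are all hypothetical names, and counting each (map, reducer) pair only once is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names exist in Hadoop.
class FetchFailureTracker {
    private final int numReducers;      // number of reducers in the job
    private final int trackerThreshold; // assumed cutoff for notifications per tracker

    // map id -> reducers that reported a failed fetch for that map
    private final Map<String, Set<String>> reportersByMap = new HashMap<>();
    // tracker name -> maps that ran there and have at least one failure report
    private final Map<String, Set<String>> failedMapsByTracker = new HashMap<>();
    // tracker name -> number of distinct (map, reducer) failure reports
    private final Map<String, Integer> notificationsByTracker = new HashMap<>();

    FetchFailureTracker(int numReducers, int trackerThreshold) {
        this.numReducers = numReducers;
        this.trackerThreshold = trackerThreshold;
    }

    /**
     * Records one fetch failure notification and returns the ids of the maps
     * that should be killed as a result (possibly empty).
     */
    Set<String> recordFailure(String mapId, String tracker, String reducerId) {
        Set<String> toKill = new HashSet<>();
        Set<String> reporters =
            reportersByMap.computeIfAbsent(mapId, k -> new HashSet<>());
        if (!reporters.add(reducerId)) {
            return toKill; // assumption: repeated reports from one reducer count once
        }
        failedMapsByTracker.computeIfAbsent(tracker, k -> new HashSet<>()).add(mapId);
        int trackerCount = notificationsByTracker.merge(tracker, 1, Integer::sum);

        // Current model: a majority of the reducers complained about this map.
        if (reporters.size() * 2 > numReducers) {
            toKill.add(mapId);
        }
        // Proposed model: too many complaints about maps on this tracker, so
        // kill every map on it for which a failure has been reported.
        if (trackerCount >= trackerThreshold) {
            toKill.addAll(failedMapsByTracker.get(tracker));
        }
        return toKill;
    }
}
```

With 10 reducers and a per-tracker threshold of 3, three different reducers each failing to fetch a different map from the same tracker is enough to flag all three maps, whereas the per-map majority rule alone would need 6 reports for each map.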

> Blacklisted hosts may not be able to serve map outputs
> ------------------------------------------------------
>
>                 Key: HADOOP-2175
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2175
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-2175-v1.1.patch, HADOOP-2175-v1.patch, HADOOP-2175-v2.patch, HADOOP-2175-v2.patch
>
>
> After a node fails 4 mappers (tasks), it is added to the blacklist, so it will no longer accept tasks.
> But it will continue to serve the map outputs of any mappers that ran successfully there.
> However, the node may not be able to serve the map outputs either.
> This will cause the reducers to mark the corresponding map outputs as coming from slow hosts,
> but continue to try to get the map outputs from that node.
> This may lead to waiting forever.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.