From: "Devaraj Das (JIRA)"
To: core-dev@hadoop.apache.org
Date: Wed, 26 Mar 2008 13:59:24 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-2175) Blacklisted hosts may not be able to serve map outputs
Message-ID: <350942909.1206565164634.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HADOOP-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582425#action_12582425 ]

Devaraj Das commented on HADOOP-2175:
-------------------------------------

I agree with Sameer.
We should probably step back and look at the model of killing a map based on fetch failure notifications. Today, maps are killed on a per-map basis: we wait for a majority of the reducers to tell the JobTracker that fetches are failing for a particular map. With the random ordering of map output fetches and the backoff per failed fetch, this might take a long time per map. This is what you observed, Runping, IMO.

Instead, we should probably include the name of the tracker on which the map ran in the logic for killing a map. If we get too many fetch failure notifications for maps that ran on a particular tracker, which we will detect much faster, we should probably kill those maps that ran on that tracker and for which we are seeing fetch failure notifications. That will take care of the case where only the Jetty server is faulty (the tracker is not blacklisted, since it could, and probably still can, execute tasks).

> Blacklisted hosts may not be able to serve map outputs
> ------------------------------------------------------
>
>                 Key: HADOOP-2175
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2175
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-2175-v1.1.patch, HADOOP-2175-v1.patch, HADOOP-2175-v2.patch, HADOOP-2175-v2.patch
>
>
> After a node fails 4 mappers (tasks), it is added to the blacklist, so it will no longer accept tasks.
> But it will continue to serve the map outputs of any mappers that ran successfully there.
> However, the node may not be able to serve the map outputs either.
> This will cause the reducers to mark the corresponding map outputs as coming from slow hosts,
> but continue to try to get the map outputs from that node.
> This may lead to waiting forever.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
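
For illustration only, the per-tracker idea in the comment above could be sketched roughly as follows. This is not code from any HADOOP-2175 patch: the class name TrackerFetchFailures, the method recordFailure, and the trackerThreshold parameter are all hypothetical, and the sketch assumes that a simple count of notifications per tracker is an acceptable trigger.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical sketch: aggregates fetch failure notifications per tracker
 * instead of per map. Not part of the actual JobTracker code.
 */
public class TrackerFetchFailures {
    // Assumed tunable; not a real Hadoop configuration setting.
    private final int trackerThreshold;

    // tracker name -> (map task id -> number of failure notifications)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();

    public TrackerFetchFailures(int trackerThreshold) {
        this.trackerThreshold = trackerThreshold;
    }

    /**
     * Records one fetch failure notification from a reducer.
     * Returns the ids of maps on that tracker that should be killed,
     * or an empty set while the tracker is still below the threshold.
     */
    public Set<String> recordFailure(String tracker, String mapId) {
        Map<String, Integer> perMap =
            failures.computeIfAbsent(tracker, t -> new HashMap<>());
        perMap.merge(mapId, 1, Integer::sum);

        // Sum notifications across all maps that ran on this tracker.
        int total = 0;
        for (int count : perMap.values()) {
            total += count;
        }
        if (total < trackerThreshold) {
            return Collections.emptySet();
        }

        // Kill only the maps for which failures were actually reported,
        // then reset the tracker's counters.
        Set<String> toKill = new TreeSet<>(perMap.keySet());
        failures.remove(tracker);
        return toKill;
    }
}
```

Because notifications for different maps on the same tracker accumulate in one counter, the threshold is reached sooner than if each map had to collect a majority of reducers on its own, which is the speed-up the comment argues for.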