From: "Amareshwari Sriramadasu (JIRA)"
To: core-dev@hadoop.apache.org
Date: Sun, 22 Mar 2009 22:01:51 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-5547) One bad node can cause whole job to fail

[ https://issues.apache.org/jira/browse/HADOOP-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688212#action_12688212 ]
Amareshwari Sriramadasu commented on HADOOP-5547:
-------------------------------------------------

This should not happen unless there are no other nodes in the cluster to run the task. Did you have other nodes with free slots on your cluster?

> One bad node can cause whole job to fail
> ----------------------------------------
>
>                 Key: HADOOP-5547
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5547
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Nathan Marz
>
> This happened on the 0.19.2 branch. One of the nodes in our cluster was having disk problems and every task run on it was failing. In general the node would get blacklisted and jobs would run fine on it. However, for one job, the job ran the "Job setup" task on this bad node. When the task failed, the task was then retried on the same bad node 3 more times until the job failed. Hadoop should be able to handle this situation better.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.