Date: Thu, 11 Aug 2011 18:10:32 +0000 (UTC)
From: "Eli Collins (JIRA)"
To: common-issues@hadoop.apache.org
Reply-To: common-issues@hadoop.apache.org
Message-ID: <1534874403.29029.1313086232037.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Resolved] (HADOOP-5547) One bad node can cause whole job to fail

     [ https://issues.apache.org/jira/browse/HADOOP-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Collins resolved HADOOP-5547.
---------------------------------

    Resolution: Won't Fix

Out of date

> One bad node can cause whole job to fail
> ----------------------------------------
>
>                 Key: HADOOP-5547
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5547
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Nathan Marz
>
> This happened on the 0.19.2 branch. One of the nodes in our cluster was having disk problems and every task run on it was failing. In general the node would get blacklisted and jobs would run fine on it. However, for one job, the job ran the "Job setup" task on this bad node. When the task failed, the task was then retried on the same bad node 3 more times until the job failed. Hadoop should be able to handle this situation better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira