Message-ID: <1975317553.1212802006908.JavaMail.jira@brutus>
In-Reply-To: <245009369.1209662815864.JavaMail.jira@brutus>
Date: Fri, 6 Jun 2008 18:26:46 -0700 (PDT)
From: "Mukund Madhugiri (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks

     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mukund Madhugiri updated HADOOP-3333:
-------------------------------------

    Fix Version/s:     (was: 0.18.0)

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job was failing, and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned repeatedly to the same TaskTrackers (probably because it is the only available slot), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need to get smarter about detecting failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.