Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <248423173.1210027075597.JavaMail.jira@brutus>
Date: Mon, 5 May 2008 15:37:55 -0700 (PDT)
From: "Arun C Murthy (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-3333) job failing because of reassigning
 same tasktracker to failing tasks
In-Reply-To: <245009369.1209662815864.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-3333:
----------------------------------

    Attachment: HADOOP-3333_1_20080505.patch

Updated patch: I had to fix JobTracker.ExpireTrackers.run to call JobTracker.lostTaskTracker first, before the knowledge of the tracker's existence is nuked in JobTracker.updateTaskTrackerStatus.
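
To illustrate why the ordering matters, here is a minimal sketch and not the patch itself. The class and its methods (ExpireOrderSketch, register, lostTracker, forgetTracker, expire) are hypothetical stand-ins for the JobTracker's bookkeeping; the real code is considerably more involved. The point is only that the "lost tracker" handling needs the tracker's record to still be present in order to find the tasks it must reschedule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for per-tracker bookkeeping; not Hadoop code.
class ExpireOrderSketch {
    // tracker name -> tasks currently assigned to that tracker
    private final Map<String, List<String>> tasksByTracker = new HashMap<>();
    // tasks handed back to the scheduler after a tracker was lost
    private final List<String> requeued = new ArrayList<>();

    void register(String tracker, String... tasks) {
        tasksByTracker.put(tracker, new ArrayList<>(Arrays.asList(tasks)));
    }

    // Plays the role of "lost tracker" handling: it can only requeue
    // tasks while the tracker's record still exists.
    void lostTracker(String tracker) {
        List<String> tasks = tasksByTracker.get(tracker);
        if (tasks != null) {
            requeued.addAll(tasks);
        }
    }

    // Plays the role of discarding all knowledge of the tracker.
    void forgetTracker(String tracker) {
        tasksByTracker.remove(tracker);
    }

    // Correct order: handle the loss first, then forget the tracker.
    void expire(String tracker) {
        lostTracker(tracker);
        forgetTracker(tracker);
    }

    List<String> requeued() {
        return requeued;
    }

    public static void main(String[] args) {
        ExpireOrderSketch s = new ExpireOrderSketch();
        s.register("tt1", "r_0", "r_1");
        s.expire("tt1");
        System.out.println(s.requeued());
    }
}
```

If forgetTracker ran before lostTracker in this sketch, the lookup would find nothing and the tracker's tasks would never be requeued.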

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch
>
>
> We have a long-running job in its 2nd attempt. The previous job failed and the current job risks failing as well, because reduce tasks failing on marginal TaskTrackers are repeatedly assigned to the same TaskTrackers (probably because they have the only available slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers need better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.
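
The "at most twice" suggestion in the description above could be sketched roughly as follows. This is an illustration of the reporter's proposed policy only, not how the JobTracker or the attached patches work; TrackerAttemptLimiter, recordAttempt, and mayAssign are hypothetical names introduced for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper capping how often one task may run on one tracker.
class TrackerAttemptLimiter {
    private final int maxAttemptsPerTracker;
    // task id -> (tracker name -> number of attempts made there)
    private final Map<String, Map<String, Integer>> attempts = new HashMap<>();

    TrackerAttemptLimiter(int maxAttemptsPerTracker) {
        this.maxAttemptsPerTracker = maxAttemptsPerTracker;
    }

    // Record that an attempt of the task was assigned to the tracker.
    void recordAttempt(String taskId, String tracker) {
        attempts.computeIfAbsent(taskId, k -> new HashMap<>())
                .merge(tracker, 1, Integer::sum);
    }

    // True while the task has used fewer than the allowed attempts there.
    boolean mayAssign(String taskId, String tracker) {
        int used = attempts
                .getOrDefault(taskId, Collections.<String, Integer>emptyMap())
                .getOrDefault(tracker, 0);
        return used < maxAttemptsPerTracker;
    }

    public static void main(String[] args) {
        TrackerAttemptLimiter limiter = new TrackerAttemptLimiter(2);
        limiter.recordAttempt("r_0", "tt1");
        limiter.recordAttempt("r_0", "tt1");
        System.out.println(limiter.mayAssign("r_0", "tt1"));
        System.out.println(limiter.mayAssign("r_0", "tt2"));
    }
}
```

With a cap like this, a high mapred.reduce.max.attempts would be spread across different trackers instead of being used up on one marginal machine.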

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.