hadoop-common-dev mailing list archives

From "Christian Kunz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks
Date Fri, 02 May 2008 14:54:55 GMT

https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593793#action_12593793
Christian Kunz commented on HADOOP-3333:

The number of blacklisted TaskTrackers is low (less than 1%), because we have a high threshold
(100 failures) for TaskTrackers to be declared blacklisted. In the past, with the default
setting, we lost too many TaskTrackers too fast even when there were no hardware issues --
but this might have been fixed, and we might want to change it back to a more reasonable
value. On the other hand, we had no problems using the high value until 0.16.3.
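For reference, a per-job threshold of this sort lives in the job configuration. The snippet below is only a sketch using the 0.16-era property name `mapred.max.tracker.failures` (the per-job failure count after which a tracker is blacklisted for that job); verify the exact name and default against your Hadoop version:

```xml
<!-- hadoop-site.xml / job configuration; property name as in 0.16-era
     Hadoop -- verify against your version before relying on it -->
<property>
  <name>mapred.max.tracker.failures</name>
  <!-- deliberately high threshold, as described above -->
  <value>100</value>
</property>
```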

By a 'marginal' TaskTracker I mean a TaskTracker running on a node with hardware failures
that still runs most short tasks successfully, but has a higher chance of failing long-running
tasks (e.g. reduce tasks shuffling the map outputs from many waves of short map tasks).
Concerning the same tasks being repeatedly assigned to the same TaskTracker, I can point you
offline to a running job exhibiting the problem.

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Priority: Blocker
> We have a long-running job in a 2nd atttempt. The previous job failed, and the current
job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned
repeatedly to the same TaskTrackers (probably because those are the only available slots), eventually
running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers
need to get better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.
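The "at most twice per tracker" idea above could be sketched as bookkeeping on the scheduler side. This is a hypothetical illustration, not Hadoop's actual JobTracker code: the class name, method names, and the limit of 2 are all assumptions drawn from the suggestion in the report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Hadoop's actual scheduler): remember which
// TaskTrackers each task has failed on, and refuse to hand a task back
// to a tracker it has already failed on MAX_PER_TRACKER times.
public class TrackerAwareAssignment {
    // "at most twice", as the report suggests
    static final int MAX_PER_TRACKER = 2;

    // taskId -> (trackerName -> failure count on that tracker)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();

    // record one failed attempt of taskId on the given tracker
    public void recordFailure(String taskId, String tracker) {
        failures.computeIfAbsent(taskId, t -> new HashMap<>())
                .merge(tracker, 1, Integer::sum);
    }

    // true if this tracker may still be assigned this task
    public boolean canAssign(String taskId, String tracker) {
        return failures.getOrDefault(taskId, new HashMap<>())
                       .getOrDefault(tracker, 0) < MAX_PER_TRACKER;
    }

    public static void main(String[] args) {
        TrackerAwareAssignment a = new TrackerAwareAssignment();
        a.recordFailure("reduce_0001", "marginal-node");
        System.out.println(a.canAssign("reduce_0001", "marginal-node"));
        a.recordFailure("reduce_0001", "marginal-node");
        System.out.println(a.canAssign("reduce_0001", "marginal-node"));
        System.out.println(a.canAssign("reduce_0001", "healthy-node"));
    }
}
```

Note this caps only per-tracker retries; the overall retry budget would still be governed by mapred.reduce.max.attempts.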

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
