hadoop-common-dev mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks
Date Fri, 02 May 2008 16:24:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593807#action_12593807 ]

Arun C Murthy commented on HADOOP-3333:
---------------------------------------

Here are the symptoms and possible remedies...

1. The same TIP FAILED on a previously 'lost' tasktracker.
2. The same TIP FAILED on the same machine, but the tasktracker had a different 'port',
i.e. it failed on x.y.z:30342 and again on x.y.z:34223.

So, a couple of thoughts:
1. We might have to rework the logic to schedule around task failures more broadly; currently
the JT only schedules around nodes where the task FAILED, whereas tasks on a lost tasktracker
are marked KILLED and hence not avoided.
2. We also have to track hostnames rather than 'trackernames', since a trackername includes
the host:port and changes when a tracker restarts on the same machine (see the sketch after
this list).
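
A rough sketch of what #1 and #2 together could look like in the TIP's bookkeeping; the class
shape and the names addFailedHost/hasFailedOnHost are illustrative, not the actual
TaskInProgress code:

    // Hypothetical sketch, not the real JobTracker/TaskInProgress code.
    import java.util.HashSet;
    import java.util.Set;

    class TaskInProgress {
      // Key the blacklist by host, not trackername: "x.y.z:30342" and
      // "x.y.z:34223" are the same machine with a restarted tracker.
      private final Set<String> machinesWhereFailed = new HashSet<String>();

      // Called for attempts that FAILED *and* for attempts KILLED
      // because their tracker was lost, so both feed the same list.
      void addFailedHost(String trackerName) {
        machinesWhereFailed.add(hostOf(trackerName));
      }

      // The scheduler would consult this before handing the TIP back
      // to a tracker on a host where it already died.
      boolean hasFailedOnHost(String trackerName) {
        return machinesWhereFailed.contains(hostOf(trackerName));
      }

      // Strip the ":port" suffix from "host:port".
      private static String hostOf(String trackerName) {
        int colon = trackerName.indexOf(':');
        return (colon == -1) ? trackerName : trackerName.substring(0, colon);
      }
    }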

Thoughts?

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Blocker
>
> We have a long-running job in its 2nd attempt. The previous job failed, and the current
> job risks failing as well, because reduce tasks that fail on marginal TaskTrackers are
> repeatedly assigned to the same TaskTrackers (probably because those are the only available
> slots), eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers
> need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.
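
For illustration, the property named above can be raised per job through the old mapred API;
JobConf is the real class, while AttemptLimitExample is just a hypothetical harness:

    import org.apache.hadoop.mapred.JobConf;

    public class AttemptLimitExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // 12 is the value from this report; the default is 4. As noted,
        // a high limit does not help when every retry lands on the same
        // marginal host.
        conf.setInt("mapred.reduce.max.attempts", 12);
        System.out.println(conf.getInt("mapred.reduce.max.attempts", 4));
      }
    }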

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

