hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks
Date Fri, 02 May 2008 18:00:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593832#action_12593832

Amar Kamat commented on HADOOP-3333:

Yes, I also saw these things in the logs.
bq. However a lost tasktracker leads to tasks being marked KILLED.
Since this is different from FAILED, we should probably keep it as it is. A tasktracker can be
lost because of transient issues, so blacklisting such trackers for the TIPs that are local to
them might not be good. But a tracker getting lost too frequently can be considered for blacklisting.
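The distinction above (don't blacklist on a single loss, do blacklist on repeated losses) can be sketched roughly as follows. This is a hypothetical illustration, not actual Hadoop code; the class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a single lost tracker may just be a transient
// network issue, so blacklist a host only after it is lost repeatedly.
public class TrackerLossMonitor {
    private final int maxLosses;                        // losses tolerated before blacklisting
    private final Map<String, Integer> lossCounts = new HashMap<>();

    public TrackerLossMonitor(int maxLosses) {
        this.maxLosses = maxLosses;
    }

    // Record that the tracker was declared lost (e.g. missed heartbeats).
    public void recordLost(String trackerHost) {
        lossCounts.merge(trackerHost, 1, Integer::sum);
    }

    // Blacklisted only after repeated losses, never on the first one.
    public boolean isBlacklisted(String trackerHost) {
        return lossCounts.getOrDefault(trackerHost, 0) >= maxLosses;
    }
}
```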

- Can we do something about the machines where the trackers (on different ports) are failing
or getting lost too often?

bq. We also have to track hostnames rather than 'trackernames', trackername includes the host:port...
+1. What can be done is to schedule the TIP to trackers on different machines in the first
pass, and only then consider scheduling it to TaskTrackers sharing a machine.
One thing I felt was that the system was loaded; that could be a possible reason for the job failures.
I wonder under what conditions running multiple TaskTrackers on one machine is better than running a single tracker.
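The host-first scheduling idea could look something like the sketch below: key attempts by hostname rather than the host:port trackername, try every distinct machine once, and only then fall back to a tracker on an already-tried machine. All names here are illustrative assumptions, not Hadoop APIs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: spread a TIP across distinct hosts before reusing
// a machine that runs several trackers on different ports.
public class HostAwareScheduler {
    // tipId -> hosts this TIP has already been attempted on
    private final Map<String, Set<String>> attemptedHosts = new HashMap<>();

    // trackername is host:port; keep only the host part
    private static String hostOf(String trackerName) {
        int colon = trackerName.indexOf(':');
        return colon < 0 ? trackerName : trackerName.substring(0, colon);
    }

    // Prefer a tracker on a machine this TIP has not been tried on;
    // otherwise fall back to the first candidate (a shared machine).
    public String pickTracker(String tipId, List<String> candidates) {
        Set<String> tried = attemptedHosts.computeIfAbsent(tipId, k -> new HashSet<>());
        String fallback = null;
        for (String tracker : candidates) {
            if (fallback == null) fallback = tracker;
            if (!tried.contains(hostOf(tracker))) {
                tried.add(hostOf(tracker));
                return tracker;
            }
        }
        if (fallback != null) tried.add(hostOf(fallback));
        return fallback;
    }
}
```

With two trackers on machine m1 and one on m2, the second attempt of a TIP would go to m2 before the second tracker on m1 is considered.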

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Blocker
> We have a long-running job in its 2nd attempt. The previous job failed, and the current
job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned
repeatedly to the same TaskTrackers (probably because they offer the only available slots), eventually
running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers
need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.
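The reporter's at-most-twice rule could be sketched as a small guard consulted before assignment, so that a slot on an over-used tracker stays idle instead of burning another of the task's mapred.reduce.max.attempts. Again a hypothetical illustration with made-up names, not Hadoop's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: never assign the same task to one TaskTracker
// more than twice, even if it holds the only free slot.
public class PerTrackerAttemptLimit {
    private static final int MAX_ATTEMPTS_PER_TRACKER = 2;
    // trackerName -> number of attempts of this task run there
    private final Map<String, Integer> attempts = new HashMap<>();

    // Checked before assignment; false means look for another tracker.
    public boolean mayAssign(String trackerName) {
        return attempts.getOrDefault(trackerName, 0) < MAX_ATTEMPTS_PER_TRACKER;
    }

    public void recordAssignment(String trackerName) {
        attempts.merge(trackerName, 1, Integer::sum);
    }
}
```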

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
