hadoop-common-dev mailing list archives

From "Amar Kamat (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3333) job failing because of reassigning same tasktracker to failing tasks
Date Wed, 07 May 2008 05:33:56 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594784#action_12594784 ]

Amar Kamat commented on HADOOP-3333:
------------------------------------

bq. Why don't we do something like ....
Consider the following:
1) Suppose _num-machines_ < _num-trackers_:
||Node||Trackers||
|N1|T1, T2|
|N2|T3, T4|
2) Let's assume a corner case where the TIP fails on at least one tracker on each node.
Say TIP _t1_ fails on trackers T1 and T3.
3) As per the scheduling logic (see lines 18-19 below):
{code:title=JobInProgress.java|borderStyle=solid}
1   private synchronized TaskInProgress findTaskFromList(
2       Collection<TaskInProgress> tips, String taskTracker, boolean removeFailedTip) {
3     Iterator<TaskInProgress> iter = tips.iterator();
4     while (iter.hasNext()) {
5       TaskInProgress tip = iter.next();
6
7       // Select a tip if
8       //   1. runnable   : still needs to be run and is not completed
9       //   2. ~running   : no other node is running it
10      //   3. earlier attempt failed : has not failed on this host
11      //                               and has failed on all the other hosts
12      // A TIP is removed from the list if
13      // (1) this tip is scheduled
14      // (2) if the passed list is a level 0 (host) cache
15      // (3) when the TIP is non-schedulable (running, killed, complete)
16      if (tip.isRunnable() && !tip.isRunning()) {
17        // check if the tip has failed on this host
18        if (!tip.hasFailedOnMachine(taskTracker) ||
19            tip.getNumberOfFailedMachines() >= clusterSize) {
{code}
The TIP _t1_ has failed on 2 machines (hosts), but the cluster size (the number of trackers) is 4,
hence the check on line 19 can never pass and the job will be stuck. With this patch,
{{total-failures-per-tip}} is upper bounded by {{num-nodes}}, while the parameter {{cluster-size}}
is upper bounded by {{num-trackers}}; whenever a node runs more than one tracker, the failure count
can never reach the cluster size.
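To make the bound mismatch concrete, here is a minimal, self-contained Java sketch (not the actual
Hadoop code; the class name, variables, and values are made up for illustration). It replays the
scenario above and shows that the line-19 comparison can never fire, while a host-bounded
comparison can:
{code:title=BoundMismatchSketch.java|borderStyle=solid}
import java.util.Set;

// Illustrative sketch only -- not Hadoop code. Replays the scenario
// above: 2 nodes (N1, N2), 2 trackers per node, and a TIP t1 that
// has failed on one tracker of each node.
public class BoundMismatchSketch {
  public static void main(String[] args) {
    int numNodes = 2;     // physical hosts
    int clusterSize = 4;  // trackers, as counted on line 19

    // Failures are recorded per host, so this set can never grow
    // beyond numNodes.
    Set<String> failedHosts = Set.of("N1", "N2");

    // The line-19 check compares a host-bounded count against the
    // tracker count: 2 >= 4 is false, and stays false forever.
    System.out.println("line-19 check fires: "
        + (failedHosts.size() >= clusterSize));

    // Bounding the comparison by the host count instead makes the
    // escape hatch reachable once t1 has failed on every node:
    // 2 >= 2 is true.
    System.out.println("host-bounded check fires: "
        + (failedHosts.size() >= numNodes));
  }
}
{code}
Once the comparison is bounded by the host count, _t1_ becomes schedulable again after it has
failed on every node, instead of being skipped forever.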

> job failing because of reassigning same tasktracker to failing tasks
> --------------------------------------------------------------------
>
>                 Key: HADOOP-3333
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3333
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: HADOOP-3333_0_20080503.patch, HADOOP-3333_1_20080505.patch, HADOOP-3333_2_20080506.patch
>
>
> We have a long-running job in a 2nd attempt. The previous job failed, and the current
> job risks failing as well, because reduce tasks failing on marginal TaskTrackers are assigned
> repeatedly to the same TaskTrackers (probably because they hold the only available slots),
> eventually running out of attempts.
> Reduce tasks should be assigned to the same TaskTracker at most twice, or TaskTrackers
> need to get some better smarts to detect failing hardware.
> BTW, mapred.reduce.max.attempts=12, which is high, but does not help in this case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

