hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-181) task trackers should not restart for having a late heartbeat
Date Sat, 12 Aug 2006 06:45:15 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-181?page=comments#action_12427680 ] 
            
Owen O'Malley commented on HADOOP-181:
--------------------------------------

I agree that the original desire for this patch was born of the TaskTracker timeouts that
shouldn't happen. Fixing those problems (and we _have_ fixed most of them over the last 4
months) should take precedence. That said, I think in the long term we do want
something like this patch. If a switch goes down for 15 minutes and then comes back up, it
does not make sense to reshuffle, resort, and rerun a reduce that takes hours to run.

All map/reduce applications, even those with speculative execution turned off, must permit
redundant copies of their tasks for precisely this reason. In this case, the JobTracker has
decided a given task is dead, but hasn't been able to tell the responsible TaskTracker yet.
Therefore it schedules another instance of the failed task on a different node, and the two
instances are going to run in parallel for a while.
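
To make the scenario above concrete, here is a minimal sketch of the bookkeeping, not the actual JobTracker code. The class name, the method names, and the 10-minute expiry value are all assumptions made for illustration. It shows that once a tracker has been silent past the expiry interval, a second copy of its task is scheduled while the first copy is still counted as running, because the original tracker has not been told anything yet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of lost-tracker handling; not Hadoop's implementation. */
class LateHeartbeatSketch {
    /** Assumed expiry interval; the real value is not stated in this thread. */
    static final long EXPIRY_MS = 10 * 60 * 1000L;

    // Tracker name -> timestamp (ms) of its most recent heartbeat.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    // Task id -> trackers believed to be running a copy of that task.
    private final Map<String, List<String>> copies = new HashMap<>();

    /** Records a heartbeat from a tracker. */
    void heartbeat(String tracker, long now) {
        lastHeartbeat.put(tracker, now);
    }

    /** Records that a copy of the task was handed to the tracker. */
    void assign(String task, String tracker) {
        copies.computeIfAbsent(task, k -> new ArrayList<>()).add(tracker);
    }

    /**
     * For every task whose tracker has been silent longer than EXPIRY_MS,
     * schedules one extra copy on spareTracker. The old copy is NOT removed:
     * the silent tracker may still be running it, so both run in parallel.
     * Returns the ids of the tasks that were rescheduled.
     */
    List<String> expire(long now, String spareTracker) {
        List<String> rescheduled = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : copies.entrySet()) {
            List<String> trackers = e.getValue();
            // Iterate over a snapshot because the list may grow below.
            for (String tracker : new ArrayList<>(trackers)) {
                Long last = lastHeartbeat.get(tracker);
                boolean silent = last != null && now - last > EXPIRY_MS;
                if (silent && !trackers.contains(spareTracker)) {
                    trackers.add(spareTracker);
                    rescheduled.add(e.getKey());
                }
            }
        }
        return rescheduled;
    }

    /** Number of copies of the task that may be running right now. */
    int runningCopies(String task) {
        List<String> trackers = copies.get(task);
        return trackers == null ? 0 : trackers.size();
    }
}
```

In this sketch, calling expire after the interval has passed leaves runningCopies at 2 for the affected task, which is the redundancy that applications have to tolerate regardless of whether speculative execution is enabled.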

I guess for now, let's sit on this patch and contemplate what the model should be for dealing
with communication problems. We should also monitor this in real use and see how often task
trackers are being lost and probably put some effort into determining at least whether it is
the job tracker or the task tracker that is the cause of the delay.

> task trackers should not restart for having a late heartbeat
> ------------------------------------------------------------
>
>                 Key: HADOOP-181
>                 URL: http://issues.apache.org/jira/browse/HADOOP-181
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.6.0
>
>         Attachments: lost-heartbeat.patch
>
>
> TaskTrackers should not close and restart themselves for having a late heartbeat. The
> JobTracker should just accept their current status.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
