hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-181) task trackers should not restart for having a late heartbeat
Date Mon, 14 Aug 2006 19:53:16 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-181?page=comments#action_12427954 ] 
Doug Cutting commented on HADOOP-181:

> If a switch goes down for 15 minutes [ ... ]

We'll currently have a lot of other problems if a switch goes down for 15 minutes.  All of
the other tasks will probably fail because DFS will no longer have complete copies of files.

Is a switch going down for 15 minutes really a case we need to optimize for?  Is it acceptable
to lose a few hours' work on its hosts when a switch dies?  When a switch fails, how long does
it take to replace?

We can answer some of this fairly precisely.  What is the MTBF for switches?  How many switches
would we have in a 10k-node system?
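
As a rough sketch of the arithmetic behind those two questions, the snippet below estimates how often some switch in the cluster would fail. Every input is a hypothetical placeholder, not a measured figure: the nodes-per-switch count and the MTBF value are assumptions chosen only to show the calculation, and the model assumes a single layer of switches with independent failures at a constant rate.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def switch_count(nodes, nodes_per_switch):
    """Switches needed for one layer of switching (no aggregation layer)."""
    return math.ceil(nodes / nodes_per_switch)


def expected_failures_per_year(switches, mtbf_hours):
    """Expected switch failures per year across the cluster.

    Assumes each switch fails independently at a constant rate of 1/MTBF.
    """
    return switches * HOURS_PER_YEAR / mtbf_hours


def mean_hours_between_failures(switches, mtbf_hours):
    """Mean time between failures of *any* switch in the cluster."""
    return mtbf_hours / switches


# Hypothetical inputs, for illustration only.
nodes = 10_000            # the 10k-node system from the question above
nodes_per_switch = 40     # assumed; depends on the actual rack layout
mtbf_hours = 200_000      # assumed; substitute the vendor's MTBF figure

switches = switch_count(nodes, nodes_per_switch)
print("switches:", switches)
print("expected failures/year:", expected_failures_per_year(switches, mtbf_hours))
print("hours between failures:", mean_hours_between_failures(switches, mtbf_hours))
```

With real MTBF and replacement-time numbers plugged in, the same calculation indicates how many hours of work per year would be at risk from switch failures.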

> task trackers should not restart for having a late heartbeat
> ------------------------------------------------------------
>                 Key: HADOOP-181
>                 URL: http://issues.apache.org/jira/browse/HADOOP-181
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.6.0
>         Attachments: lost-heartbeat.patch
> TaskTrackers should not close and restart themselves for having a late heartbeat. The
> JobTracker should just accept their current status.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

