hadoop-common-dev mailing list archives

From "Yoram Arnon (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-654) jobs fail with some hardware/system failures on a small number of nodes
Date Mon, 30 Oct 2006 18:48:16 GMT
jobs fail with some hardware/system failures on a small number of nodes

                 Key: HADOOP-654
                 URL: http://issues.apache.org/jira/browse/HADOOP-654
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.7.2
            Reporter: Yoram Arnon
         Assigned To: Owen O'Malley
            Priority: Minor

Occasionally, such as when the OS runs out of some resource, a node fails only partly. The node
is up, and the task tracker is running and sending heartbeats, but every task fails, for example
because the tasktracker cannot fork task processes.
In these cases, that task tracker keeps getting assigned tasks to execute, and they all fail.
With even a couple of nodes in that state, jobs start failing badly.

The job tracker should avoid assigning tasks to tasktrackers that are misbehaving.

Simple approach: avoid tasktrackers that report many more failures than average (say 3X),
using only the info already sent by the tasktracker (TT).
Better but harder: track TT failures over time and:
 1. avoid those that exhibit a high failure *rate*
 2. tell them to shut down
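
For illustration, the simple approach might be sketched roughly as follows. This is not Hadoop code: the class TrackerFailureFilter, the method trackersToAvoid, and the map of per-tracker failure counts are all hypothetical names, and the factor of 3 is just the "say 3X" figure from above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: flags tasktrackers whose reported
// failure count is far above the cluster-wide average.
class TrackerFailureFilter {
    // Assumption: the "say 3X" multiplier from the proposal.
    static final double FAILURE_FACTOR = 3.0;

    // failuresByTracker maps a tracker name to the number of task failures it
    // has reported. Returns the trackers that should not be given new tasks.
    static List<String> trackersToAvoid(Map<String, Integer> failuresByTracker) {
        List<String> avoid = new ArrayList<>();
        if (failuresByTracker.isEmpty()) {
            return avoid;
        }
        long total = 0;
        for (int failures : failuresByTracker.values()) {
            total += failures;
        }
        double average = (double) total / failuresByTracker.size();
        for (Map.Entry<String, Integer> entry : failuresByTracker.entrySet()) {
            // Strictly greater than 3x the average; with an average of zero
            // nothing is flagged.
            if (entry.getValue() > FAILURE_FACTOR * average) {
                avoid.add(entry.getKey());
            }
        }
        return avoid;
    }

    public static void main(String[] args) {
        Map<String, Integer> failures = new LinkedHashMap<>();
        for (int i = 1; i <= 9; i++) {
            failures.put("tt" + i, 1);   // healthy trackers: one failure each
        }
        failures.put("tt10", 40);        // partly failed node
        System.out.println(trackersToAvoid(failures));
    }
}
```

Note that the average here includes the misbehaving trackers themselves, so on a very small cluster a single bad node can raise the threshold enough to escape detection; a median or a per-job comparison could be substituted if that matters.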
