hadoop-common-dev mailing list archives

From "Nathan Marz (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-5547) One bad node can cause whole job to fail
Date Fri, 20 Mar 2009 17:20:50 GMT
One bad node can cause whole job to fail

                 Key: HADOOP-5547
                 URL: https://issues.apache.org/jira/browse/HADOOP-5547
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Nathan Marz

This happened on the 0.19.2 branch. One of the nodes in our cluster was having disk problems,
and every task run on it was failing. In general the node would get blacklisted and jobs on the
cluster would run fine. For one job, however, the "Job setup" task was scheduled on this bad node.
When the task failed, it was retried on the same bad node 3 more times, and then the job
failed. Hadoop should be able to handle this situation better.
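
To illustrate the kind of behavior being asked for, here is a minimal sketch of a retry loop that never reschedules a task on a node where that same task has already failed. This is not Hadoop's scheduler code: the names pick_node, run_with_retries, and run_attempt are hypothetical, and the limit of 4 attempts is an assumption taken from the one initial attempt plus 3 retries described above.

```python
from typing import Callable, List, Optional, Set


def pick_node(candidates: List[str], failed_on: Set[str]) -> Optional[str]:
    """Return the first candidate node this task has not yet failed on."""
    for node in candidates:
        if node not in failed_on:
            return node
    return None


def run_with_retries(
    task: str,
    nodes: List[str],
    run_attempt: Callable[[str, str], bool],
    max_attempts: int = 4,  # assumed: 1 initial attempt + 3 retries
) -> Optional[str]:
    """Try a task on up to max_attempts distinct nodes.

    run_attempt is a hypothetical callback that runs the task on a node
    and returns True on success. Returns the node that succeeded, or
    None if every attempt failed or no untried node was left.
    """
    failed_on: Set[str] = set()
    for _ in range(max_attempts):
        node = pick_node(nodes, failed_on)
        if node is None:
            break  # every known node has already failed this task
        if run_attempt(task, node):
            return node
        failed_on.add(node)  # remember the failure so this node is skipped next time
    return None
```

With one bad node and one healthy node, the setup task fails once on the bad node and then succeeds on the healthy one, so a single bad node does not exhaust all of the attempts.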

