hama-dev mailing list archives

From "MaoYuan Xian (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HAMA-793) Job failed to recovery when more than one tasks fail at the same time even when fault tolerant enabled.
Date Mon, 12 Aug 2013 04:34:47 GMT
MaoYuan Xian created HAMA-793:
---------------------------------

             Summary: Job failed to recovery when more than one tasks fail at the same time even when fault tolerant enabled.
                 Key: HAMA-793
                 URL: https://issues.apache.org/jira/browse/HAMA-793
             Project: Hama
          Issue Type: Bug
          Components: bsp core
    Affects Versions: 0.6.2
            Reporter: MaoYuan Xian
            Priority: Minor


I found that fault tolerance does not work when more than one task fails at the same time
while a job is running.

The reason is that, in the schedule method of SimpleTaskScheduler, when jobResult is found to
be false, job.kill is called, which then triggers JobInProgress.garbageCollection. The job
directory is cleaned up, and this makes the job recovery fail.

I made the following modification in SimpleTaskScheduler, which avoids the problem, but I am
not sure whether it is a comprehensive solution:
{code}
-      if (Boolean.FALSE.equals(jobResult)) {
+      if ((Boolean.FALSE.equals(jobResult))
+          && (job.getStatus().getRunState() != JobStatus.RECOVERING)) {
{code}
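
To make the effect of the changed condition easier to see, here is a minimal, self-contained
sketch that compares the two checks in isolation. It is not the actual Hama code: the
RecoveryGuardSketch class, its RunState enum, and the two shouldKill methods are hypothetical
stand-ins for SimpleTaskScheduler, the JobStatus run states, and the condition guarding
job.kill.

```java
// Hypothetical, simplified model of the kill condition; not the real Hama classes.
class RecoveryGuardSketch {

    // Stand-in for the JobStatus run states; only the values needed here.
    enum RunState { RUNNING, RECOVERING, FAILED }

    // Original condition: kill the job whenever the job result is false,
    // regardless of whether a recovery is already in progress.
    static boolean shouldKillOriginal(Boolean jobResult, RunState state) {
        return Boolean.FALSE.equals(jobResult);
    }

    // Proposed condition: skip the kill (and therefore the cleanup of the
    // job directory) while the job is still recovering.
    static boolean shouldKillPatched(Boolean jobResult, RunState state) {
        return Boolean.FALSE.equals(jobResult) && state != RunState.RECOVERING;
    }

    public static void main(String[] args) {
        // Print both decisions for a false job result in each run state.
        for (RunState state : RunState.values()) {
            System.out.println(state
                + ": original=" + shouldKillOriginal(Boolean.FALSE, state)
                + ", patched=" + shouldKillPatched(Boolean.FALSE, state));
        }
    }
}
```

In this model the two conditions differ only for a job in the RECOVERING state, which is the
case where killing the job would remove the directory the recovery depends on.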

