reef-dev mailing list archives

From "Andrey (JIRA)" <j...@apache.org>
Subject [jira] [Created] (REEF-1511) timeout for Task Shutdown during IMRU recovery
Date Tue, 09 Aug 2016 17:13:20 GMT
Andrey created REEF-1511:
----------------------------

             Summary: timeout for Task Shutdown during IMRU recovery
                 Key: REEF-1511
                 URL: https://issues.apache.org/jira/browse/REEF-1511
             Project: REEF
          Issue Type: Improvement
          Components: IMRU
            Reporter: Andrey


This is related to the fault tolerance implementation in PR-1251.
Currently the recovery logic in the IMRU driver waits for all tasks to reach a final state (failed
or completed) before restarting the job (see the AreAllTasksInFinalState() check in the TryRecovery() method).
We've seen the driver hang for a long time waiting for the last few tasks to finalize.
Aborting tasks should be quick, so there is a bug there, but we can also add logic to the driver
so that it does not wait for all tasks to complete.
For instance: if 5% of tasks have not reported a final state within the expected period, release the corresponding
evaluators and proceed with a new job retry.
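
To make the proposal concrete, here is a minimal sketch of the decision rule in Python. It is not the REEF driver code: the names TaskState, RecoveryDecision and decide_recovery, the 60-second default timeout, and the reading of "5%" as "at most 5% of tasks still pending" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    # Hypothetical states; the real IMRU driver tracks its own state machine.
    RUNNING = "running"
    WAITING_FOR_CLOSE = "waiting_for_close"
    COMPLETED = "completed"
    FAILED = "failed"


FINAL_STATES = {TaskState.COMPLETED, TaskState.FAILED}


@dataclass
class RecoveryDecision:
    proceed: bool           # True if the driver may start the next job retry
    tasks_to_release: list  # stragglers whose evaluators should be released


def decide_recovery(task_states, elapsed_seconds,
                    timeout_seconds=60.0, max_straggler_fraction=0.05):
    """Decide whether recovery may proceed without waiting for every task.

    task_states maps a task id to its TaskState. elapsed_seconds is the time
    since shutdown was requested. Both defaults are assumed values.
    """
    stragglers = sorted(t for t, s in task_states.items()
                        if s not in FINAL_STATES)
    if not stragglers:
        # Current behaviour: every task is already in a final state.
        return RecoveryDecision(True, [])
    if elapsed_seconds < timeout_seconds:
        # Still inside the expected shutdown period: keep waiting.
        return RecoveryDecision(False, [])
    if len(stragglers) <= max_straggler_fraction * len(task_states):
        # Timed out with only a few stragglers: give up on them and retry.
        return RecoveryDecision(True, stragglers)
    # Timed out with too many tasks still pending: keep waiting.
    return RecoveryDecision(False, [])
```

With 100 tasks of which 2 are still not final, this sketch keeps waiting until the timeout expires and then returns those 2 tasks so that their evaluators can be released before the retry. What the driver should do when more than 5% of tasks are still pending after the timeout is left open in the proposal; the sketch simply keeps waiting in that case.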




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
