reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1223) IMRU Fault Tolerance - restart failed evaluators
Date Mon, 28 Mar 2016 00:56:25 GMT

    [ https://issues.apache.org/jira/browse/REEF-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213698#comment-15213698 ]

Julia commented on REEF-1223:
-----------------------------

In addition to the system states above, we also need task states to control the status
of each task in the flow.

* TASK_NEW – the task configuration has been created and the task is queued, waiting to be
submitted
* TASK_SUBMITTING – the task has been submitted
* TASK_RUNNING – a Running Task event has been received
* TASK_WAITING_FOR_CLOSE – the driver has sent an event to close the task
* TASK_CLOSED_BY_DRIVER – a Failed Task event has been received with a message showing that
the task was closed by the driver
* TASK_FAILED_BY_EVALUATOR_FAILURE – a Failed Evaluator event has been received that carries
the failed task
* TASK_FAILED_COMMUNICATION – a Failed Task event has been received with a message showing that
the failure was caused by the task not being able to get stream messages in the communication
group. For example, if a task cannot receive messages from its children, it should throw a
TaskException with a specified message so that the driver can distinguish this case.
* TASK_FAILED_APP_ERROR – a Failed Task event has been received with a message showing that the
error was caused by the application. We need to review the current code to make sure it throws
a TaskException with a specified app error message when the error is caused by the application.
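
A minimal sketch of how the driver might hold these states and map incoming events onto them.
It is written in Python for brevity and is not REEF code. The event names ("RunningTask",
"FailedTask", "FailedEvaluator") and the three message markers are hypothetical: the list above
only says the TaskException carries a "specified message", not what that message is.

```python
from enum import Enum, auto


class TaskState(Enum):
    """The eight task states proposed in the list above."""
    TASK_NEW = auto()
    TASK_SUBMITTING = auto()
    TASK_RUNNING = auto()
    TASK_WAITING_FOR_CLOSE = auto()
    TASK_CLOSED_BY_DRIVER = auto()
    TASK_FAILED_BY_EVALUATOR_FAILURE = auto()
    TASK_FAILED_COMMUNICATION = auto()
    TASK_FAILED_APP_ERROR = auto()


# Hypothetical markers a task would embed in its TaskException message.
CLOSED_BY_DRIVER_MARKER = "ClosedByDriver"
COMMUNICATION_ERROR_MARKER = "CommunicationError"
APP_ERROR_MARKER = "AppError"


def state_for_failed_task(message: str) -> TaskState:
    """Classify a Failed Task event by the marker found in its message."""
    if CLOSED_BY_DRIVER_MARKER in message:
        return TaskState.TASK_CLOSED_BY_DRIVER
    if COMMUNICATION_ERROR_MARKER in message:
        return TaskState.TASK_FAILED_COMMUNICATION
    if APP_ERROR_MARKER in message:
        return TaskState.TASK_FAILED_APP_ERROR
    # Assumption: an unmarked failure is surfaced instead of being guessed at.
    raise ValueError(f"unrecognized Failed Task message: {message!r}")


def state_for_event(event: str, message: str = "") -> TaskState:
    """Map a driver-side event (hypothetical names) to the resulting task state."""
    if event == "RunningTask":
        return TaskState.TASK_RUNNING
    if event == "FailedEvaluator":
        return TaskState.TASK_FAILED_BY_EVALUATOR_FAILURE
    if event == "FailedTask":
        return state_for_failed_task(message)
    raise ValueError(f"unrecognized event: {event!r}")
```

Keeping the classification in one function means the driver only has to agree with the tasks on
the marker strings; everything else follows from the event type.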


> IMRU Fault Tolerance - restart failed evaluators
> ------------------------------------------------
>
>                 Key: REEF-1223
>                 URL: https://issues.apache.org/jira/browse/REEF-1223
>             Project: REEF
>          Issue Type: New Feature
>          Components: IMRU, REEF.NET
>            Reporter: Julia
>            Assignee: Julia
>
> Currently, in the .NET Group Communication and IMRU scenario, if one of the Evaluators fails
> for whatever reason, all the Evaluators are killed by the driver.
> There are multiple levels of fault tolerance. The scenario we would like to support in
> this JIRA is:
> *  When an Evaluator fails, it is killed; the other, good Evaluators stay, but all the
> tasks running on them are stopped.
> *  A new Evaluator is requested and started with the original task.
> *  The same tasks are resubmitted to the rest of the Evaluators.
> *  The topology of those tasks is kept in the same group communication as before.
> *  The data that has been downloaded to those good Evaluators stays.
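
The recovery scenario quoted above can be sketched as a pure planning function. This is an
illustration in Python, not REEF code: RecoveryPlan, plan_recovery, NEW_EVALUATOR and the
evaluator/task ids are all hypothetical names, and it assumes one task per Evaluator.

```python
from dataclasses import dataclass
from typing import Dict, List

# Placeholder for the replacement Evaluator, whose id is not known until it is allocated.
NEW_EVALUATOR = "<new evaluator>"


@dataclass
class RecoveryPlan:
    kill: List[str]              # Evaluators to kill
    stop_tasks_on: List[str]     # surviving Evaluators whose tasks are stopped
    request_new_evaluators: int  # number of replacement Evaluators to request
    resubmit: Dict[str, str]     # task id -> Evaluator that will run it again


def plan_recovery(assignments: Dict[str, str], failed_evaluator: str) -> RecoveryPlan:
    """Build a plan from `assignments` (Evaluator id -> task id) after one failure."""
    if failed_evaluator not in assignments:
        raise KeyError(failed_evaluator)
    survivors = sorted(e for e in assignments if e != failed_evaluator)
    # Surviving Evaluators keep their downloaded data and get the same task back.
    resubmit = {assignments[e]: e for e in survivors}
    # The failed Evaluator's original task goes to the replacement.
    resubmit[assignments[failed_evaluator]] = NEW_EVALUATOR
    return RecoveryPlan(
        kill=[failed_evaluator],
        stop_tasks_on=survivors,
        request_new_evaluators=1,
        resubmit=resubmit,
    )
```

Because every task keeps its id, the group communication topology does not need to be rebuilt;
only the placement of the failed Evaluator's task changes.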



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
