reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1223) IMRU Fault Tolerance - restart failed evaluators
Date Tue, 22 Mar 2016 04:25:25 GMT

    [ https://issues.apache.org/jira/browse/REEF-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205766#comment-15205766 ]

Julia commented on REEF-1223:
-----------------------------

There are two big aspects of the work:
 
Identifying the REEF features that fault tolerance needs but that are missing, and implementing
them. Examples are supporting the Close event handler in REEF .Net and handling failed Evaluators
gracefully instead of killing all the others.
 
The fault-tolerant flow. We could put all of the implementation detail inside the handlers in the
driver, which is what my POC actually does. But that means every driver has to rewrite the
fault-tolerance logic itself. If possible, we could instead abstract the common logic, such as
managing failed evaluators and monitoring task status, into a separate class (or classes).
Those classes can live at the REEF/GC level, and the driver just uses them instead of growing
into a huge fat driver. That is what I talked about in the scrum today.
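
To make the "separate class" idea concrete, here is a minimal sketch of the kind of helper a driver
could delegate to. It is written in Python purely for illustration (REEF .Net is C#), and every
name in it (FaultTolerantTracker, TaskState, the on_* methods) is hypothetical, not an existing
REEF or Group Communication API.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "running"
    STOPPED = "stopped"


class FaultTolerantTracker:
    """Hypothetical helper, not part of REEF: tracks evaluators and task status."""

    def __init__(self):
        self._tasks = {}      # evaluator id -> (task id, TaskState)
        self._failed = set()  # ids of evaluators that have failed

    def on_task_running(self, evaluator_id, task_id):
        # Record that a task is running on the given evaluator.
        self._tasks[evaluator_id] = (task_id, TaskState.RUNNING)

    def on_task_stopped(self, evaluator_id):
        # The evaluator stays alive; only its task is marked as stopped.
        task_id, _ = self._tasks[evaluator_id]
        self._tasks[evaluator_id] = (task_id, TaskState.STOPPED)

    def on_evaluator_failed(self, evaluator_id):
        # Forget the failed evaluator and return the task that needs a new home.
        task_id, _ = self._tasks.pop(evaluator_id)
        self._failed.add(evaluator_id)
        return task_id

    def healthy_evaluators(self):
        return sorted(self._tasks)

    def failed_evaluators(self):
        return sorted(self._failed)

    def all_tasks_stopped(self):
        # True once every surviving evaluator has a stopped task.
        return all(state is TaskState.STOPPED for _, state in self._tasks.values())
```

With something like this, the driver's event handlers shrink to one-line calls into the tracker,
and the bookkeeping can be shared by every driver that needs fault tolerance.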


> IMRU Fault Tolerance - restart failed evaluators
> ------------------------------------------------
>
>                 Key: REEF-1223
>                 URL: https://issues.apache.org/jira/browse/REEF-1223
>             Project: REEF
>          Issue Type: New Feature
>          Components: IMRU, REEF.NET
>            Reporter: Julia
>            Assignee: Julia
>
> Currently, in the .Net Group Communication and IMRU scenario, if one of the Evaluators fails
> for whatever reason, all the Evaluators are killed by the driver.
> There are multiple levels of fault tolerance. The scenario we would like to support in
> this JIRA is:
> *  When an Evaluator fails, the failed Evaluator is killed and the other good Evaluators
> stay, but all the tasks running on those Evaluators are stopped.
> *  A new Evaluator is requested and started with the original task.
> *  The same tasks are resubmitted to the rest of the Evaluators.
> *  The topology of those tasks is kept in the same group communication as before.
> *  The data that has been downloaded to those good Evaluators stays.
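
The recovery scenario in the issue description can be read as an ordered list of driver actions.
The sketch below is an illustration only, in Python rather than C#. The function plan_recovery
and the action names are hypothetical, not REEF APIs. It assumes that reusing the original task
ids is what keeps the group communication topology unchanged, and that leaving the surviving
Evaluators alive is what preserves their downloaded data.

```python
def plan_recovery(assignments, failed_evaluator, new_evaluator):
    """Return the ordered driver actions for recovering from one failed evaluator.

    assignments maps evaluator id -> task id (hypothetical ids, for illustration).
    Each action is a tuple (action name, evaluator id, task id or None).
    """
    if failed_evaluator not in assignments:
        raise KeyError(f"unknown evaluator: {failed_evaluator}")

    survivors = [e for e in assignments if e != failed_evaluator]

    # 1. Kill only the failed evaluator; the good ones stay alive.
    actions = [("kill_evaluator", failed_evaluator, None)]
    # 2. Stop the tasks on the surviving evaluators without releasing them.
    actions += [("stop_task", e, assignments[e]) for e in survivors]
    # 3. Request a replacement and give it the failed evaluator's original task.
    actions.append(("request_evaluator", new_evaluator, None))
    actions.append(("submit_task", new_evaluator, assignments[failed_evaluator]))
    # 4. Resubmit the same tasks, with the same ids, to the survivors.
    actions += [("submit_task", e, assignments[e]) for e in survivors]
    return actions
```

For example, plan_recovery({"e1": "t1", "e2": "t2"}, "e2", "e3") kills e2, stops t1 on e1,
requests e3, submits t2 to e3, and then resubmits t1 to e1.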



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
