reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1223) IMRU Fault Tolerance - restart failed evaluators
Date Tue, 17 May 2016 01:00:20 GMT

    [ https://issues.apache.org/jira/browse/REEF-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285772#comment-15285772 ]

Julia commented on REEF-1223:
-----------------------------

We have only two kinds of state: system state, which is controlled in the driver, and task
state, which is controlled in the Task Manager. EvaluatorManager and Context manager don't manage
any state. They simply maintain the Evaluator and Context collections so that, in fault-tolerant
scenarios, we know which ones have failed, which ones are allocated, and so on.
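
To illustrate the "collections only, no state machine" idea, here is a minimal sketch in
Python. The class name, method names, and evaluator ids are hypothetical and are not taken
from the REEF code base; the point is only that such a manager records membership and leaves
state transitions to the driver and the Task Manager.

```python
class EvaluatorCollection:
    """Hypothetical bookkeeping helper: tracks ids, holds no state machine."""

    def __init__(self):
        self.allocated = set()
        self.failed = set()

    def add_allocated(self, evaluator_id):
        self.allocated.add(evaluator_id)

    def record_failure(self, evaluator_id):
        # Move the id from the allocated set to the failed set.
        self.allocated.discard(evaluator_id)
        self.failed.add(evaluator_id)

    def is_failed(self, evaluator_id):
        return evaluator_id in self.failed


# Usage with made-up evaluator ids.
evaluators = EvaluatorCollection()
for eid in ("eval-1", "eval-2", "eval-3"):
    evaluators.add_allocated(eid)
evaluators.record_failure("eval-2")
```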

For errors, from a fault tolerance perspective, yes, ultimately one kind is recoverable and the
other is not. If we want to consolidate them into system error and application error,
that is simple. But if we would like a little more granularity in future, such as group
communication error, system error, or error caused by a failed evaluator, for logging or
statistics purposes, the cost is not high.
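
A rough sketch of why the extra granularity is cheap: the fine-grained kinds can be kept for
logging and statistics while the recovery decision still collapses to a yes/no answer. The
enum members mirror the categories named above, but the names, and in particular the mapping
of each kind to recoverable or not, are assumptions made for illustration only.

```python
from collections import Counter
from enum import Enum


class ErrorKind(Enum):
    GROUP_COMMUNICATION = "group communication error"
    SYSTEM = "system error"
    FAILED_EVALUATOR = "failed evaluator"
    APPLICATION = "application error"


# Assumption: which kinds count as recoverable is illustrative, not from the design.
RECOVERABLE = {
    ErrorKind.GROUP_COMMUNICATION,
    ErrorKind.SYSTEM,
    ErrorKind.FAILED_EVALUATOR,
}

error_counts = Counter()


def record_error(kind):
    """Count the fine-grained kind, then return the coarse recoverable/not answer."""
    error_counts[kind] += 1
    return kind in RECOVERABLE


can_retry_after_evaluator_loss = record_error(ErrorKind.FAILED_EVALUATOR)
can_retry_after_app_error = record_error(ErrorKind.APPLICATION)
```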


> IMRU Fault Tolerance - restart failed evaluators
> ------------------------------------------------
>
>                 Key: REEF-1223
>                 URL: https://issues.apache.org/jira/browse/REEF-1223
>             Project: REEF
>          Issue Type: New Feature
>          Components: IMRU, REEF.NET
>            Reporter: Julia
>            Assignee: Julia
>              Labels: FT
>         Attachments: REEF Fault Tolerant Technical design.docx
>
>
> Currently, in the .NET Group Communication and IMRU scenario, if one of the Evaluators fails
> for whatever reason, all the Evaluators will be killed by the driver.
> There are multiple levels of fault tolerance. The scenario we would like to support in
> this JIRA is:
> *  When an Evaluator fails, the failed Evaluator will be killed and the other good Evaluators
> will stay, but all the tasks running on those Evaluators will be stopped.
> *  A new Evaluator will be requested and started with the original task.
> *  The same tasks will be resubmitted to the rest of the Evaluators.
> *  The topology of those tasks will be kept in the same group communication as before.
> *  The data that has been downloaded to those good Evaluators will stay.
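
The scenario in the quoted description can be sketched as a small bookkeeping function. This
is only an illustration under stated assumptions: the function name, the dictionaries, and
the evaluator and task ids are hypothetical, and real recovery would go through the driver
and group communication rather than plain dictionaries.

```python
def recover(assignments, cached_data, failed_id, new_id):
    """Return (new_assignments, new_cache) after a single evaluator failure.

    assignments: evaluator id -> task id (hypothetical representation)
    cached_data: evaluator id -> data already downloaded on that evaluator
    """
    if failed_id not in assignments:
        raise KeyError(failed_id)

    new_assignments = {}
    for evaluator_id, task_id in assignments.items():
        if evaluator_id == failed_id:
            # The replacement evaluator is started with the original task.
            new_assignments[new_id] = task_id
        else:
            # Surviving evaluators get the same task resubmitted.
            new_assignments[evaluator_id] = task_id

    # Data on surviving evaluators stays; the failed evaluator's copy is lost.
    new_cache = {e: d for e, d in cached_data.items() if e != failed_id}
    return new_assignments, new_cache


assignments = {"eval-1": "update-task", "eval-2": "map-task-1", "eval-3": "map-task-2"}
cache = {"eval-2": "partition-1", "eval-3": "partition-2"}
new_assignments, new_cache = recover(assignments, cache, "eval-2", "eval-4")
```

Because the task ids are unchanged, the set of tasks (and so the shape of the topology) is
the same before and after recovery; only the evaluator hosting one task differs.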



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
