reef-dev mailing list archives

From "Markus Weimer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1223) IMRU Fault Tolerance - restart failed evaluators
Date Wed, 23 Mar 2016 16:25:26 GMT

    [ https://issues.apache.org/jira/browse/REEF-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208702#comment-15208702 ]

Markus Weimer commented on REEF-1223:
-------------------------------------

One approach to tackle this is to refactor the current IMRU Driver around a state machine.
Overly simplified, the IMRU Driver can be in one of three states:

  * {{WAITING_FOR_EVALUATORS}}: In this state, we are waiting for Evaluators to be allocated
and for data to be loaded into them. This is the state the Driver starts in. It is also the
state the Driver returns to after suffering a failure. In this state, the event handler for {{IAllocatedEvaluator}}
submits the data loading context, and the handler for {{IActiveContext}} collects the {{IActiveContext}}
instances into a collection. Once that collection has the required number of entries, the
Driver enters the next state:
  * {{SUBMITTING_TASKS}}: In this state, we submit all the tasks to the contexts. This is
when we define the communication group and its topology. Note: This is different from today,
where we define the communication groups in the driver's constructor. In the handler for {{IRunningTask}},
we again count the number of received events. Once we have all the expected tasks, we enter
the next state:
  * {{TASKS_RUNNING}}: This is when all tasks are running. If we receive a failure in this
state, we clean up and enter {{WAITING_FOR_EVALUATORS}} again.

In this way, the various event handlers all contain a {{switch(DRIVER_STATE)...}} and take
the appropriate action for that state. We'd probably want to lock all of them on the state
object itself, as they are all likely to change it.
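
To make the idea concrete, here is a minimal sketch in plain Java. It is not the REEF API: the class {{ImruDriverSketch}}, the handler names and the use of string IDs in place of {{IActiveContext}} and {{IRunningTask}} are all hypothetical stand-ins, and how a failure is handled outside {{TASKS_RUNNING}} is an assumption the description above does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a state-machine driver; not the REEF API. */
final class ImruDriverSketch {
    enum State { WAITING_FOR_EVALUATORS, SUBMITTING_TASKS, TASKS_RUNNING }

    // Dedicated monitor shared by all handlers (assumption: see note below).
    private final Object lock = new Object();
    private final int expected;                    // required number of contexts/tasks
    private final List<String> activeContexts = new ArrayList<>();
    private int runningTasks = 0;
    private State state = State.WAITING_FOR_EVALUATORS;

    ImruDriverSketch(int expected) { this.expected = expected; }

    State state() { synchronized (lock) { return state; } }

    /** Stand-in for the IActiveContext handler. */
    void onActiveContext(String contextId) {
        synchronized (lock) {
            switch (state) {
                case WAITING_FOR_EVALUATORS:
                    activeContexts.add(contextId);
                    if (activeContexts.size() == expected) {
                        // A real driver would define the communication group
                        // and topology here, then submit the tasks.
                        state = State.SUBMITTING_TASKS;
                    }
                    break;
                default:
                    break; // unexpected in other states; ignored in this sketch
            }
        }
    }

    /** Stand-in for the IRunningTask handler. */
    void onRunningTask(String taskId) {
        synchronized (lock) {
            switch (state) {
                case SUBMITTING_TASKS:
                    runningTasks++;
                    if (runningTasks == expected) {
                        state = State.TASKS_RUNNING;
                    }
                    break;
                default:
                    break; // ignored in this sketch
            }
        }
    }

    /** Stand-in for a failure handler: clean up and wait for a replacement. */
    void onFailedEvaluator(String contextId) {
        synchronized (lock) {
            switch (state) {
                case SUBMITTING_TASKS: // assumption: treated like TASKS_RUNNING
                case TASKS_RUNNING:
                    runningTasks = 0;  // all tasks are stopped and resubmitted later
                    activeContexts.remove(contextId); // healthy contexts are kept
                    state = State.WAITING_FOR_EVALUATORS;
                    break;
                case WAITING_FOR_EVALUATORS:
                    activeContexts.remove(contextId);
                    break;
            }
        }
    }

    public static void main(String[] args) {
        ImruDriverSketch driver = new ImruDriverSketch(2);
        driver.onActiveContext("c1");
        driver.onActiveContext("c2");
        driver.onRunningTask("t1");
        driver.onRunningTask("t2");
        System.out.println(driver.state());
        driver.onFailedEvaluator("c2");
        System.out.println(driver.state());
    }
}
```

One design note on the sketch: it synchronizes on a separate {{lock}} object rather than on the {{state}} field, because the field is reassigned on every transition and so would not be a stable monitor in Java. The intent is the same as locking on the state object: every handler takes the same lock before reading or changing the state.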

Makes sense?

> IMRU Fault Tolerance - restart failed evaluators
> ------------------------------------------------
>
>                 Key: REEF-1223
>                 URL: https://issues.apache.org/jira/browse/REEF-1223
>             Project: REEF
>          Issue Type: New Feature
>          Components: IMRU, REEF.NET
>            Reporter: Julia
>            Assignee: Julia
>
> Currently, in the .NET Group Communication and IMRU scenario, if one of the Evaluators fails
> for whatever reason, all the Evaluators are killed by the driver.
> There are multiple levels of fault tolerance. The scenario we would like to support in
> this JIRA is:
> *  When an Evaluator fails, the failed Evaluator is killed and the other good Evaluators
> stay, but all the tasks running on those Evaluators are stopped.
> *  A new Evaluator is requested and started with the original task.
> *  The same tasks are resubmitted to the rest of the Evaluators.
> *  The topology of those tasks is kept in the same group communication as before.
> *  The data that has been downloaded to those good Evaluators stays.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
