reef-dev mailing list archives

From "Markus Weimer (JIRA)" <>
Subject [jira] [Commented] (REEF-1335) Create State Machine for IMRU fault tolerance
Date Mon, 25 Apr 2016 21:33:12 GMT


Markus Weimer commented on REEF-1335:

Some background: in the pre-Apache days, all Evaluators actually shared the same thread pool.
That was nice, because an individual bad event handler could not block all events. However,
it created a lot of challenges when events arrived out of order on a per-Evaluator basis.
Hence, we arrived at the current design: one thread pool per Evaluator, with a default size
of 1.

This is of course questionable. An alternative and likely better design would guarantee event
order per-Evaluator, but still use a shared thread pool whose size is independent of the number
of active Evaluators.
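
To make the alternative concrete, here is a minimal sketch of a shared pool that still
preserves per-Evaluator order. It is written in Python for brevity and is not REEF code:
the class name KeyedSerialExecutor, its methods, and the use of a plain string as the
Evaluator id are all assumptions made for illustration.

```python
import logging
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class KeyedSerialExecutor:
    """Hypothetical sketch: one shared pool, handlers run in order per key."""

    def __init__(self, max_workers):
        # Pool size is fixed and independent of the number of keys (Evaluators).
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        # key -> queue of waiting handlers; a key is present only while
        # a drain task for it is scheduled or running.
        self._pending = {}

    def submit(self, key, handler):
        with self._lock:
            queue = self._pending.get(key)
            if queue is not None:
                # A drain task already owns this key; just enqueue behind it.
                queue.append(handler)
                return
            self._pending[key] = deque([handler])
        self._pool.submit(self._drain, key)

    def _drain(self, key):
        # Runs this key's handlers one at a time, in submission order.
        while True:
            with self._lock:
                queue = self._pending[key]
                if not queue:
                    del self._pending[key]
                    return
                handler = queue.popleft()
            try:
                handler()
            except Exception:
                # A failing handler must not stall later events for this key.
                logging.exception("handler for %r failed", key)

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

One trade-off in this sketch: a drain task keeps its worker until the key's queue is empty,
so a very busy Evaluator can hold a thread for a long time. A slow handler still only blocks
its own Evaluator's events, as long as the pool has free workers.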

> Create State Machine for IMRU fault tolerance
> ---------------------------------------------
>                 Key: REEF-1335
>                 URL:
>             Project: REEF
>          Issue Type: Task
>          Components: IMRU, REEF.NET
>            Reporter: Julia
>            Assignee: Julia
>              Labels: FT
>             Fix For: 0.15
>         Attachments: REEF Fault Tolerant Technical design.docx
> To support fault tolerance, we would like to use a state machine to control the system state.
> After the driver is created, it starts in the request-evaluators-and-submit-contexts state;
> after all the contexts are ready, it moves to the submitting-tasks state; when all the tasks
> have started running, it moves to the tasks-running state; when all the tasks are completed,
> the state changes to tasks completed. If either tasks or evaluators fail, it changes
> to the shut-down state, etc.
> Here are the proposed system states:
> * WaitingForEvaluator,
> * SubmitingTasks,
> * TasksRunning,
> * TasksCompleted,
> * ShutingDown,
> * Fail
> Here are the events that may trigger the state change:
> * AllContextsAreReady,
> * AllTasksAreRunning,
> * AllTasksAreCompleted,
> * FailedTask,
> * FailedEvaluator,
> * NotRecoverable,
> * Recover
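
The states and events quoted above can be sketched as a small transition table. This is an
illustration in Python, not the REEF.NET implementation. The names SystemState, SystemEvent
and StateMachine are hypothetical, and the state and event identifiers are copied as spelled
in the issue. The issue text only spells out the happy path and the move to shutting down on
failure; the transitions out of ShutingDown (Recover, NotRecoverable, repeated failures) are
assumptions.

```python
from enum import Enum, auto


class SystemState(Enum):
    WaitingForEvaluator = auto()
    SubmitingTasks = auto()
    TasksRunning = auto()
    TasksCompleted = auto()
    ShutingDown = auto()
    Fail = auto()


class SystemEvent(Enum):
    AllContextsAreReady = auto()
    AllTasksAreRunning = auto()
    AllTasksAreCompleted = auto()
    FailedTask = auto()
    FailedEvaluator = auto()
    NotRecoverable = auto()
    Recover = auto()


S, E = SystemState, SystemEvent

# Happy path, as described in the issue text.
TRANSITIONS = {
    (S.WaitingForEvaluator, E.AllContextsAreReady): S.SubmitingTasks,
    (S.SubmitingTasks, E.AllTasksAreRunning): S.TasksRunning,
    (S.TasksRunning, E.AllTasksAreCompleted): S.TasksCompleted,
}

# Any task or evaluator failure before completion moves to shutting down.
# Staying in ShutingDown on further failures is an assumption.
for state in (S.WaitingForEvaluator, S.SubmitingTasks, S.TasksRunning, S.ShutingDown):
    for event in (E.FailedTask, E.FailedEvaluator):
        TRANSITIONS[(state, event)] = S.ShutingDown

# ASSUMPTION: the issue text does not say where these two events lead.
TRANSITIONS[(S.ShutingDown, E.Recover)] = S.WaitingForEvaluator
TRANSITIONS[(S.ShutingDown, E.NotRecoverable)] = S.Fail


class StateMachine:
    def __init__(self):
        self.state = S.WaitingForEvaluator

    def fire(self, event):
        """Apply an event; reject transitions the table does not list."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{event.name} is not valid in state {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state
```

Rejecting unlisted transitions with an exception is one possible choice; a driver might
instead prefer to log and ignore late events.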

This message was sent by Atlassian JIRA