reef-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Saikat Kanjilal <>
Subject Re: REEF scheduler: Tasks groups and fault tolerance
Date Wed, 12 Jul 2017 16:26:35 GMT
Hi Matteo,
This is great, I'm trying to wrap my head around this and give more
concrete feedback but am looking for 2 additional things, is it possible
 to see like an interaction diagram between all the components you identify
above and an additional explanation around where it fits within the current
workflow, being somewhat new to reef this would help me understand the
overall workflow and help drive this proposal forward.   Also should we
create a JIRA around this?
Thanks in advance for doing this.

On Tue, Jul 11, 2017 at 8:25 PM, Matteo Interlandi <>

> Hi,
> I am working a scheduling layer on REEF for managing group of tasks
> (TaskSets) and related fault-tolerance policies and mechanisms. I would
> like to share the design I have in mind with you and gather some feedbacks.
> The idea is to have the usual two level scheduling (physical and logical
> task scheduler). At the logical level I have a *service* provider and a set
> of *subscriptions*. Subscriptions contains a pipeline of *operators*. For
> the moment I am only considering group communication operators plus
> *iterate*. Each operator (as in group communication) takes as a parameter a
> *topology*, plus a user defined *failure state machine*, and a *checkpoint
> policy*. The failure machine implements different response on the event of
> a failure based on how many data points are lost as a consequence of a
> failure (as a form or ratio between lost and initial data points). Lost
> data points are computed by looking at the topology: for instance in case
> of a broadcast under the tree topology, if a receiver task fails, one data
> point is lost; conversely if a non-leaf node goes down all data points
> under its branch are lost.
> Tasks are grouped into *task sets*. Each task set subscribe to one or more
> subscriptions, based on the type of operators it wants to implement. When
> available, tasks are added to the task sets and try to be added to each
> subscriptions the task set is subscribed to. A subscription may or may not
> add a task based on its semantics. Each task configuration is the union
> between the task set conf, each subscription the task was successfully
> subscribed too (which also contains the conf for each operator in the
> subscription) plus the configuration of the service.
> When a failure occurs during the execution of an operator, the task set
> forward it to subscription containing the failing operation, which update
> the failure rate on its related failure state machine and eventually
> generate a failure response. For the moment I am considering as failure
> responses / failure states: continue, reconfigure (e.g., reconfigure the
> topology so that computation can be continue), continue and reschedule a
> task, stop the world and wait for a new task to be active. (New failure
> machine implementations can be provided by the users by extending the
> IFailureStateMachine interface). The failure is then propagated to the next
> operator in the pipeline and up to the service (i.e., multiple operators
> may fails in different subscriptions therefore the final decision have to
> be taken at the service level). The service decides which action to take
> based on its own failure state machine. The action is implemented by
> related mechanism at the task set level.
> I have two goals in mind when developing this design:
> - have a design expressible enough to be applicable to different scenarios
> (e.g., IMRU and parameter servers).
> - have a design still flexible enough so that any failure response
> mechanism could be implemented (from do nothing, to recompute the state
> using lineage information from previous operators).
> Any feedback is welcome!
> Best,
> Matteo

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message