reef-dev mailing list archives

From Markus Weimer
Subject Re: Issues with IActiveContext.SubmitContextAndService
Date Tue, 22 Mar 2016 17:55:11 GMT
On 2016-03-22 10:41, Julia Wang (QIUHE) wrote:
> For this phase, what we want to handle is if evaluators/contexts
> fail, request new evaluators, stop un-impacted tasks, then start
> entire group again.

I think you mean the same thing. Let me try to be more precise to sort
out the misunderstanding.

Let's say we run an IMRU job on a lot of Evaluators, and one of them,
let's call it "F", fails. That is, on the Driver we receive an
`IFailedEvaluator` for Evaluator F.

Today, this means that the job fails, because IMRU doesn't have a
handler bound for `IFailedEvaluator`.

In REEF-1223, we aim to add such an event handler. The goal is to react
to the failure in the following way:

  1. Request a replacement F' for F.
  2. Load the partition managed in F into F'.

and in parallel:

  3. Shut down the current set of still-running Tasks.

After which we can

  4. Start a fresh set of IMRU tasks on the Evaluators, including F'.
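
To make the ordering of the four steps concrete, here is a minimal
sketch in Python. It is not REEF code: `recover`, the string Evaluator
ids and the `log` list are all hypothetical stand-ins, and the real
handler would issue requests to the resource manager instead of
appending to a list. It only illustrates that steps 1 and 2 run
alongside step 3, and that step 4 waits for both.

```python
from concurrent.futures import ThreadPoolExecutor


def recover(evaluators, failed, log):
    """Hypothetical recovery plan; returns the Evaluators of the restarted job."""
    survivors = [e for e in evaluators if e != failed]

    def replace_and_load():
        replacement = failed + "'"  # step 1: request F' (stand-in for a real request)
        log.append(f"requested {replacement}")
        # Step 2: load the partition that F managed into F'.
        log.append(f"loaded partition of {failed} into {replacement}")
        return replacement

    def stop_running_tasks():
        # Step 3: shut down the Tasks that are still running.
        for e in survivors:
            log.append(f"stopped task on {e}")

    # Steps 1+2 and step 3 proceed in parallel; leaving the block joins both.
    with ThreadPoolExecutor(max_workers=2) as pool:
        replacement_future = pool.submit(replace_and_load)
        stop_future = pool.submit(stop_running_tasks)
        replacement = replacement_future.result()
        stop_future.result()

    # Step 4: start a fresh set of tasks, including the one on F'.
    restarted = survivors + [replacement]
    for e in restarted:
        log.append(f"started task on {e}")
    return restarted
```

Calling `recover(["A", "F", "B"], "F", log)` returns the survivors plus
F'. The relative order of the "stopped" and "requested"/"loaded" log
entries may vary between runs, but every "started" entry comes last.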

Now, Dhruv points out that when an Evaluator fails, the group
communication in the "neighboring" Evaluators will also fail, which
leads to a cascade of related failures we receive in the Driver.

We cannot allow these failures to take down the otherwise fine
Evaluators. Dhruv's proposal is to separate the Data and the Group
Communications into separate Contexts on the Evaluators. That way, the
failing group communications won't affect the data stored in those
Evaluators.

An alternative design is to keep the Group Communications in the Tasks.
That way, we receive a bunch of `IFailedTask` events in the Driver, and
the data kept in the (only) Context in the Evaluators is unaffected.

Of course, both of these approaches hinge on our ability to identify
these `IFailedContext` or `IFailedTask` events as part of the (expected)
failures after we received an `IFailedEvaluator`. This can be very
tricky, as there is no guarantee that the `IFailedEvaluator` is actually
received first in time.
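
One possible way to cope with the ordering problem, sketched here in
Python purely as an illustration, is to hold back `IFailedTask` /
`IFailedContext` events for a grace period and only then decide whether
they belong to an Evaluator failure. Everything in the sketch is an
assumption: the `FailureCorrelator` class, its method names, and the
idea that a fixed time window is a good enough signal. None of it comes
from REEF or from REEF-1223.

```python
class FailureCorrelator:
    """Hypothetical helper: groups secondary failures with Evaluator failures
    that happen close in time, no matter which event arrives first."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.evaluator_failure_times = []
        self.secondary_failures = []  # (time, evaluator_id) pairs

    def on_failed_evaluator(self, time):
        self.evaluator_failure_times.append(time)

    def on_secondary_failure(self, evaluator_id, time):
        # Called for a failed Task or Context on an otherwise healthy Evaluator.
        self.secondary_failures.append((time, evaluator_id))

    def classify(self, now):
        result = {"cascade": [], "independent": [], "undecided": []}
        for time, evaluator_id in self.secondary_failures:
            near = any(abs(time - t) <= self.window
                       for t in self.evaluator_failure_times)
            if near:
                # An Evaluator failed shortly before or after: expected fallout.
                result["cascade"].append(evaluator_id)
            elif now - time <= self.window:
                # The matching Evaluator failure may still be on its way.
                result["undecided"].append(evaluator_id)
            else:
                # The grace period passed without an Evaluator failure.
                result["independent"].append(evaluator_id)
        return result
```

Timestamps are passed in explicitly so the behaviour is easy to reason
about. A secondary failure stays "undecided" until either an Evaluator
failure shows up within the window or the window expires, which is the
price paid for not relying on arrival order.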

