reef-dev mailing list archives

From John Yang <>
Subject Re: Question about failure scenarios
Date Thu, 23 Jun 2016 05:38:03 GMT
Julia, with regard to REEF Task submission, here's how Vortex goes
about it.

- Upon AllocatedEvaluatorHandler, Vortex submits a REEF Task without any
bookkeeping. (Note that Vortex's REEF Task does not do anything useful
upon instantiation; it only waits for the Driver to schedule tasklets
onto it.)
- *(If FailedEvaluatorHandler is invoked here)* We request a new
Evaluator. FailedEvaluator#getFailedTask returns null, which we ignore.
- RunningTaskHandler is invoked and we memorize the task's id. (At this
point, Vortex considers the Evaluator ready to have tasklets scheduled
onto it.)
- *(If FailedEvaluatorHandler is invoked here)* We request a new
Evaluator and obtain the task id from FailedEvaluator#getFailedTask to
un-memorize it.
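
To make the bookkeeping concrete, here is a minimal sketch of the steps
above in plain Java. It does not use the REEF API: the class
RunningTaskBook and all of its method names are hypothetical stand-ins
for the three handlers, and a null argument plays the role of
FailedEvaluator#getFailedTask returning null. It is not Vortex's actual
code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the Driver-side bookkeeping; not Vortex's real class.
final class RunningTaskBook {
  private final Set<String> runningTaskIds = new HashSet<>();
  private int evaluatorRequests = 0;

  // Stand-in for AllocatedEvaluatorHandler: the Task is submitted,
  // but nothing is recorded yet.
  synchronized void onEvaluatorAllocated() {
    // intentionally no bookkeeping
  }

  // Stand-in for RunningTaskHandler: from now on tasklets may be scheduled.
  synchronized void onTaskRunning(final String taskId) {
    runningTaskIds.add(taskId);
  }

  // Stand-in for FailedEvaluatorHandler. failedTaskId is null when the
  // Evaluator failed before its Task was reported as running.
  synchronized void onEvaluatorFailed(final String failedTaskId) {
    evaluatorRequests++; // always ask for a replacement Evaluator
    if (failedTaskId != null) {
      runningTaskIds.remove(failedTaskId); // un-memorize the task id
    }
  }

  synchronized boolean isSchedulable(final String taskId) {
    return runningTaskIds.contains(taskId);
  }

  synchronized int evaluatorRequests() {
    return evaluatorRequests;
  }

  synchronized int runningTaskCount() {
    return runningTaskIds.size();
  }
}
```

The methods are synchronized because handlers for different Evaluators
may run on different threads; that choice is part of the sketch, not a
statement about how Vortex guards its own state.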

Note that we rely on the fact that the Driver uses a dedicated thread per
Evaluator to process its incoming events, which guarantees that the
invocations of its handlers are serialized. Because of this, I believe
RunningTaskHandler is never invoked after FailedEvaluatorHandler for the
same Evaluator, and thus no garbage task ids will be created.
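
As an illustration of the ordering guarantee described above, here is a
small model of "one dedicated thread per Evaluator" in plain Java. It is
only a sketch of the idea, not REEF's dispatch code: the class
PerEvaluatorDispatcher, its methods, and the event names are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model: every Evaluator id gets its own single-threaded
// executor, so events for one Evaluator are handled in submission order.
final class PerEvaluatorDispatcher {
  private final Map<String, ExecutorService> threads = new ConcurrentHashMap<>();
  private final List<String> handled =
      Collections.synchronizedList(new ArrayList<String>());

  // Queue an event on the thread that belongs to this Evaluator.
  void dispatch(final String evaluatorId, final String eventName) {
    threads
        .computeIfAbsent(evaluatorId, id -> Executors.newSingleThreadExecutor())
        .execute(() -> handled.add(evaluatorId + ":" + eventName));
  }

  // Stop accepting events, wait for queued ones, and return the handling order.
  List<String> shutdownAndGetLog() {
    for (final ExecutorService thread : threads.values()) {
      thread.shutdown();
    }
    for (final ExecutorService thread : threads.values()) {
      try {
        thread.awaitTermination(5, TimeUnit.SECONDS);
      } catch (final InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
      }
    }
    synchronized (handled) {
      return new ArrayList<>(handled);
    }
  }
}
```

Events for different Evaluators can still interleave in any order; only
the per-Evaluator order is fixed, which is all the bookkeeping needs.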


On Wed, Jun 22, 2016 at 10:42 PM, John Yang <> wrote:

> Hi,
> Things are the same with the Mesos runtime. It receives resource offers
> from Mesos. If the offers do not satisfy the resource requests from the
> REEF user, it declines the offers, in order to receive new offers. It keeps
> at it until it gets the right offers, which might never come.
> The best solution I can think of right now is similar to Markus's. You
> have to somehow identify this behavior at the REEF-user-level based on your
> application semantics. If we were to have this functionality at the RM
> runtime-level (reef-runtime-yarn, reef-runtime-mesos, etc.), I guess we can
> make a set of standard configuration parameters to be used in each runtime.
> However I'm not sure the benefits outweigh the added complexity, since the
> resource capacity/availability of a cluster is usually considered
> non-deterministic.
> Thanks,
> John
> On Wed, Jun 22, 2016 at 2:58 PM, Markus Weimer <> wrote:
>> On 2016-06-22 13:31, Tobin Baker wrote:
>>> My experience on the Java side has been that if insufficient
>>> resources are available to allocate all requested Evaluators, the RM
>>> will just silently keep retrying forever, with no exceptions thrown
>>> by YARN or REEF.
>> That is correct. YARN does not, presently, give "no" as an answer to an
>> unsatisfiable resource request. The only way I know to guard against it
>> is to set a timer and to give up if the needed containers can't be
>> acquired within the timeout.
>> Markus
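
For reference, the timer-based guard Markus describes in the quoted
message could be sketched roughly as follows in plain Java. This is an
assumption-laden illustration rather than REEF code: the class
AllocationGuard and its methods are hypothetical, and the application is
assumed to call onEvaluatorAllocated() from its own allocation handler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: counts allocations and reports whether all of the
// expected Evaluators arrived before a deadline.
final class AllocationGuard {
  private final CountDownLatch pending;

  AllocationGuard(final int expectedEvaluators) {
    this.pending = new CountDownLatch(expectedEvaluators);
  }

  // To be called once per allocated Evaluator.
  void onEvaluatorAllocated() {
    pending.countDown();
  }

  // Returns true if every expected Evaluator was allocated within the
  // timeout, false if the caller should give up.
  boolean awaitAll(final long timeoutMillis) {
    try {
      return pending.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      return false;
    }
  }
}
```

awaitAll blocks the calling thread, so in a real Driver it would have to
run outside the event handlers, or be replaced by a scheduled check that
fails the job if allocations are still outstanding when it fires.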
