reef-dev mailing list archives

From "Markus Weimer (JIRA)" <>
Subject [jira] [Resolved] (REEF-1343) Fix events received in case of evaluator failure
Date Mon, 25 Apr 2016 21:10:13 GMT


Markus Weimer resolved REEF-1343.
       Resolution: Fixed
    Fix Version/s: 0.15

Resolved via [#961|]

> Fix events received in case of evaluator failure
> ------------------------------------------------
>                 Key: REEF-1343
>                 URL:
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF.NET
>            Reporter: Mariia Mykhailova
>            Assignee: Andrew Chung
>            Priority: Critical
>              Labels: FT
>             Fix For: 0.15
> Investigation of REEF-1325 shows an unexpected sequence of events on the local runtime:
> * evaluator crashes with an unhandled exception (shown in evaluator.stderr and .stdout)
> * driver receives an {{IFailedEvaluator}} event which has no associated {{FailedTask}}
> * the task continues running and completes successfully
> * driver receives an {{ICompletedTask}} event.
> By design, a failed evaluator shouldn't allow a task to complete successfully.
> This can be reproduced using the {{TestPoisonedEvaluatorStartHanlder}} test.
> Update:
> The root cause is that the Evaluator does not close itself properly and lets the {{Exception}}
> propagate upwards. As a result, the {{RuntimeStopHandler}} is never invoked. Because the
> user's {{ITask}} is spun off as a fire-and-forget {{System.Threading.Tasks.Task}}, its
> execution is independent of the main Evaluator thread, so when the {{ITask}} finishes it still
> sends a Heartbeat to the Driver reporting completion, even though the Evaluator has already
> failed. The fix catches the Evaluator failure and propagates the {{Exception}} to
> {{RuntimeStopHandler}}, and properly closes the {{ContextManager}} and {{HeartbeatManager}}
> once the {{Exception}} surfaces.
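
The failure mode described in the update can be reproduced in miniature. The following is only a sketch in Python, not REEF.NET code: `HeartbeatManager`, `start_evaluator`, `run_evaluator`, and the status strings are hypothetical stand-ins chosen for illustration. It assumes the fix amounts to routing the failure into a shutdown path that reports the failure and then closes the heartbeat channel, so that a late completion message from the independent task is dropped.

```python
import threading


class HeartbeatManager:
    """Hypothetical stand-in: records the statuses that reach the driver."""

    def __init__(self):
        self._lock = threading.Lock()
        self._closed = False
        self.sent = []

    def send(self, status):
        # Once closed, late messages are dropped instead of being delivered.
        with self._lock:
            if self._closed:
                return False
            self.sent.append(status)
            return True

    def close(self):
        with self._lock:
            self._closed = True


def start_evaluator():
    # Stand-in for the evaluator start-up path that throws.
    raise RuntimeError("evaluator start handler failed")


def run_evaluator(apply_fix):
    heartbeats = HeartbeatManager()
    may_finish = threading.Event()

    def user_task():
        # Fire-and-forget work: it does not depend on the evaluator's main path.
        may_finish.wait()
        heartbeats.send("TASK_COMPLETED")

    worker = threading.Thread(target=user_task)
    worker.start()

    try:
        start_evaluator()
    except RuntimeError:
        if apply_fix:
            # Assumed shape of the fix: report the failure, then close the
            # heartbeat channel so nothing else can be reported afterwards.
            heartbeats.send("EVALUATOR_FAILED")
            heartbeats.close()
        # Without the fix, the failure never reaches a shutdown path.

    may_finish.set()  # the user task finishes after the evaluator has failed
    worker.join()
    return heartbeats.sent
```

Under these assumptions, `run_evaluator(False)` lets the completion message through even though the evaluator failed, while `run_evaluator(True)` reports only the failure.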

This message was sent by Atlassian JIRA
