reef-dev mailing list archives

From "Markus Weimer (JIRA)" <>
Subject [jira] [Commented] (REEF-1343) Fix events received in case of evaluator failure
Date Thu, 28 Apr 2016 19:59:13 GMT


Markus Weimer commented on REEF-1343:

Hmm, I fall on the side of consistency here. We should have *one* log system that contains
all the information. Adding {{Console.Write...()}} seems messy.

However, logging could be configured to also write a second file that contains only the
log entries at the {{WARNING}} and {{ERROR}} levels.
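
As a rough illustration of the "one log system, two files" idea, here is a minimal sketch in Python's standard {{logging}} module rather than in REEF.NET's own logging. The function name, logger name, and file names are all hypothetical. Note one assumption: a level threshold of {{WARNING}} also admits anything more severe than {{ERROR}}, so a strict "WARNING and ERROR only" file would need an explicit filter instead.

```python
import logging
import os
import tempfile


def configure_logging(all_path, problems_path, name="evaluator"):
    """Return a logger that writes every record to all_path and only
    WARNING-and-above records to problems_path (hypothetical names)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the sketch independent of the root logger

    # Drop handlers left over from an earlier call so records are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)

    fmt = logging.Formatter("%(levelname)s %(message)s")

    full = logging.FileHandler(all_path)  # the one complete log
    full.setLevel(logging.DEBUG)
    full.setFormatter(fmt)
    logger.addHandler(full)

    problems = logging.FileHandler(problems_path)  # the second, filtered file
    problems.setLevel(logging.WARNING)
    problems.setFormatter(fmt)
    logger.addHandler(problems)

    return logger


def demo():
    """Log three records and return the lines that landed in each file."""
    tmp = tempfile.mkdtemp()
    all_path = os.path.join(tmp, "all.log")
    problems_path = os.path.join(tmp, "problems.log")

    log = configure_logging(all_path, problems_path)
    log.info("task started")
    log.warning("heartbeat delayed")
    log.error("evaluator failed")

    for handler in log.handlers:
        handler.close()  # flush both files before reading them back

    with open(all_path) as f:
        all_lines = f.read().splitlines()
    with open(problems_path) as f:
        problem_lines = f.read().splitlines()
    return all_lines, problem_lines
```

The application code still makes a single logging call per event; which file a record ends up in is decided purely by handler configuration, which is the consistency argument made above.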

> Fix events received in case of evaluator failure
> ------------------------------------------------
>                 Key: REEF-1343
>                 URL:
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF.NET
>            Reporter: Mariia Mykhailova
>            Assignee: Andrew Chung
>            Priority: Critical
>              Labels: FT
>             Fix For: 0.15
> Investigation of REEF-1325 shows a weird sequence of events on the local runtime:
> * evaluator crashes with an unhandled exception (shown in evaluator.stderr and .stdout)
> * driver receives {{IFailedEvaluator}} event which doesn't have associated {{FailedTask}}
> * the task continues running and completes successfully
> * driver receives {{ICompletedTask}} event.
> By design, a failed evaluator shouldn't allow a task to complete successfully.
> This can be reproduced using the {{TestPoisonedEvaluatorStartHanlder}} test.
> Update:
> The root cause is the Evaluator not properly closing itself and allowing the {{Exception}}
> to propagate upwards. As a result, the {{RuntimeStopHandler}} is not invoked. Because the
> user's {{ITask}} is spun off as a fire-and-forget {{System.Threading.Task}}, its execution
> is independent of the main Evaluator thread. When the {{ITask}} finishes, it therefore sends
> a Heartbeat to the Driver reporting that it completed, even though the Evaluator has already
> failed. The fix catches the Evaluator failure and propagates the {{Exception}} to
> {{RuntimeStopHandler}}, and also properly closes the {{ContextManager}} and
> {{HeartbeatManager}} once the {{Exception}} surfaces.
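
The race described in the update can be sketched in a few lines. This is a simplified Python model, not the REEF.NET code: the {{HeartbeatManager}} class here is a hypothetical stand-in that only records messages, the worker thread plays the role of the fire-and-forget {{System.Threading.Task}}, and the {{handle_failure}} flag is an invented switch between the buggy and the fixed behavior.

```python
import threading


class HeartbeatManager:
    """Hypothetical stand-in: records the messages that reach the driver."""

    def __init__(self):
        self.sent = []
        self._closed = False
        self._lock = threading.Lock()

    def send(self, message):
        with self._lock:
            if not self._closed:  # heartbeats sent after close() are dropped
                self.sent.append(message)

    def close(self):
        with self._lock:
            self._closed = True


def run_evaluator(handle_failure):
    """Run a task on an independent thread while the evaluator itself fails.

    Returns the list of messages the driver would have received.
    """
    heartbeats = HeartbeatManager()
    evaluator_failed = threading.Event()

    def task():
        # Wait until the evaluator has failed, so the task always outlives it.
        evaluator_failed.wait()
        heartbeats.send("task completed")

    worker = threading.Thread(target=task)
    worker.start()

    try:
        # Stand-in for the exception thrown during evaluator startup.
        raise RuntimeError("poisoned evaluator start handler")
    except RuntimeError:
        heartbeats.send("evaluator failed")
        if handle_failure:
            # Sketch of the fix: shut the heartbeat channel before the
            # still-running task gets a chance to report completion.
            heartbeats.close()
    finally:
        evaluator_failed.set()

    worker.join()
    return heartbeats.sent
```

With {{handle_failure=False}} the driver sees a failed evaluator followed by a completed task, matching the event sequence reported in the issue; with {{handle_failure=True}} the late completion message is dropped.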

This message was sent by Atlassian JIRA
