reef-dev mailing list archives

From "Markus Weimer (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (REEF-1343) Fix events received in case of evaluator failure
Date Mon, 25 Apr 2016 21:10:13 GMT

     [ https://issues.apache.org/jira/browse/REEF-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Weimer resolved REEF-1343.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 0.15

Resolved via [#961|https://github.com/apache/reef/pull/961]

> Fix events received in case of evaluator failure
> ------------------------------------------------
>
>                 Key: REEF-1343
>                 URL: https://issues.apache.org/jira/browse/REEF-1343
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF.NET
>            Reporter: Mariia Mykhailova
>            Assignee: Andrew Chung
>            Priority: Critical
>              Labels: FT
>             Fix For: 0.15
>
>
> Investigation of REEF-1325 shows an odd sequence of events on the local runtime:
> * The evaluator crashes with an unhandled exception (shown in the evaluator.stderr and .stdout files).
> * The driver receives an {{IFailedEvaluator}} event that has no associated {{FailedTask}} object.
> * The task continues running and completes successfully.
> * The driver receives an {{ICompletedTask}} event.
> By design, a failed evaluator shouldn't allow a task to complete successfully.
> This can be reproduced using the {{TestPoisonedEvaluatorStartHanlder}} test.
> Update:
> The root cause is that the Evaluator does not close itself properly and lets the {{Exception}} propagate upwards, so the {{RuntimeStopHandler}} is never invoked. Because the user's {{ITask}} is spun off as a fire-and-forget {{System.Threading.Task}}, its execution is independent of the main Evaluator thread. When the {{ITask}} finishes, it therefore sends a Heartbeat to the Driver reporting that it completed, even though the Evaluator has already failed. The fix catches the Evaluator failure and propagates the {{Exception}} to {{RuntimeStopHandler}}, and it closes the {{ContextManager}} and {{HeartbeatManager}} once the {{Exception}} surfaces.
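
The race described in the update can be illustrated with a small simulation. This is only a sketch in Python, not the REEF.NET code: `HeartbeatChannel`, `run_evaluator`, `stop_handler`, and the event strings are hypothetical stand-ins for the heartbeat path, the Evaluator main thread, the stop handler, and the driver-side events. It assumes the fix amounts to closing the heartbeat channel as soon as the failure surfaces.

```python
import threading


class HeartbeatChannel:
    """Hypothetical stand-in for the evaluator-to-driver heartbeat path."""

    def __init__(self):
        self.received = []  # events the simulated driver has seen
        self._open = True
        self._lock = threading.Lock()

    def send(self, event):
        with self._lock:
            if self._open:  # anything sent after close() is dropped
                self.received.append(event)

    def close(self):
        with self._lock:
            self._open = False


def run_evaluator(handle_failure):
    """Simulate one evaluator run; return the events the driver receives."""
    channel = HeartbeatChannel()
    evaluator_failed = threading.Event()

    def user_task():
        # Finishes only after the evaluator has failed, as in the report.
        evaluator_failed.wait()
        channel.send("CompletedTask")

    # Fire-and-forget: nothing ties this thread to the evaluator's fate.
    worker = threading.Thread(target=user_task)
    worker.start()

    def stop_handler():
        # Assumed shape of the fix: shut the channel once the failure surfaces.
        channel.close()

    try:
        raise RuntimeError("unhandled exception in the evaluator")
    except RuntimeError:
        # Stands for the driver learning that the evaluator failed.
        channel.send("FailedEvaluator")
        if handle_failure:
            stop_handler()
    finally:
        evaluator_failed.set()  # let the user task finish

    worker.join()
    return channel.received


print(run_evaluator(handle_failure=False))  # ['FailedEvaluator', 'CompletedTask']
print(run_evaluator(handle_failure=True))   # ['FailedEvaluator']
```

Without the stop handler, the late completion message still reaches the driver after the failure event, which matches the sequence in the report. With it, the completion message is dropped because the channel is already closed.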



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
