reef-dev mailing list archives

From "Andrew Chung (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1343) Fix events received in case of evaluator failure
Date Wed, 27 Apr 2016 23:06:12 GMT

    [ https://issues.apache.org/jira/browse/REEF-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261155#comment-15261155 ]

Andrew Chung commented on REEF-1343:
------------------------------------

Sorry, I messed up the formatting.

> Fix events received in case of evaluator failure
> ------------------------------------------------
>
>                 Key: REEF-1343
>                 URL: https://issues.apache.org/jira/browse/REEF-1343
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF.NET
>            Reporter: Mariia Mykhailova
>            Assignee: Andrew Chung
>            Priority: Critical
>              Labels: FT
>             Fix For: 0.15
>
>
> Investigation of REEF-1325 shows an unexpected sequence of events on the local runtime:
> * the evaluator crashes with an unhandled exception (shown in the evaluator.stderr and .stdout files).
> * the driver receives an {{IFailedEvaluator}} event which doesn't have an associated {{FailedTask}} object.
> * the task continues running and completes successfully.
> * the driver receives an {{ICompletedTask}} event.
> By design, a failed evaluator shouldn't allow a task to complete successfully.
> This can be reproduced using the {{TestPoisonedEvaluatorStartHanlder}} test.
> Update:
> The root cause is that the Evaluator does not close itself properly and lets the {{Exception}}
> propagate upwards. As a result, the {{RuntimeStopHandler}} is never invoked. Because the
> user's {{ITask}} is spun off as a fire-and-forget {{System.Threading.Task}}, its execution is
> independent of the main Evaluator thread. When the {{ITask}} finishes, it therefore sends a
> Heartbeat to the Driver reporting that it completed, even though the Evaluator has already
> failed. The fix catches the Evaluator failure and propagates the {{Exception}} to
> {{RuntimeStopHandler}}, and also properly closes the {{ContextManager}} and
> {{HeartbeatManager}} once the {{Exception}} surfaces.
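
The failure mode described in the update can be illustrated with a small, language-neutral sketch. The code below is not REEF.NET code: it uses Python threads as a stand-in for the fire-and-forget {{System.Threading.Task}}, and the `Evaluator` class, the `close_on_failure` flag, and the status strings are all hypothetical names chosen for illustration. It only shows the general pattern: a worker that outlives a failed main thread still reports completion unless the failure path shuts down the heartbeat channel first.

```python
import threading


class Evaluator:
    """Hypothetical stand-in for an evaluator; not the REEF.NET implementation."""

    def __init__(self, close_on_failure):
        self.close_on_failure = close_on_failure
        self.closed = threading.Event()   # set once the heartbeat channel is shut down
        self.heartbeats = []              # what the "driver" ends up receiving
        self._lock = threading.Lock()

    def send_heartbeat(self, status):
        # A closed heartbeat channel silently drops further messages.
        with self._lock:
            if self.closed.is_set():
                return
            self.heartbeats.append(status)

    def _run_task(self, release):
        # Simulates user work that finishes only after the evaluator has failed.
        release.wait()
        self.send_heartbeat("TASK_COMPLETED")

    def run(self):
        release = threading.Event()
        # Fire-and-forget worker: its lifetime is independent of the main thread.
        worker = threading.Thread(target=self._run_task, args=(release,))
        worker.start()
        try:
            # Simulated crash on the main evaluator thread.
            raise RuntimeError("simulated evaluator failure")
        except RuntimeError:
            self.send_heartbeat("EVALUATOR_FAILED")
            if self.close_on_failure:
                # Analogue of the fix: shut the heartbeat channel on failure.
                self.closed.set()
        finally:
            release.set()   # let the worker finish so the sketch is deterministic
            worker.join()
        return self.heartbeats


buggy = Evaluator(close_on_failure=False).run()
fixed = Evaluator(close_on_failure=True).run()
print(buggy)
print(fixed)
```

With `close_on_failure=False` the driver-side list contains both the failure and a later completion, which mirrors the contradictory event sequence reported in the issue; with `close_on_failure=True` only the failure is recorded.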



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
