reef-dev mailing list archives

From "Julia (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (REEF-976) Fix broken C# Tests caused by race condition of local RM
Date Thu, 31 Mar 2016 23:56:25 GMT

    [ https://issues.apache.org/jira/browse/REEF-976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219141#comment-15219141 ]

Julia edited comment on REEF-976 at 3/31/16 11:56 PM:
------------------------------------------------------

REEF-1306 shows the error logs from the BroadcastReduce test. Sometimes, after a task is done,
a failed evaluator exception is still thrown if some network object is not disposed properly.


was (Author: juliaw):
Those two might be related.

> Fix broken C# Tests caused by race condition of local RM
> --------------------------------------------------------
>
>                 Key: REEF-976
>                 URL: https://issues.apache.org/jira/browse/REEF-976
>             Project: REEF
>          Issue Type: Bug
>          Components: REEF-Tests, REEF.NET
>            Reporter: Andrew Chung
>            Assignee: Andrew Chung
>              Labels: FT
>
> There is a race condition in REEF-Local-Runtime, and it can happen as follows:
> # The Evaluator sends the {{DONE}} message and exits its process.
> # The RM discovers that the Evaluator has ended and sends a {{DONE}} message to the Driver.
> # The Driver gets the {{DONE}} message from the RM before reading the {{DONE}} message
> from the Evaluator in its network queue.
> # The Driver calls {{FailedEvaluatorHandler}}, even though the Evaluator shut down properly.
> This can be fixed by requiring an {{ACK}} from the Driver prior to letting the Evaluator
> exit its process.
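
The ordering described above can be sketched as a small deterministic simulation. This is
only an illustration of the race and of the proposed ACK handshake, not REEF code: the
Driver class, its fields, and the run() helper are hypothetical names, and the real
runtime delivers these messages asynchronously rather than in a fixed order.

```python
from collections import deque


class Driver:
    """Toy driver (hypothetical, not the REEF API)."""

    def __init__(self):
        self.network_queue = deque()   # messages sent by evaluators
        self.completed = set()         # evaluators whose DONE was read
        self.failed = []               # evaluators reported as failed

    def drain_network_queue(self):
        """Read evaluator messages; return the ids that should get an ACK."""
        acked = []
        while self.network_queue:
            evaluator_id, message = self.network_queue.popleft()
            if message == "DONE":
                self.completed.add(evaluator_id)
                acked.append(evaluator_id)
        return acked

    def on_rm_done(self, evaluator_id):
        """The RM reports that the evaluator process has ended."""
        if evaluator_id not in self.completed:
            # Stand-in for the FailedEvaluatorHandler call.
            self.failed.append(evaluator_id)


def run(require_ack):
    """Replay the sequence from the issue; return the evaluators marked failed."""
    driver = Driver()
    evaluator_id = "evaluator-1"

    # Step 1: the evaluator sends DONE over the network.
    driver.network_queue.append((evaluator_id, "DONE"))

    if require_ack:
        # The evaluator stays alive until the driver has read DONE and
        # acknowledged it, so the RM cannot report the exit any earlier.
        acked = driver.drain_network_queue()
        assert evaluator_id in acked

    # Steps 2-3: the process exits, the RM notices, and its DONE reaches
    # the driver before the driver has read its network queue.
    driver.on_rm_done(evaluator_id)
    driver.drain_network_queue()
    return driver.failed
```

Under these assumptions, run(False) marks the evaluator as failed even though it sent
DONE, while run(True) leaves the failed list empty because the Driver has already read
the Evaluator's DONE by the time the RM reports the exit.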



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
