reef-dev mailing list archives

From "Andrew Chung (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1406) Fix TestEvaluatorWithActiveContextImmediatePoison failures with unattached context
Date Mon, 20 Jun 2016 23:19:57 GMT

    [ https://issues.apache.org/jira/browse/REEF-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15340648#comment-15340648
] 

Andrew Chung commented on REEF-1406:
------------------------------------

I'm fairly sure the root cause is the following:
The C# Evaluator fails and writes a heartbeat message to Java. The heartbeat reaches the Java
side but is still in the network buffer, waiting to be read. Meanwhile, the local
ResourceManager that monitors the Evaluator process detects the exit and notifies the Driver
that the Evaluator has ended, causing a status conflict (the expected status was RUNNING). We had
the same problem on closing the Evaluator, which we fixed by having the Driver ACK the Evaluator
on close; however, the Evaluator should not wait for an ACK in the failure case. One solution
is for the Driver to drain the network buffer upon Evaluator failure to verify that the heartbeat
has indeed arrived, but IIRC the necessary APIs are not yet exposed for this
to work.
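
To make the ordering problem concrete, here is a minimal Java sketch of the race and of the
"drain first" idea. It is only an illustration under assumptions: EvaluatorStatusSketch,
Heartbeat, heartbeatArrived and onProcessExit are hypothetical names, not REEF APIs, and an
in-memory queue stands in for the network buffer.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: none of these types or methods are REEF APIs.
final class EvaluatorStatusSketch {
    enum State { RUNNING, FAILED }

    // Stand-in for a heartbeat carrying the Evaluator's self-reported status.
    static final class Heartbeat {
        final State reported;
        final String failedContextId; // null when no context is attached

        Heartbeat(State reported, String failedContextId) {
            this.reported = reported;
            this.failedContextId = failedContextId;
        }
    }

    // Models heartbeats that have arrived but have not been read yet.
    private final Queue<Heartbeat> unreadHeartbeats = new ArrayDeque<>();
    private State state = State.RUNNING;
    private String failedContextId;

    void heartbeatArrived(Heartbeat hb) {
        unreadHeartbeats.add(hb);
    }

    // Called when the process monitor reports that the Evaluator exited.
    // Returns the status the Driver believes at that moment.
    State onProcessExit(boolean drainFirst) {
        if (drainFirst) {
            Heartbeat hb;
            while ((hb = unreadHeartbeats.poll()) != null) {
                state = hb.reported;
                failedContextId = hb.failedContextId;
            }
        }
        return state;
    }

    String failedContextId() {
        return failedContextId;
    }

    public static void main(String[] args) {
        for (boolean drain : new boolean[] {false, true}) {
            EvaluatorStatusSketch driver = new EvaluatorStatusSketch();
            driver.heartbeatArrived(new Heartbeat(State.FAILED, "ctx-1"));
            State seen = driver.onProcessExit(drain);
            System.out.println("drainFirst=" + drain + " status=" + seen
                + " failedContext=" + driver.failedContextId());
        }
    }
}
```

Without draining, the exit notification is handled while the Driver still believes the
Evaluator is RUNNING and no failed context is attached; with draining, the pending heartbeat
is applied first, so the failure and its context are known when the exit is processed.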

> Fix TestEvaluatorWithActiveContextImmediatePoison failures with unattached context
> ----------------------------------------------------------------------------------
>
>                 Key: REEF-1406
>                 URL: https://issues.apache.org/jira/browse/REEF-1406
>             Project: REEF
>          Issue Type: Sub-task
>          Components: REEF.NET
>    Affects Versions: 0.15
>            Reporter: Mariia Mykhailova
>
> {{TestEvaluatorWithActiveContextImmediatePoison}} fails transiently, typically in
> AppVeyor. From the logs of [~markus.weimer]'s repro, the evaluator fails (correctly) but
> doesn't have the failed context attached (incorrectly; we expect a context to be attached).
> Most likely the failure happens too soon (with 0 msec delay after the {{ContextConfiguration.OnContextStart}}
> event), before the information about the context has time to propagate to the evaluator. We need
> to check whether this can be fixed in REEF code, or only in test code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
