reef-dev mailing list archives

From "Markus Weimer (JIRA)" <j...@apache.org>
Subject [jira] [Created] (REEF-1338) Race condition in Evaluator shutdown
Date Wed, 13 Apr 2016 18:57:25 GMT
Markus Weimer created REEF-1338:
-----------------------------------

             Summary: Race condition in Evaluator shutdown
                 Key: REEF-1338
                 URL: https://issues.apache.org/jira/browse/REEF-1338
             Project: REEF
          Issue Type: Bug
          Components: REEF.NET
            Reporter: Markus Weimer


During the [pull request|https://github.com/apache/reef/pull/940] review of [REEF-1312], we
noticed a rare race condition during Evaluator shutdown. It surfaced in 1 of 11 test runs,
as the following failure:

{noformat}
Org.Apache.REEF.Tests.Functional.IMRU.IMRUMapperCountTest.TestIMRUMapperCountOnLocalRuntime

Expected number of contexts to close (4) differs from actual number of success indicators (8)
Expected: True
Actual:   False

   at Org.Apache.REEF.Tests.Functional.ReefFunctionalTest.ValidateSuccessForLocalRuntime(Int32 numberOfContextsToClose, Int32 numberOfTasksToFail, Int32 numberOfEvaluatorsToFail, String testFolder) in D:\src\reef\lang\cs\Org.Apache.REEF.Tests\Functional\ReefFunctionalTest.cs:line 179
   at Org.Apache.REEF.Tests.Functional.IMRU.IMRUMapperCountTest.TestIMRUMapperCountOnLocalRuntime() in D:\src\reef\lang\cs\Org.Apache.REEF.Tests\Functional\IMRU\IMRUMapperCountTest.cs:line 38
{noformat}

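The test failure was traced to the Evaluator ending up in a bad state: it received a control
message from the Driver after it had already finished, threw an exception, and reported FAILED
in its next heartbeat: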

{noformat}
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Error: 0 : 2016-04-13T10:24:03.5192372-07:00 0007 ERROR: evaluator Node-1-1460568240750 failed with exceptionencountered error [System.InvalidOperationException: Received a control message from Driver after Evaluator is done.] with mesage [Received a control message from Driver after Evaluator is done.] and stack trace []
Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2016-04-13T10:24:03.5197049-07:00 0007 INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[FAILED], context_status=[], timestamp=[1460568243519], recoveryFlag =[False].
{noformat}
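
To make the suspected interleaving concrete, here is a minimal sketch in Python. It is not REEF code: the class `SketchEvaluator`, its methods, and the `tolerate_late_messages` switch are all hypothetical names invented for illustration. It replays the ordering the log suggests (the Evaluator finishes, then a Driver control message arrives) and shows how a strict "no messages after done" check turns a clean shutdown into a FAILED status. Whether ignoring late messages is an appropriate fix for REEF is an open question that this sketch does not settle.

```python
import threading
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


class SketchEvaluator:
    """Hypothetical stand-in for an evaluator runtime; not REEF's actual code."""

    def __init__(self, tolerate_late_messages):
        self._lock = threading.Lock()
        self._tolerate = tolerate_late_messages
        self.state = State.RUNNING

    def shutdown(self):
        # The evaluator finishes its work and marks itself done.
        with self._lock:
            if self.state is State.RUNNING:
                self.state = State.DONE

    def on_control_message(self, message):
        # Called on the thread that delivers driver messages.
        with self._lock:
            if self.state is State.RUNNING:
                return "handled"
            if self._tolerate and self.state is State.DONE:
                # Assumed mitigation: drop messages that lost the race.
                return "ignored"
            # Mirrors the logged symptom: raise and report FAILED.
            self.state = State.FAILED
            raise RuntimeError(
                "Received a control message from Driver after Evaluator is done."
            )


def replay_late_message(tolerate_late_messages):
    """Deterministically replay the bad ordering: shutdown first, message second."""
    evaluator = SketchEvaluator(tolerate_late_messages)
    evaluator.shutdown()
    try:
        evaluator.on_control_message("close context")
    except RuntimeError:
        pass
    return evaluator.state


if __name__ == "__main__":
    print(replay_late_message(False))  # strict check
    print(replay_late_message(True))   # late message tolerated
```

In a real run the two calls come from different threads, so the ordering depends on timing. That would be consistent with the failure appearing in only some test runs.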

The complete runtime folder is available for download [here|https://onedrive.live.com/redir?resid=5801726772BFC3DA!273888&authkey=!ACMrVHlIHAHvCi8&ithint=file%2czip].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
