reef-dev mailing list archives

From "Markus Weimer (JIRA)" <>
Subject [jira] [Created] (REEF-1338) Race condition in Evaluator shutdown
Date Wed, 13 Apr 2016 18:57:25 GMT
Markus Weimer created REEF-1338:

             Summary: Race condition in Evaluator shutdown
                 Key: REEF-1338
             Project: REEF
          Issue Type: Bug
          Components: REEF.NET
            Reporter: Markus Weimer

During the pull request review of [REEF-1312], we noticed a rare race condition
during Evaluator shutdown. It surfaced in one out of 11 runs of the tests:


Expected number of contexts to close (4) differs from actual number of success indicators
(8)\r\nExpected: True\r\nActual:   False

   at Org.Apache.REEF.Tests.Functional.ReefFunctionalTest.ValidateSuccessForLocalRuntime(Int32
numberOfContextsToClose, Int32 numberOfTasksToFail, Int32 numberOfEvaluatorsToFail, String
testFolder) in D:\src\reef\lang\cs\Org.Apache.REEF.Tests\Functional\ReefFunctionalTest.cs:line
   at Org.Apache.REEF.Tests.Functional.IMRU.IMRUMapperCountTest.TestIMRUMapperCountOnLocalRuntime()
in D:\src\reef\lang\cs\Org.Apache.REEF.Tests\Functional\IMRU\IMRUMapperCountTest.cs:line 38

The root of the test failure has been traced to the Evaluator being in a bad state:

 Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Error: 0 : 2016-04-13T10:24:03.5192372-07:00
 0007 ERROR: evaluator Node-1-1460568240750 failed with exceptionencountered error [System.InvalidOperationException:
 Received a control message from Driver after Evaluator is done.] with mesage [Received a control
 message from Driver after Evaluator is done.] and stack trace []

 Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2016-04-13T10:24:03.5197049-07:00
 0007 INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[],
 evaluator_status=[FAILED], context_status=[], timestamp=[1460568243519], recoveryFlag =[False].
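
The ordering problem suggested by the log can be illustrated with a small toy model. This is only a sketch, and it is not REEF code: `ToyEvaluator`, `EvaluatorDoneError` and `run` are hypothetical names, and the model assumes that the only thing that matters is whether the Driver's control message arrives before or after the Evaluator has marked itself done. It replays the two possible interleavings deterministically instead of using real threads.

```python
class EvaluatorDoneError(RuntimeError):
    """Hypothetical stand-in for the InvalidOperationException in the log."""


class ToyEvaluator:
    """Toy model of an Evaluator's shutdown state; not REEF's implementation."""

    def __init__(self):
        self.status = "RUNNING"

    def mark_done(self):
        # Only a running evaluator can finish cleanly.
        if self.status == "RUNNING":
            self.status = "DONE"

    def on_control_message(self, message):
        if self.status == "RUNNING":
            return "handled"
        # Assumed behaviour, modelled on the log above: a late control
        # message turns an otherwise clean shutdown into a failure.
        self.status = "FAILED"
        raise EvaluatorDoneError(
            "Received a control message from Driver after Evaluator is done."
        )


def run(order):
    """Replay one interleaving of events and return the final status."""
    evaluator = ToyEvaluator()
    for step in order:
        try:
            if step == "done":
                evaluator.mark_done()
            else:
                evaluator.on_control_message(step)
        except EvaluatorDoneError:
            pass  # the toy evaluator has already recorded the failure
    return evaluator.status


if __name__ == "__main__":
    print(run(["close", "done"]))  # message arrives first
    print(run(["done", "close"]))  # message arrives after shutdown
```

In this model the same two events end in DONE or FAILED depending only on their order, which would be consistent with a failure that shows up in a small fraction of runs.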

The complete runtime folder is available for download [here|!273888&authkey=!ACMrVHlIHAHvCi8&ithint=file%2czip]

