reef-dev mailing list archives

From "Mariia Mykhailova (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1417) Kill Evaluators in .NET functional tests after 40 seconds
Date Mon, 06 Jun 2016 18:51:20 GMT

    [ https://issues.apache.org/jira/browse/REEF-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316973#comment-15316973 ]

Mariia Mykhailova commented on REEF-1417:
-----------------------------------------

This use of real time is not in the distributed system itself but in the tests (and only
the local ones at that). Besides, we're already relying on real time in the
{{ReefFunctionalTest}} class, where we wait up to 60 seconds for the logs, and we have
REEF-1184 to limit the amount of real time spent on an individual test. So it should be fine.

> Make the longest test run time a constant.
Yes, since it will be used only in {{ReefFunctionalTest}} (where the log is read) it will
be a constant in that class.

> Use IClock time units instead of seconds.
For this we need to use {{IClock}} instead of {{RuntimeClock}} in {{PoisonedEventHandler}};
see REEF-1069.
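
As a rough illustration of the timing involved, here is a minimal sketch in Python rather than REEF code. It assumes the 40-second poison delay proposed in the issue and the existing 60-second log wait. Every name in it ({{POISON_DELAY_SECONDS}}, {{LOG_WAIT_SECONDS}}, {{POISON_MARKER}}, {{classify_run}}) and the marker text are hypothetical and do not come from the REEF codebase.

```python
# Hypothetical constants; the names are illustrative, not taken from REEF.
POISON_DELAY_SECONDS = 40   # evaluator is made to crash after this long
LOG_WAIT_SECONDS = 60       # how long the test keeps retrying to read the log

# Hypothetical text assumed to appear in the log when the poison fires.
POISON_MARKER = "poisoned event fired"


def classify_run(log_lines, expect_long_run=False):
    """Classify a test run from its log contents.

    log_lines is None when the log could not be read within LOG_WAIT_SECONDS.
    """
    if log_lines is None:
        return "log unavailable"
    poisoned = any(POISON_MARKER in line for line in log_lines)
    if poisoned and not expect_long_run:
        return "failed: evaluator ran past the poison delay"
    return "inspect log messages"


# The poison has to fire before the test gives up on reading the log;
# otherwise the driver may still be holding the log locked.
assert POISON_DELAY_SECONDS < LOG_WAIT_SECONDS
```

The point of keeping the poison delay below the log wait is that a hung evaluator then shows up as a poisoning message in a readable log instead of as a log read failure.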

> Kill Evaluators in .NET functional tests after 40 seconds
> ---------------------------------------------------------
>
>                 Key: REEF-1417
>                 URL: https://issues.apache.org/jira/browse/REEF-1417
>             Project: REEF
>          Issue Type: Test
>          Components: REEF.NET
>            Reporter: Mariia Mykhailova
>
> When running O.A.R.Tests.Functional, I often observe transient test failures with “Cannot
> read from log file” error message. I believe this indicates that evaluator doesn’t exit
> within 60 seconds from the start of the test, driver keeps waiting for it, and keeps its log
> locked (and after 60 seconds tests stop retrying to read log). This failure correlates with
> Evaluator.exe processes spawned by {{vstest.executionengine.exe}} left running after the end
> of the tests, which keep files under {{lang\cs\bin\x64\Debug\O.A.R.Tests\REEF_LOCAL_RUNTIME…}}
> locked.
> I suggest we try to remedy this by poisoning one of events handled at evaluator side,
> so that after, say, 40 seconds evaluator crashes no matter what. In normal scenarios our tests
> should be done much faster than that, under 30 seconds, so it shouldn’t mask a real failure.
> This way we’ll be able to see a real error based on the messages written (or not written)
> to the logs, instead of log unavailability message. If we see a poisoning message in the log,
> it’s also a test failure if we don’t expect the test to run so long. And this will spare
> us the need to kill runaway Evaluator process manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
