reef-dev mailing list archives

From "Mariia Mykhailova (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1625) Fix TestFailMapperEvaluatorsOnDispose failures in AppVeyor
Date Tue, 25 Oct 2016 23:59:59 GMT

    [ https://issues.apache.org/jira/browse/REEF-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606856#comment-15606856
] 

Mariia Mykhailova commented on REEF-1625:
-----------------------------------------

Sometimes we get {{Actual: 6}}. I suspect what happens here is the following.

The test is supposed to fail an evaluator after all tasks have completed, so that IMRU fault
tolerance (FT) doesn't start a retry. We use {{Dispose}} to simulate the failure at that point,
since we don't want to modify IMRU code and therefore need a task-initiated failure.

However, we don't wait for all tasks to complete before we start disposing of them. Each task
is disposed of immediately after it reports completion, following the normal REEF task lifecycle.
So there is a race condition: if all tasks complete before the ones with failure injected
get disposed of, the test succeeds; but if one of the tasks with failure injected completes early
and proceeds to dispose, the system gets an evaluator failure before the remaining task completions
and goes on to retry.
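
The race described above can be modelled in a few lines. This is only a rough sketch in Python, not REEF code: the function name, the task names and the completion times are all hypothetical, and it assumes a failing task is disposed of at the instant it reports completion.

```python
def retry_triggered(completion_times, failing_tasks):
    """Return True if an evaluator failure arrives before every task has completed.

    completion_times: hypothetical mapping of task name -> completion time.
    failing_tasks: names of the tasks whose Dispose has failure injected.
    Assumption: a task is disposed of as soon as it reports completion.
    """
    last_completion = max(completion_times.values())
    first_failure = min(completion_times[task] for task in failing_tasks)
    # A retry starts if some task is still running when the first failure happens.
    return first_failure < last_completion


times = {"mapper-1": 1.0, "mapper-2": 2.0, "mapper-3": 3.0}
print(retry_triggered(times, {"mapper-3"}))  # failing task finishes last
print(retry_triggered(times, {"mapper-1"}))  # failing task finishes early
```

Under this model the outcome depends only on whether a failing task happens to finish before the slowest task, which would explain why the assertion fails only some of the time.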

This is a bit tricky to fix. I see two options:
* analyze the number of retries done and amend our test verification to account for them.
But this is imprecise, because we don't know how many tasks had a chance to complete before
the failed evaluator event. So we can only check that the number of failed evaluators = 2 * numberOfRetriesDone
(i.e. at the last retry there were also 2 failed evaluators) and that the job succeeded. Also,
there is a non-zero probability of the failing task being fast every time (this can be reduced by
using only 1 failure each time instead of 2).
* delay the failure. Can we do a short {{Sleep}} before the failure in the failing evaluators? This
would also make the tests faster than they are now, because there wouldn't be a retry involved. Synchronizing
via the driver with the completion of all other evaluators would bring in a lot of complexity, which I'd
rather avoid.
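
Both options can be sketched as small checks. Again this is illustrative Python rather than REEF code, and every name in it is hypothetical. It assumes the retry count includes the initial attempt, so that a run with no retry still yields the expected 2 failed evaluators, and it assumes completion times are known, which they are not in the real test.

```python
FAILURES_PER_ATTEMPT = 2  # the test injects failure into 2 evaluators


def verification_passes(failed_evaluators, attempts, job_succeeded):
    """Option 1: accept any run whose failure count matches its attempt count.

    Assumption: `attempts` counts the initial run plus every retry.
    """
    return (
        job_succeeded
        and attempts >= 1
        and failed_evaluators == FAILURES_PER_ATTEMPT * attempts
    )


def minimum_safe_delay(completion_times, failing_tasks):
    """Option 2: smallest delay before Dispose that lets every task complete first.

    completion_times: hypothetical mapping of task name -> completion time.
    """
    last_completion = max(completion_times.values())
    first_failing = min(completion_times[task] for task in failing_tasks)
    return last_completion - first_failing


print(verification_passes(4, 2, True))
print(minimum_safe_delay({"m1": 1.0, "m2": 2.0, "m3": 3.0}, {"m1", "m2"}))
```

Since the real completion times cannot be known in advance, a {{Sleep}} in the test would have to be a generous guess rather than this computed minimum.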

> Fix TestFailMapperEvaluatorsOnDispose failures in AppVeyor
> ----------------------------------------------------------
>
>                 Key: REEF-1625
>                 URL: https://issues.apache.org/jira/browse/REEF-1625
>             Project: REEF
>          Issue Type: Sub-task
>          Components: IMRU, REEF.NET
>            Reporter: Mariia Mykhailova
>
> {noformat}
> Assert.Equal() Failure
> Expected: 2
> Actual:   4
>    at Org.Apache.REEF.Tests.Functional.IMRU.TestFailMapperEvaluatorsOnDispose.TestFailedMapperOnLocalRuntime() in C:\projects\reef\lang\cs\Org.Apache.REEF.Tests\Functional\IMRU\TestFailMapperEvaluatorsOnDispose.cs:line 66
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
