reef-dev mailing list archives

From "Mariia Mykhailova (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (REEF-1464) Fix TestTaskCloseOnLocalRuntime failures in AppVeyor
Date Fri, 02 Dec 2016 23:09:58 GMT

    [ https://issues.apache.org/jira/browse/REEF-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15716816#comment-15716816
] 

Mariia Mykhailova commented on REEF-1464:
-----------------------------------------

More details: I've isolated logs from the driver and the IMRU master task for a [successful test run|https://ci.appveyor.com/project/tcNickolas/reef/build/381-REEF-1464/job/635v2d8c0ls7jd87]
and a [failure|https://ci.appveyor.com/project/tcNickolas/reef/build/381-REEF-1464/job/g3kx6uamo423d3uf].

In both cases the IMRU master task receives the close event. In the successful case, it gets
an exception quickly and returns with cancellation token = true.

{{ERROR: Received exception in UpdateTaskHost with cancellation token True: [System.OperationCanceledException:
GetData operation is canceled}}

In the failure case, the task enters a loop trying to connect to one of the tasks that are
already closed, and spends roughly 7 extra minutes making 200 connection attempts.

{noformat}
Org.Apache.REEF.Wake.Remote.Impl.RemoteConnectionRetryHandler Information: 0 : 2016-12-02T20:53:06.7473517+00:00
0011
INFO: Retry - Count:1, Delay:00:00:01, Exception:System.Net.Sockets.SocketException (0x80004005):
No connection could be made because the target machine actively refused it 127.0.0.1:9458
...
Org.Apache.REEF.Wake.Remote.Impl.RemoteConnectionRetryHandler Information: 0 : 2016-12-02T20:59:48.7139088+00:00
0011
INFO: Retry - Count:200, Delay:00:00:01, Exception:System.Net.Sockets.SocketException (0x80004005):
No connection could be made because the target machine actively refused it 127.0.0.1:9458
...
ERROR: Received exception in UpdateTaskHost with cancellation token True: [Org.Apache.REEF.Wake.Remote.Impl.TcpClientConnectionException:
Retried 200 times but connection to endpoint 127.0.0.1:9458 failed, RetriesDone=200 --->
System.Net.Sockets.SocketException: No connection could be made because the target machine
actively refused it 127.0.0.1:9458
{noformat}

It looks like our cancellation works differently depending on which stage of an iteration
the master task is in. I need to look into this further, but so far it looks like a genuine
problem with the IMRU code, not just a poorly written test.
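
For illustration only, here is a minimal sketch of a retry loop that observes cancellation between attempts, which is the behavior the failing run appears to lack. It is written in Python and is not the REEF C# code. Every name in it (`connect_with_retry`, `cancel_event`, `OperationCanceled`, `RetriesExhausted`) is hypothetical and does not correspond to `RemoteConnectionRetryHandler`. Whether the real handler has access to a cancellation token at all is an open question that still needs investigation.

```python
import threading


class OperationCanceled(Exception):
    """Raised when cancellation is observed before or between attempts."""


class RetriesExhausted(Exception):
    """Raised when every connection attempt was refused."""


def connect_with_retry(connect, max_retries, delay_seconds, cancel_event):
    """Call connect() up to max_retries times, giving up early on cancellation.

    connect        -- hypothetical zero-argument callable that returns a
                      connection or raises ConnectionRefusedError
    cancel_event   -- threading.Event standing in for a cancellation token
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        # Check cancellation before each attempt instead of only after
        # all retries have been used up.
        if cancel_event.is_set():
            raise OperationCanceled(f"canceled before attempt {attempt}")
        try:
            return connect()
        except ConnectionRefusedError as error:
            last_error = error
        # Event.wait acts as a cancellable sleep: it returns True as soon
        # as the event is set, and False if the delay elapses first.
        if cancel_event.wait(delay_seconds):
            raise OperationCanceled(f"canceled after attempt {attempt}")
    raise RetriesExhausted(f"retried {max_retries} times") from last_error
```

With a loop shaped like this, a close event that arrives while the peer is already gone would end the retries after at most one delay interval, rather than after all 200 attempts.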

> Fix TestTaskCloseOnLocalRuntime failures in AppVeyor
> ----------------------------------------------------
>
>                 Key: REEF-1464
>                 URL: https://issues.apache.org/jira/browse/REEF-1464
>             Project: REEF
>          Issue Type: Sub-task
>          Components: REEF.NET
>            Reporter: Mariia Mykhailova
>            Assignee: Mariia Mykhailova
>
> {{O.A.R.Tests.Functional.IMRU.IMRUCloseTaskTest.TestTaskCloseOnLocalRuntime}} fails frequently
in AppVeyor test runs. The error is typically "Cannot read from log file" with test runtime
65-70 seconds, and according to our current thinking in REEF-1417 this is caused by Evaluator(s)
still running after 60 seconds of test execution. 
> [In one case|https://ci.appveyor.com/project/ApacheSoftwareFoundation/reef/build/641-master/tests]
the test completed in 38 seconds, with error message
> {noformat}
> Assert.Equal() Failure
> Expected: 4
> Actual:   5
> at 
> Assert.Equal(numTasks, failedCount + completedCount);
> {noformat}
> In [successful run|https://ci.appveyor.com/project/ApacheSoftwareFoundation/reef/build/644-master/tests],
the test takes 30 seconds.
> We need to investigate whether the IMRU job itself takes longer than 60 seconds on AppVeyor
machines, or whether the Evaluator doesn't close properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
