ignite-dev mailing list archives

From "Sergey Chugunov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (IGNITE-5115) Investigation of failing tests of coordinator node failure
Date Fri, 28 Apr 2017 14:49:04 GMT
Sergey Chugunov created IGNITE-5115:

             Summary: Investigation of failing tests of coordinator node failure 
                 Key: IGNITE-5115
                 URL: https://issues.apache.org/jira/browse/IGNITE-5115
             Project: Ignite
          Issue Type: Task
          Components: messaging
            Reporter: Sergey Chugunov
             Fix For: 2.1

Tests *customEventCoordinatorFailure1/2* from *TcpDiscoverySelfTest* are flaky on TC and sometimes
hang with the following assertion in logs:
Exception in thread "tcp-disco-msg-worker-#5245%tcp.TcpDiscoverySelfTest0%" java.lang.AssertionError
	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.removeNode(TcpDiscoveryNodesRing.java:353)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeFailedMessage(ServerImpl.java:4670)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2567)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2366)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6485)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2456)
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

It seems that this happens because the tests drop the connections of *TcpCommunicationSpi*
on the coordinator node with the *simulateNodeFailure* method.
At the same time the tests leave *TcpDiscoverySpi* operational; it receives the subsequent NodeFailed
message and throws the assertion error shown above.
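
The suspected sequence can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ignite code: the class ToyDiscoveryNode and all of its methods are hypothetical, and the report does not show which condition the assertion at TcpDiscoveryNodesRing.java:353 checks. The sketch assumes the ring refuses to remove the local node and shows how a node whose communication layer was dropped can still process a NodeFailed message about itself through discovery.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model of one node; hypothetical names, not the Ignite API. */
class ToyDiscoveryNode {
    private final String localId;
    private final Set<String> ring = new LinkedHashSet<>();
    private boolean commConnected = true;

    ToyDiscoveryNode(String localId, String... remoteIds) {
        this.localId = localId;
        ring.add(localId);
        for (String id : remoteIds)
            ring.add(id);
    }

    /** Drops only the communication layer; discovery keeps running. */
    void simulateCommFailure() {
        commConnected = false;
    }

    boolean isCommConnected() {
        return commConnected;
    }

    /** Discovery-side handler: still invoked after the communication failure. */
    void onNodeFailed(String failedId) {
        // ASSUMPTION: the ring never removes the local node. The real check
        // behind the reported AssertionError is not shown in the report.
        if (failedId.equals(localId))
            throw new AssertionError("NodeFailed received for local node " + localId);

        ring.remove(failedId);
    }

    int ringSize() {
        return ring.size();
    }
}
```

In this model, a NodeFailed message for a remote node shrinks the ring as expected, while a NodeFailed message naming the node itself trips the assumed invariant even though its discovery side is still alive.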

The scenario looks legitimate: it is possible that CommSPI connections on the coordinator
fail for some reason while DiscoSPI connections remain healthy.

We need to investigate the situation further, determine whether the root cause is the use
of *simulateNodeFailure*, and propose a solution if the error may happen in the real

This message was sent by Atlassian JIRA
