ignite-issues mailing list archives

From "Alexey Goncharuk (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-5115) Investigation of failing tests of coordinator node failure
Date Tue, 04 Dec 2018 13:50:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708722#comment-16708722 ]

Alexey Goncharuk commented on IGNITE-5115:

[~NSAmelchev], why do we fail the old coordinator if a non-verified message is received? From
the explanation above I thought the message should have been verified by another node that
already became a coordinator.

[~yzhdanov], can you take a look? The change seems a bit dangerous to me, because we fail
the coordinator node when a non-verified message is received.

> Investigation of failing tests of coordinator node failure 
> -----------------------------------------------------------
>                 Key: IGNITE-5115
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5115
>             Project: Ignite
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Sergey Chugunov
>            Assignee: Amelchev Nikita
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.8
> Tests *customEventCoordinatorFailure1/2* from *TcpDiscoverySelfTest* are flaky on TC
> and sometimes hang with the following assertion in logs:
> {code}
> Exception in thread "tcp-disco-msg-worker-#5245%tcp.TcpDiscoverySelfTest0%" java.lang.AssertionError
> 	at org.apache.ignite.spi.discovery.tcp.internal.TcpDiscoveryNodesRing.removeNode(TcpDiscoveryNodesRing.java:353)
> 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processNodeFailedMessage(ServerImpl.java:4670)
> 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2567)
> 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.processMessage(ServerImpl.java:2366)
> 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6485)
> 	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2456)
> 	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {code}
> It seems that this happens because the tests drop the *TcpCommunicationSpi* connections
> on the coordinator node with the *simulateNodeFailure* method.
> At the same time, the tests leave *TcpDiscoverySpi* operational; it receives a subsequent
> NodeFailed message and throws the assertion error shown above.
> The situation looks legitimate, since CommSPI connections on the coordinator could
> plausibly fail for some reason while DiscoSPI connections stay healthy.
> We need to investigate the situation further, figure out whether the root cause is
> the use of *simulateNodeFailure*, and propose a solution if the error can happen in
> real life.
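
For readers unfamiliar with the failure mode, here is a minimal sketch of one plausible reading of the description above. It is not Ignite code: `ToyRing`, `NodeFailedSketch`, and `onNodeFailed` are hypothetical names, and the invariant (a ring never removes its own local node) is an assumption about what the assertion in `removeNode` guards, not something the stack trace confirms. The sketch also assumes that the NodeFailed message names the coordinator itself, which still has a live discovery worker.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical driver; none of these names exist in Ignite. */
final class NodeFailedSketch {
    /** Applies a NodeFailed message to the ring without checking whom it targets. */
    static String onNodeFailed(ToyRing ring, String failedId) {
        try {
            ring.remove(failedId);
            return "removed";
        } catch (AssertionError e) {
            // In the real trace the error escapes the message worker thread instead.
            return "assertion";
        }
    }

    public static void main(String[] args) {
        ToyRing ring = new ToyRing("coordinator");
        ring.add("node-1");

        // A NodeFailed message about another node is handled normally.
        System.out.println(onNodeFailed(ring, "node-1"));      // prints "removed"

        // A NodeFailed message about the local node trips the assumed invariant.
        System.out.println(onNodeFailed(ring, "coordinator")); // prints "assertion"
    }
}

/** Toy stand-in for a discovery ring; NOT Ignite's TcpDiscoveryNodesRing. */
final class ToyRing {
    private final String localId;
    private final Set<String> nodes = new LinkedHashSet<>();

    ToyRing(String localId) {
        this.localId = localId;
        nodes.add(localId);
    }

    void add(String nodeId) {
        nodes.add(nodeId);
    }

    /** Assumed invariant: the ring is never asked to remove its own local node. */
    boolean remove(String nodeId) {
        if (localId.equals(nodeId))
            throw new AssertionError("ring asked to remove its local node: " + nodeId);

        return nodes.remove(nodeId);
    }

    int size() {
        return nodes.size();
    }
}
```

If this reading is right, the open question in the thread is what the node should do when it sees such a message. Failing itself is one option. Ignoring the message until another coordinator has verified it is another.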

This message was sent by Atlassian JIRA
