ignite-issues mailing list archives

From "Sergey Chugunov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-8131) ZookeeperDiscoverySpiTest#testClientReconnectSessionExpire* tests fail on TC
Date Fri, 06 Jul 2018 09:50:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534641#comment-16534641 ]

Sergey Chugunov commented on IGNITE-8131:


I reviewed the change and it looks somewhat reasonable to me; the tests look fine as well. But
I still have a feeling that we are masking the problem rather than fixing its root cause (most
likely it is some kind of race, since introducing a delay makes the failure go away).

What makes me think so (again, from analysis of the attached logs) is that in the failure
example I don't see even a report of the disconnected event, as if the client was never able
to detect that it had disconnected from the topology.
And your analysis doesn't explain the lack of a disconnected event; it talks only about reconnect.

Could you please explain the sequence of events, as you understand it, in as much detail as possible?
Maybe even with references to the code.

I ask because I see in the logs that in the successful scenario the client detects the connection
loss almost immediately and switches its state to Disconnected:
[2018-06-09 20:12:35,312][INFO ][zk-internal.ZookeeperDiscoverySpiTest1-EventThread][ZookeeperClient]
ZooKeeper client state changed [prevState=Connected, newState=Disconnected]
And in the failure scenario the client does something different at probably a similar moment in time:
[2018-06-09 20:12:45,591][WARN ][zk-internal.ZookeeperDiscoverySpiTest1-EventThread][ZookeeperClient]
Failed to execute ZooKeeper operation [err=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /apacheIgnite/n/81b80f1f-744f-47d9-bd8b-5cd17c946376:a7926ce8-3713-4270-a788-1e3e8b000001:81|0000000052,
[2018-06-09 20:12:45,591][WARN ][zk-internal.ZookeeperDiscoverySpiTest1-EventThread][ZookeeperClient]
ZooKeeper operation failed, will retry [err=org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /apacheIgnite/n/81b80f1f-744f-47d9-bd8b-5cd17c946376:a7926ce8-3713-4270-a788-1e3e8b000001:81|0000000052,
retryTimeout=2000, connLossTimeout=2000, path=/apacheIgnite/n/81b80f1f-744f-47d9-bd8b-5cd17c946376:a7926ce8-3713-4270-a788-1e3e8b000001:81|0000000052,
It seems to me that in the failure scenario the client receives ConnectionLoss while executing
code that is not ready for this exception and handles it incorrectly.
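
To make this suspicion concrete, here is a minimal sketch of the handling I would expect on any code path that can see ConnectionLoss: retry while the loss is shorter than connLossTimeout, and report a disconnect once it is exceeded. This is not Ignite's ZookeeperClient and it does not use the ZooKeeper library. The class ConnLossRetrySketch, its execute method, the events list, and the stand-in ConnectionLossException are all hypothetical names chosen for illustration. The stand-in is unchecked only to keep the sketch short; the real KeeperException is checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch of retry-then-disconnect handling; not Ignite code. */
public class ConnLossRetrySketch {
    /** Simplified stand-in for KeeperException.ConnectionLossException (unchecked here). */
    public static class ConnectionLossException extends RuntimeException {
        public ConnectionLossException() {
            super("ConnectionLoss (simulated)");
        }
    }

    /** Events recorded in order, so the caller can see what was reported. */
    public final List<String> events = new ArrayList<>();

    private final long connLossTimeoutMs;
    private final long retryDelayMs;

    public ConnLossRetrySketch(long connLossTimeoutMs, long retryDelayMs) {
        this.connLossTimeoutMs = connLossTimeoutMs;
        this.retryDelayMs = retryDelayMs;
    }

    /**
     * Runs the operation, retrying on connection loss. Once the loss has lasted
     * at least connLossTimeoutMs, records DISCONNECTED and rethrows, so the
     * failure is never silently swallowed.
     */
    public <T> T execute(Supplier<T> op) {
        boolean lossSeen = false;
        long firstLossNanos = 0L;

        while (true) {
            try {
                return op.get();
            }
            catch (ConnectionLossException e) {
                long now = System.nanoTime();

                if (!lossSeen) {
                    lossSeen = true;
                    firstLossNanos = now;
                }

                long elapsedMs = (now - firstLossNanos) / 1_000_000L;

                if (elapsedMs >= connLossTimeoutMs) {
                    events.add("DISCONNECTED");
                    throw e;
                }

                events.add("RETRY");
                sleepQuietly(retryDelayMs);
            }
        }
    }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }
}
```

The point of the sketch is only that every ConnectionLoss ends in one of two visible outcomes: the operation eventually succeeds, or a disconnect is reported. In the failure log I see the retry warning but neither of these outcomes, which is why I suspect a code path that does not follow this pattern.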

Another idea here may be that on connection loss the client cannot do the necessary cleanup in
ZooKeeper, and when it establishes a new connection to ZK it cannot figure out that it has to
generate a disconnected event and make a reconnect attempt.


> ZookeeperDiscoverySpiTest#testClientReconnectSessionExpire* tests fail on TC
> ----------------------------------------------------------------------------
>                 Key: IGNITE-8131
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8131
>             Project: Ignite
>          Issue Type: Bug
>          Components: zookeeper
>            Reporter: Sergey Chugunov
>            Assignee: Denis Garus
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.7
>         Attachments: ZK_client_reconnect_failure.log, ZK_client_reconnect_success.log
> Two tests always fail on TC with the assertion
> {noformat}
> junit.framework.AssertionFailedError: Failed to wait for disconnect/reconnect event.
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.waitReconnectEvent(ZookeeperDiscoverySpiTest.java:4221)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.reconnectClientNodes(ZookeeperDiscoverySpiTest.java:4183)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.clientReconnectSessionExpire(ZookeeperDiscoverySpiTest.java:2231)
>     at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoverySpiTest.testClientReconnectSessionExpire1_1(ZookeeperDiscoverySpiTest.java:2206)
> {noformat}
> from the client disconnect/reconnect events check. Obviously the client doesn't generate these
> events as it is supposed to.
> (TC runs can be found [here|https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_IgniteZooKeeperDiscovery&branch_IgniteTests24Java8=pull%2F3730%2Fhead&tab=buildTypeStatusDiv]).
> It is possible to reproduce the test failure locally as well, but with low probability: one
> failure per 50 or even 300 successful executions.

