zookeeper-dev mailing list archives

From "Andor Molnar (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ZOOKEEPER-3157) Improve FuzzySnapshotRelatedTest to avoid flaky due to issues like connection loss
Date Fri, 05 Oct 2018 15:19:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639943#comment-16639943 ]

Andor Molnar edited comment on ZOOKEEPER-3157 at 10/5/18 3:18 PM:
------------------------------------------------------------------

[~lvfangmin] [~hanm]

I think I have a better approach for this. What do you think?

The problem is here:

{code:java}
        LOG.info("Restarting follower A to load snapshot");
        mt[followerA].shutdown();
        mt[followerA].start();
        QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTED);
{code}

I believe that when the check validates the CONNECTED state, the client hasn't yet realised
that the server went down, so it still reports CONNECTED. The check passes immediately and
the rest comes down to luck and good timing. I would add an additional check like this:

{code:java}
        LOG.info("Restarting follower A to load snapshot");
        mt[followerA].shutdown();
        QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTING);
        mt[followerA].start();
        QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTED);
{code}

This makes sure that the client is fully disconnected before the follower is restarted.
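
To illustrate why the intermediate wait matters, here is a minimal, self-contained sketch of a polling wait. It is not the actual implementation of QuorumPeerMainTest.waitForOne, which this message does not show; the class StateWaiter, the method waitForState and its parameters are hypothetical names chosen for illustration.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not part of ZooKeeper.
public class StateWaiter {

    /**
     * Polls the supplied state until it equals the expected value or the
     * timeout elapses. Returns true if the expected state was observed.
     */
    public static <S> boolean waitForState(Supplier<S> current, S expected,
                                           long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (expected.equals(current.get())) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without seeing the expected state
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }
}
```

A wait like this returns at once if the state already matches. That is why waiting only for CONNECTED right after shutdown() can succeed against the stale session state. Assuming the client is a ZooKeeper handle, something like waitForState(zk[followerA]::getState, States.CONNECTING, ...) between shutdown() and start() would close that window.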


was (Author: andorm):
[~lvfangmin] [~hanm]

I think I have a better approach for this, what do you think:

The problem is here:

{code:java}
        LOG.info("Restarting follower A to load snapshot");
        mt[followerA].shutdown();
        mt[followerA].start();
        QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTED);
{code}

I believe that the problem is when the check validates the CONNECTED state, the client has
realised yet that the server went down and it's still connected. The check goes on and the
rest is just about luck and good timing. I would add an additional check like this:

{code:java}
        LOG.info("Restarting follower A to load snapshot");
        mt[followerA].shutdown();
        QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTING);
        mt[followerA].start();
        QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTED);
{code}

Just to make sure that the client gets fully disconnected before restarting the follower.

> Improve FuzzySnapshotRelatedTest to avoid flaky due to issues like connection loss
> ----------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3157
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3157
>             Project: ZooKeeper
>          Issue Type: Test
>          Components: tests
>    Affects Versions: 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Andor Molnar
>            Priority: Minor
>             Fix For: 3.6.0
>
>
> [~hanm] noticed that the test might fail because of ConnectionLoss when calling
> getData ([here is an example|https://builds.apache.org/job/ZooKeepertrunk/198/testReport/junit/org.apache.zookeeper.server.quorum/FuzzySnapshotRelatedTest/testPZxidUpdatedWhenLoadingSnapshot]);
> we should catch this and retry to avoid flakiness.
> Internally, we 'fixed' flaky tests by adding junit.RetryRule to ZKTestCase, which is
> the base class for most of the tests. I'm not sure whether this is the right way to go, since
> it actually 'hides' the flaky tests, but it would reduce the number of flaky failures a lot
> if we're not going to tackle them in the near term, and we can still check the test history to
> find out which tests are flaky and deal with them separately. Let me know if this seems
> to provide any benefit in the short term; if so, I'll provide a patch to do that.
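
The "catch this and retry" idea from the description could be sketched as below. This is a generic, self-contained illustration and not the junit.RetryRule mentioned above, whose code is not shown in this message; the class RetrySketch, the method retry and its parameters are hypothetical.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of ZooKeeper or JUnit.
public class RetrySketch {

    /**
     * Runs the operation up to maxAttempts times, sleeping backoffMs between
     * failed attempts. Returns the first successful result, or throws
     * IllegalStateException wrapping the last failure.
     */
    public static <T> T retry(Callable<T> op, int maxAttempts, long backoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // do not retry on interrupt
                throw new IllegalStateException("interrupted while retrying", e);
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while retrying", e);
                }
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts", last);
    }
}
```

In a real test it would probably be better to retry only on KeeperException.ConnectionLossException rather than on every exception, so that genuine failures still surface on the first attempt.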



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
