curator-dev mailing list archives

From "Oliver Dain (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CURATOR-87) new LeaderLatch "jitters" after network outage
Date Fri, 14 Feb 2014 00:34:19 GMT
Oliver Dain created CURATOR-87:
----------------------------------

             Summary: new LeaderLatch "jitters" after network outage
                 Key: CURATOR-87
                 URL: https://issues.apache.org/jira/browse/CURATOR-87
             Project: Apache Curator
          Issue Type: Bug
          Components: Recipes
    Affects Versions: 2.2.0-incubating
         Environment: OS-X
            Reporter: Oliver Dain
            Priority: Minor


I have a LeaderLatch that has become the leader. Then all of ZooKeeper becomes unreachable
(due to network issues or similar). I know that I could keep the same LeaderLatch
instance, and it would re-negotiate leadership once ZK becomes reachable again. However, that
does not work for my particular use case, so I have to release the LeaderLatch. Later, when
ZK is available again, I allocate a new LeaderLatch instance and call start() on it. The
bug is that when await() is called on the new latch, the isLeader callback is invoked
immediately, and then, almost immediately after the await() call returns, notLeader gets called.

The following unit test reproduces the problem:

    @Test
    public void leaderLatchJitters() throws Exception {
        TestingServer server = new TestingServer();
        CuratorFramework zkClient = CuratorFrameworkFactory.newClient(server.getConnectString(),
                new ExponentialBackoffRetry(1000, 3));
        zkClient.start();


        LeaderLatch leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
        final AtomicInteger numIsLeader = new AtomicInteger(0);
        final AtomicInteger numNotLeader = new AtomicInteger(0);

        LeaderLatchListener lll = new LeaderLatchListener() {
            @Override
            public void isLeader() {
                log.debug("isLeader called");
                numIsLeader.incrementAndGet();
            }

            @Override
            public void notLeader() {
                log.debug("notLeader called");
                numNotLeader.incrementAndGet();
            }
        };

        leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());

        leaderLatch.start();
        leaderLatch.await();
        assertTrue(leaderLatch.hasLeadership());
        assertEquals(1, numIsLeader.get());
        assertEquals(0, numNotLeader.get());

        // Shut down the server, wait for us to lose the lock, then restart
        File zkTmpDir = server.getTempDirectory();
        int zkServerPort = server.getPort();
        server.stop();

        while (leaderLatch.hasLeadership()) {
            log.debug("Waiting for curator to notice it's not the leader");
            Thread.sleep(100);
        }
        log.debug("Curator has noticed that it is no longer the leader");
        assertEquals(1, numNotLeader.get());
        assertEquals(1, numIsLeader.get());

        leaderLatch.close();

        // Restart ZooKeeper
        server = new TestingServer(zkServerPort, zkTmpDir);

        leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
        leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
        log.debug("Calling leaderLatch.start()");
        leaderLatch.start();

        log.debug("Trying to regain leadership");
        leaderLatch.await();
        log.debug("We have regained leadership");

        // Wait so we have time to observe the "jitter"
        Thread.sleep(100);

        assertTrue(leaderLatch.hasLeadership());
        // Bug here. numIsLeader == 3
        assertEquals(2, numIsLeader.get());
        // Bug here too, numNotLeader == 2
        assertEquals(1, numNotLeader.get());

        log.debug("calling leaderLatch.close");
        leaderLatch.close();
    }

The output from this is:

Running com.threeci.commons.zkrecipes.TransactionalLockTest
0    [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
104  [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
132  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Curator has noticed that it is no longer the leader
171  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Calling leaderLatch.start()
172  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Trying to regain leadership
1882 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
1883 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - We have regained leadership
1883 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
1885 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
2084 [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.632 sec <<< FAILURE!
java.lang.AssertionError: expected:<2> but was:<3>
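
The plain counters in the test show that extra callbacks happened, but not how close together they were. As a rough sketch only, a listener could hand each callback to a small recorder that keeps timestamps and counts isLeader -> notLeader pairs that fall within a short window. The class name LeadershipTransitionRecorder, its methods, and the window value are hypothetical and not part of Curator; the timestamps in the usage note are the ones from the output above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Curator class): records leadership callbacks
// with timestamps so that rapid isLeader -> notLeader flips can be counted.
final class LeadershipTransitionRecorder {

    // One recorded callback: leader == true for isLeader, false for notLeader.
    private static final class Transition {
        final boolean leader;
        final long atMillis;

        Transition(boolean leader, long atMillis) {
            this.leader = leader;
            this.atMillis = atMillis;
        }
    }

    private final List<Transition> transitions = new ArrayList<Transition>();

    // Call with true from isLeader() and false from notLeader().
    synchronized void record(boolean leader, long atMillis) {
        transitions.add(new Transition(leader, atMillis));
    }

    // Counts notLeader callbacks that arrive at most windowMillis after the
    // isLeader callback recorded directly before them.
    synchronized int countFlaps(long windowMillis) {
        int flaps = 0;
        for (int i = 1; i < transitions.size(); i++) {
            Transition prev = transitions.get(i - 1);
            Transition cur = transitions.get(i);
            if (prev.leader && !cur.leader
                    && cur.atMillis - prev.atMillis <= windowMillis) {
                flaps++;
            }
        }
        return flaps;
    }
}
```

Fed the six callback timestamps from the output (0, 104, 1882, 1883, 1885, 2084) with an assumed 50 ms window, countFlaps reports one flap: the isLeader at 1882 followed by the notLeader at 1883. In the test, the listener's isLeader() and notLeader() methods would call record(true, System.currentTimeMillis()) and record(false, System.currentTimeMillis()) respectively.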



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
