curator-dev mailing list archives

From "Jordan Zimmerman (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (CURATOR-87) new LeaderLatch "jitters" after network outage
Date Sat, 24 May 2014 15:40:01 GMT

     [ https://issues.apache.org/jira/browse/CURATOR-87?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jordan Zimmerman resolved CURATOR-87.
-------------------------------------

    Resolution: Not a Problem

I agree with Evaristo here. Also, please note that there has been other work on background
stability and related areas that may mitigate the OP's issues.

> new LeaderLatch "jitters" after network outage
> ----------------------------------------------
>
>                 Key: CURATOR-87
>                 URL: https://issues.apache.org/jira/browse/CURATOR-87
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.2.0-incubating
>         Environment: OS-X
>            Reporter: Oliver Dain
>            Priority: Minor
>
> I have a LeaderLatch that has become the leader. Then all of ZooKeeper becomes unreachable
> (due to network issues or something). I do know that I could keep the same LeaderLatch
> instance, and when ZK becomes reachable again it would re-negotiate leadership. However, for
> my particular use case this doesn't work and I have to release the LeaderLatch. Later, when
> ZK is available again, I allocate a new LeaderLatch instance and call start() on it. The
> bug is that when await() is called on the new latch, it immediately calls the isLeader
> callback, and then almost immediately after the await() call returns, notLeader gets called.
> The following unit test reproduces the problem:
>     @Test
>     public void leaderLatchJitters() throws Exception {
>         TestingServer server = new TestingServer();
>         CuratorFramework zkClient = CuratorFrameworkFactory.newClient(server.getConnectString(),
>                 new ExponentialBackoffRetry(1000, 3));
>         zkClient.start();
>         LeaderLatch leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         final AtomicInteger numIsLeader = new AtomicInteger(0);
>         final AtomicInteger numNotLeader = new AtomicInteger(0);
>         LeaderLatchListener lll = new LeaderLatchListener() {
>             @Override
>             public void isLeader() {
>                 log.debug("isLeader called");
>                 numIsLeader.incrementAndGet();
>             }
>             @Override
>             public void notLeader() {
>                 log.debug("notLeader called");
>                 numNotLeader.incrementAndGet();
>             }
>         };
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         leaderLatch.start();
>         leaderLatch.await();
>         assertTrue(leaderLatch.hasLeadership());
>         assertEquals(1, numIsLeader.get());
>         assertEquals(0, numNotLeader.get());
>         // Shut down the server, wait for us to lose the lock, then restart
>         File zkTmpDir = server.getTempDirectory();
>         int zkServerPort = server.getPort();
>         server.stop();
>         while (leaderLatch.hasLeadership()) {
>             log.debug("Waiting for curator to notice it's not the leader");
>             Thread.sleep(100);
>         }
>         log.debug("Curator has noticed that it is no longer the leader");
>         assertEquals(1, numNotLeader.get());
>         assertEquals(1, numIsLeader.get());
>         leaderLatch.close();
>         // Restart ZooKeeper
>         server = new TestingServer(zkServerPort, zkTmpDir);
>         leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         log.debug("Calling leaderLatch.start()");
>         leaderLatch.start();
>         log.debug("Trying to regain leadership");
>         leaderLatch.await();
>         log.debug("We have regained leadership");
>         // Wait so we have time to observe the "jitter"
>         Thread.sleep(100);
>         assertTrue(leaderLatch.hasLeadership());
>         // Bug here. numIsLeader == 3
>         assertEquals(2, numIsLeader.get());
>         // Bug here too, numNotLeader == 2
>         assertEquals(1, numNotLeader.get());
>         log.debug("calling leaderLatch.close");
>         leaderLatch.close();
>     }
> The output from this is:
> Running com.threeci.commons.zkrecipes.TransactionalLockTest
> 0    [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 104  [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> 132  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Curator has noticed that it is no longer the leader
> 171  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Calling leaderLatch.start()
> 172  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Trying to regain leadership
> 1882 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 1883 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - We have regained leadership
> 1883 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> 1885 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 2084 [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.632 sec <<< FAILURE!
> java.lang.AssertionError: expected:<2> but was:<3>
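
For readers who see the same isLeader/notLeader/isLeader sequence, one possible way to
tolerate it on the application side is to treat the callbacks as raw edge notifications and
act only once leadership has been stable for a short quiet period. The following is a minimal
sketch in plain Java under that assumption. It is not part of Curator and it is not the
resolution discussed in this issue. The class name SettledLeadership, the quietMillis
parameter, and the injected clock are hypothetical names chosen for illustration. Its
isLeader() and notLeader() methods are meant to be called from a LeaderLatchListener such as
the one in the quoted test.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not a Curator class): remembers the most recent
 * leadership callback and reports leadership only after it has been
 * stable for a configurable quiet period.
 */
final class SettledLeadership {
    private final LongSupplier clockMillis; // injected so the logic can be tested without sleeping
    private final long quietMillis;         // how long leadership must hold before we trust it
    private boolean leader;
    private long lastChangeMillis;

    SettledLeadership(LongSupplier clockMillis, long quietMillis) {
        this.clockMillis = clockMillis;
        this.quietMillis = quietMillis;
    }

    /** Call this from the listener's isLeader() callback. */
    synchronized void isLeader() {
        leader = true;
        lastChangeMillis = clockMillis.getAsLong();
    }

    /** Call this from the listener's notLeader() callback. */
    synchronized void notLeader() {
        leader = false;
        lastChangeMillis = clockMillis.getAsLong();
    }

    /** True only if the last callback was isLeader() and it is at least quietMillis old. */
    synchronized boolean isSettledLeader() {
        return leader && clockMillis.getAsLong() - lastChangeMillis >= quietMillis;
    }
}
```

In production the clock could be System::currentTimeMillis. With a quiet period of 100 ms and
the timestamps from the log above (isLeader at 1882, notLeader at 1883, isLeader at 1885),
isSettledLeader() would first return true at 1985. The quiet period trades reaction time for
stability and would need tuning per application.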



--
This message was sent by Atlassian JIRA
(v6.2#6252)
