Mailing-List: contact dev-help@curator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@curator.apache.org
Date: Sat, 24 May 2014 15:40:01 +0000 (UTC)
From: "Jordan Zimmerman (JIRA)" <jira@apache.org>
To: dev@curator.apache.org
Message-ID: <JIRA.12695116.1392337999865.14106.1400946001487@arcas>
In-Reply-To: <JIRA.12695116.1392337999865@arcas>
References: <JIRA.12695116.1392337999865@arcas>
Subject: [jira] [Resolved] (CURATOR-87) new LeaderLatch "jitters" after
 network outage
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/CURATOR-87?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jordan Zimmerman resolved CURATOR-87.
-------------------------------------

    Resolution: Not a Problem

I agree with Evaristo here. Note also that other work, on background stability among other things, may mitigate the OP's issues.

> new LeaderLatch "jitters" after network outage
> ----------------------------------------------
>
>                 Key: CURATOR-87
>                 URL: https://issues.apache.org/jira/browse/CURATOR-87
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.2.0-incubating
>         Environment: OS-X
>            Reporter: Oliver Dain
>            Priority: Minor
>
> I have a LeaderLatch that has become the leader. Then all of ZooKeeper becomes unreachable (for example, because of network issues). I know that I could keep the same LeaderLatch instance and that it would renegotiate leadership when ZK becomes reachable again. However, that doesn't work for my particular use case, so I have to release the LeaderLatch. Later, when ZK is available again, I allocate a new LeaderLatch instance and call start() on it. The bug is that when await() is called on the new latch, it immediately calls the isLeader callback, and then, almost immediately after the await() call returns, notLeader gets called.
> The following unit test reproduces the problem:
>     @Test
>     public void leaderLatchJitters() throws Exception {
>         TestingServer server = new TestingServer();
>         CuratorFramework zkClient = CuratorFrameworkFactory.newClient(server.getConnectString(),
>                 new ExponentialBackoffRetry(1000, 3));
>         zkClient.start();
>         LeaderLatch leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         final AtomicInteger numIsLeader = new AtomicInteger(0);
>         final AtomicInteger numNotLeader = new AtomicInteger(0);
>         LeaderLatchListener lll = new LeaderLatchListener() {
>             @Override
>             public void isLeader() {
>                 log.debug("isLeader called");
>                 numIsLeader.incrementAndGet();
>             }
>             @Override
>             public void notLeader() {
>                 log.debug("notLeader called");
>                 numNotLeader.incrementAndGet();
>             }
>         };
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         leaderLatch.start();
>         leaderLatch.await();
>         assertTrue(leaderLatch.hasLeadership());
>         assertEquals(1, numIsLeader.get());
>         assertEquals(0, numNotLeader.get());
>         // Shut down the server, wait for us to lose the lock, then restart
>         File zkTmpDir = server.getTempDirectory();
>         int zkServerPort = server.getPort();
>         server.stop();
>         while (leaderLatch.hasLeadership()) {
>             log.debug("Waiting for curator to notice it's not the leader");
>             Thread.sleep(100);
>         }
>         log.debug("Curator has noticed that it is no longer the leader");
>         assertEquals(1, numNotLeader.get());
>         assertEquals(1, numIsLeader.get());
>         leaderLatch.close();
>         // Restart ZooKeeper
>         server = new TestingServer(zkServerPort, zkTmpDir);
>         leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         log.debug("Calling leaderLatch.start()");
>         leaderLatch.start();
>         log.debug("Trying to regain leadership");
>         leaderLatch.await();
>         log.debug("We have regained leadership");
>         // Wait so we have time to observe the "jitter"
>         Thread.sleep(100);
>         assertTrue(leaderLatch.hasLeadership());
>         // Bug here. numIsLeader == 3
>         assertEquals(2, numIsLeader.get());
>         // Bug here too, numNotLeader == 2
>         assertEquals(1, numNotLeader.get());
>         log.debug("calling leaderLatch.close");
>         leaderLatch.close();
>     }
> The output from this is:
> Running com.threeci.commons.zkrecipes.TransactionalLockTest
> 0    [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 104  [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> 132  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Curator has noticed that it is no longer the leader
> 171  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Calling leaderLatch.start()
> 172  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Trying to regain leadership
> 1882 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 1883 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - We have regained leadership
> 1883 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> 1885 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 2084 [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.632 sec <<< FAILURE!
> java.lang.AssertionError: expected:<2> but was:<3>
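
The counts in the failing assertion can be reconstructed from the log output above. The following is a minimal sketch, not part of the original report: the class name LeaderCallbackTally and its helper are hypothetical, the timestamps are copied from the log, and the assertion time is an assumption (await() returning at about 1883 ms, followed by the 100 ms sleep in the test).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: replays the callback lines from the log and tallies them.
class LeaderCallbackTally {
    // One callback line from the log: elapsed milliseconds and which callback fired.
    static final class Event {
        final long millis;
        final boolean isLeader;

        Event(long millis, boolean isLeader) {
            this.millis = millis;
            this.isLeader = isLeader;
        }
    }

    // Timestamps and callbacks copied from the log output in the report.
    static final List<Event> LOG = Arrays.asList(
            new Event(0, true),      // isLeader called
            new Event(104, false),   // notLeader called
            new Event(1882, true),   // isLeader called
            new Event(1883, false),  // notLeader called
            new Event(1885, true),   // isLeader called
            new Event(2084, false)); // notLeader called

    // Counts the callbacks of one kind logged at or before cutoffMillis.
    static long countUpTo(long cutoffMillis, boolean isLeader) {
        return LOG.stream()
                .filter(e -> e.millis <= cutoffMillis && e.isLeader == isLeader)
                .count();
    }

    public static void main(String[] args) {
        // Assumption: await() returned at about 1883 ms and the test then slept 100 ms.
        long assertionTime = 1883 + 100;
        System.out.println("isLeader calls:  " + countUpTo(assertionTime, true));
        System.out.println("notLeader calls: " + countUpTo(assertionTime, false));
    }
}
```

Under that assumption the tally is 3 isLeader calls and 2 notLeader calls, which matches the "expected:<2> but was:<3>" failure and the values noted in the test comments. The notLeader call at 2084 ms falls after the assumed assertion time.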

