From: "Jordan Zimmerman (JIRA)"
To: dev@curator.apache.org
Reply-To: dev@curator.apache.org
Date: Sat, 24 May 2014 15:40:01 +0000 (UTC)
Subject: [jira] [Resolved] (CURATOR-87) new LeaderLatch "jitters" after network outage

     [ https://issues.apache.org/jira/browse/CURATOR-87?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jordan Zimmerman resolved CURATOR-87.
-------------------------------------
    Resolution: Not a Problem

I agree with Evaristo here. Also, please note, there has been other work on background stability, etc. that may mitigate the OP's issues.

> new LeaderLatch "jitters" after network outage
> ----------------------------------------------
>
>                 Key: CURATOR-87
>                 URL: https://issues.apache.org/jira/browse/CURATOR-87
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.2.0-incubating
>         Environment: OS-X
>            Reporter: Oliver Dain
>            Priority: Minor
>
> I have a LeaderLatch that has become the leader. Then all of ZooKeeper becomes unreachable (due to network issues or something). I do know that I could maintain the same LeaderLatch instance and when ZK becomes reachable again it would re-negotiate leadership. However, for my particular use case this doesn't work and I have to release the LeaderLatch. Later, when ZK is available again, I allocate a new LeaderLatch instance and call start() on it. The bug is that when await() is called on the new latch it immediately calls the isLeader callback and then, almost immediately after the await() call returns, notLeader gets called.
> The following unit test reproduces the problem:
>     @Test
>     public void leaderLatchJitters() throws Exception {
>         TestingServer server = new TestingServer();
>         CuratorFramework zkClient = CuratorFrameworkFactory.newClient(server.getConnectString(),
>                 new ExponentialBackoffRetry(1000, 3));
>         zkClient.start();
>         LeaderLatch leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         final AtomicInteger numIsLeader = new AtomicInteger(0);
>         final AtomicInteger numNotLeader = new AtomicInteger(0);
>         LeaderLatchListener lll = new LeaderLatchListener() {
>             @Override
>             public void isLeader() {
>                 log.debug("isLeader called");
>                 numIsLeader.incrementAndGet();
>             }
>             @Override
>             public void notLeader() {
>                 log.debug("notLeader called");
>                 numNotLeader.incrementAndGet();
>             }
>         };
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         leaderLatch.start();
>         leaderLatch.await();
>         assertTrue(leaderLatch.hasLeadership());
>         assertEquals(1, numIsLeader.get());
>         assertEquals(0, numNotLeader.get());
>         // Shut down the server, wait for us to lose the lock, then restart
>         File zkTmpDir = server.getTempDirectory();
>         int zkServerPort = server.getPort();
>         server.stop();
>         while (leaderLatch.hasLeadership()) {
>             log.debug("Waiting for curator to notice it's not the leader");
>             Thread.sleep(100);
>         }
>         log.debug("Curator has noticed that it is no longer the leader");
>         assertEquals(1, numNotLeader.get());
>         assertEquals(1, numIsLeader.get());
>         leaderLatch.close();
>         // Restart ZooKeeper
>         server = new TestingServer(zkServerPort, zkTmpDir);
>         leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         log.debug("Calling leaderLatch.start()");
>         leaderLatch.start();
>         log.debug("Trying to regain leadership");
>         leaderLatch.await();
>         log.debug("We have regained leadership");
>         // Wait so we have time to observe the "jitter"
>         Thread.sleep(100);
>         assertTrue(leaderLatch.hasLeadership());
>         // Bug here. numIsLeader == 3
>         assertEquals(2, numIsLeader.get());
>         // Bug here too, numNotLeader == 2
>         assertEquals(1, numNotLeader.get());
>         log.debug("calling leaderLatch.close");
>         leaderLatch.close();
>     }
> The output from this is:
> Running com.threeci.commons.zkrecipes.TransactionalLockTest
> 0 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 104 [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> 132 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Curator has noticed that it is no longer the leader
> 171 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Calling leaderLatch.start()
> 172 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Trying to regain leadership
> 1882 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 1883 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - We have regained leadership
> 1883 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> 1885 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - isLeader called
> 2084 [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - notLeader called
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.632 sec <<< FAILURE!
> java.lang.AssertionError: expected:<2> but was:<3>

--
This message was sent by Atlassian JIRA
(v6.2#6252)
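
Note on the log in the quoted report above: up to the failing assertion, the callbacks appear to arrive in the order isLeader, notLeader, isLeader, notLeader, isLeader. This assumes the assertion runs roughly 100 ms after the 1883 "We have regained leadership" line, so before the final notLeader at 2084. The counters therefore read 3 and 2, while the most recent callback still says "leader", which matches hasLeadership() returning true. The plain-Java sketch below only replays that ordering to show the difference between counting callbacks and keeping the latest state. LeadershipTracker and its method names are hypothetical and not part of Curator; the class mirrors the two LeaderLatchListener callbacks without depending on the library, and it does not explain why the extra notLeader/isLeader pair occurs.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Curator. A real listener would delegate
// its isLeader()/notLeader() callbacks to the methods below.
final class LeadershipTracker {
    // Atomics are used because the log shows callbacks arriving on
    // different threads (main-EventThread and ConnectionStateManager-0).
    private final AtomicBoolean leader = new AtomicBoolean(false);
    private final AtomicInteger gained = new AtomicInteger(0);
    private final AtomicInteger lost = new AtomicInteger(0);

    // Record a transition to "leader".
    void isLeader() {
        leader.set(true);
        gained.incrementAndGet();
    }

    // Record a transition to "not leader".
    void notLeader() {
        leader.set(false);
        lost.incrementAndGet();
    }

    // Latest state reported by the callbacks.
    boolean hasLeadership() { return leader.get(); }

    int timesGained() { return gained.get(); }

    int timesLost() { return lost.get(); }

    public static void main(String[] args) {
        LeadershipTracker tracker = new LeadershipTracker();
        // Callback order read from the log, up to the failing assertion
        // (an assumption about timing, as stated in the note above).
        tracker.isLeader();
        tracker.notLeader();
        tracker.isLeader();
        tracker.notLeader();
        tracker.isLeader();
        System.out.println("leader=" + tracker.hasLeadership()
                + " gained=" + tracker.timesGained()
                + " lost=" + tracker.timesLost());
        // prints "leader=true gained=3 lost=2"
    }
}
```

A caller that only needs to know whether it currently holds leadership could check hasLeadership() on such a tracker, whereas assertions on exact callback counts, as in the quoted test, are sensitive to any additional transition pair.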