curator-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Evaristo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CURATOR-87) new LeaderLatch "jitters" after network outage
Date Fri, 23 May 2014 21:40:01 GMT

    [ https://issues.apache.org/jira/browse/CURATOR-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007745#comment-14007745
] 

Evaristo commented on CURATOR-87:
---------------------------------

Hi there,

In my opinion this test is not following the API contract

LeaderLatch does not require to be closed when connection with ZK is closed. You can use ConnectionListener
to track that errors, and once you are reconnected the same LeaderLatch can be used.

I created a test similar to your test without closing the latch and the etst is always passed

	@Test
	public void leaderLatchJittersBis() throws Exception {
		TestingServer server = new TestingServer();
		CuratorFramework zkClient = CuratorFrameworkFactory.newClient(server.getConnectString(),
				new ExponentialBackoffRetry(1000, 3));
		zkClient.start();
		LeaderLatch leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
		final AtomicInteger numIsLeader = new AtomicInteger(0);
		final AtomicInteger numNotLeader = new AtomicInteger(0);
		LeaderLatchListener lll = new LeaderLatchListener() {
			@Override
			public void isLeader()
			{ System.out.println("isLeader called"); numIsLeader.incrementAndGet(); }
			@Override
			public void notLeader()
			{ System.out.println("notLeader called"); numNotLeader.incrementAndGet(); }
		};
		leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
		leaderLatch.start();
		leaderLatch.await();
		assertTrue(leaderLatch.hasLeadership());
		assertEquals(1, numIsLeader.get());
		assertEquals(0, numNotLeader.get());
		// Shut down the server, wait for us to lose the lock, then restart
		File zkTmpDir = server.getTempDirectory();
		int zkServerPort = server.getPort();
		server.stop();
		while (leaderLatch.hasLeadership())
		{ System.out.println("Waiting for curator to notice it's not the leader"); Thread.sleep(100);
}
		System.out.println("Curator has noticed that it is no longer the leader");
		assertEquals(1, numNotLeader.get());
		assertEquals(1, numIsLeader.get());
		// Restart ZooKeeper
		server = new TestingServer(zkServerPort, zkTmpDir);
		
		System.out.println("Trying to regain leadership");
		leaderLatch.await();
		System.out.println("We have regained leadership");
		// Wait so we have time to observe the "jitter"
		Thread.sleep(100);
		assertTrue(leaderLatch.hasLeadership());
		// Bug here. numIsLeader == 3
		assertEquals(2, numIsLeader.get());
		// Bug here too, numNotLeader == 2
		assertEquals(1, numNotLeader.get());
		System.out.println("calling leaderLatch.close");
		leaderLatch.close();
	}


On the other hand, your original test is always passing to me if I add a Thread.sleep(100)

		leaderLatch.close();
		Thread.sleep(100);
		// Restart ZooKeeper
		server = new TestingServer(zkServerPort, zkTmpDir);


My feeling is that your test is generating some kind of "artiificial"  race condition. close()
method for the LeaderLatch is calling background operations so you can see that as an async
operation.

What do you think Jordan? I can take a deeper look if you think this is really a bug


> new LeaderLatch "jitters" after network outage
> ----------------------------------------------
>
>                 Key: CURATOR-87
>                 URL: https://issues.apache.org/jira/browse/CURATOR-87
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.2.0-incubating
>         Environment: OS-X
>            Reporter: Oliver Dain
>            Priority: Minor
>
> I have a LeaderLatch that has become the leader. Then all of ZooKeeper becomes unreachable
(due to network issues or something). I do know that I could maintain the same LeaderLatch
instance and when ZK becomes reachable again it would re-negotiate leadership. However, for
my particular use case this doesn't work and I have to release the LeaderLatch. Later, when
ZK is available again I allocate a new LeaderLatch instance and call start() and on it. The
bug is that this when await() is called on the new latch it immediately calls the isLeader
callback and then almost immediately after the await() call returns, notLeader gets called.
> The following unit test reproduces the problem:
>  @Test
>     public void leaderLatchJitters() throws Exception {
>         TestingServer server = new TestingServer();
>         CuratorFramework zkClient = CuratorFrameworkFactory.newClient(server.getConnectString(),
>                 new ExponentialBackoffRetry(1000, 3));
>         zkClient.start();
>         LeaderLatch leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         final AtomicInteger numIsLeader = new AtomicInteger(0);
>         final AtomicInteger numNotLeader = new AtomicInteger(0);
>         LeaderLatchListener lll = new LeaderLatchListener() {
>             @Override
>             public void isLeader() {
>                 log.debug("isLeader called");
>                 numIsLeader.incrementAndGet();
>             }
>             @Override
>             public void notLeader() {
>                 log.debug("notLeader called");
>                 numNotLeader.incrementAndGet();
>             }
>         };
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         leaderLatch.start();
>         leaderLatch.await();
>         assertTrue(leaderLatch.hasLeadership());
>         assertEquals(1, numIsLeader.get());
>         assertEquals(0, numNotLeader.get());
>         // Shut down the server, wait for us to lose the lock, then restart
>         File zkTmpDir = server.getTempDirectory();
>         int zkServerPort = server.getPort();
>         server.stop();
>         while (leaderLatch.hasLeadership()) {
>             log.debug("Waiting for curator to notice it's not the leader");
>             Thread.sleep(100);
>         }
>         log.debug("Curator has noticed that it is no longer the leader");
>         assertEquals(1, numNotLeader.get());
>         assertEquals(1, numIsLeader.get());
>         leaderLatch.close();
>         // Restart ZooKeeper
>         server = new TestingServer(zkServerPort, zkTmpDir);
>         leaderLatch = new LeaderLatch(zkClient, "/path/to/lock");
>         leaderLatch.addListener(lll, MoreExecutors.sameThreadExecutor());
>         log.debug("Calling leaderLatch.start()");
>         leaderLatch.start();
>         log.debug("Trying to regain leadership");
>         leaderLatch.await();
>         log.debug("We have regained leadership");
>         // Wait so we have time to observe the "jitter"
>         Thread.sleep(100);
>         assertTrue(leaderLatch.hasLeadership());
>         // Bug here. numIsLeader == 3
>         assertEquals(2, numIsLeader.get());
>         // Bug here too, numNotLeader == 2
>         assertEquals(1, numNotLeader.get());
>         log.debug("calling leaderLatch.close");
>         leaderLatch.close();
> }
> The output from this is:
> Running com.threeci.commons.zkrecipes.TransactionalLockTest
> 0    [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  -
isLeader called
> 104  [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest
 - notLeader called
> 132  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Curator has
noticed that it is no longer the leader
> 171  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Calling leaderLatch.start()
> 172  [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - Trying to regain
leadership
> 1882 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  -
isLeader called
> 1883 [main] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  - We have regained
leadership
> 1883 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  -
notLeader called
> 1885 [main-EventThread] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest  -
isLeader called
> 2084 [ConnectionStateManager-0] DEBUG com.threeci.commons.zkrecipes.TransactionalLockTest
 - notLeader called
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.632 sec <<<
FAILURE!
> java.lang.AssertionError: expected:<2> but was:<3>



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message