curator-user mailing list archives

From Jordan Zimmerman <>
Subject Re: Sometimes leader election ends up in two leaders
Date Thu, 15 May 2014 01:37:25 GMT
I don’t think the situation you describe can happen. Let’s walk through this:

— Time N — 
We have a single, correct leader and 2 nodes:
	lock-0000000240 (client #1, the leader)
	lock-0000000241 (client #2)

— Time N + D1 — 
ZooKeeper leader instance is restarted. Shortly thereafter, both Curator clients will exit
their doWork() loops and mark their nodes for deletion. Because the connection has failed,
though, the 2 nodes still exist:
	lock-0000000240 (waiting to be deleted)
	lock-0000000241 (waiting to be deleted)

— Time N + D2 — 
The ZooKeeper quorum is repaired and the clients start a doWork() loop again. At this point,
there can be 2, 3 or 4 nodes, depending on how far cleanup and re-acquisition have progressed.
	lock-0000000240 (waiting to be deleted)
	lock-0000000241 (waiting to be deleted)
Neither of the instances will achieve leadership until the nodes 240/241 are deleted.
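
The blocking behaviour described above can be sketched with a toy model of the ordering rule the lock recipe relies on (the class and method names below are mine, not Curator's): a client holds the lock only when its sequential node sorts first among the children, so the stale 240/241 nodes block any freshly created node.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of the sequential-node ordering rule (not actual Curator
// code): the client whose node sorts first among the children holds the
// lock; everyone else must wait for the earlier nodes to be deleted.
public class LockOrder {
    public static boolean holdsLock(List<String> children, String ourNode) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted);
        return !sorted.isEmpty() && sorted.get(0).equals(ourNode);
    }
}
```

For example, while lock-0000000240 and lock-0000000241 are still waiting to be deleted, a newly created lock-0000000242 cannot acquire leadership.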

Of course, there may be something else that’s causing you to see 2 leaders. A while back
I discovered that rolling config changes can do it. Or, there’s something else going on in Curator.


From: stibi
Date: May 14, 2014 at 11:39:48 AM
Subject:  Sometimes leader election ends up in two leaders  


I'm using Curator's Leader Election recipe (2.4.2) and found a very hard-to-reproduce issue
which could lead to a situation where both clients become leader.

Let's say 2 clients are competing for leadership, client #1 is currently the leader and zookeeper
maintains the following structure under the leaderPath:

  |- _c_a8524f0b-3bd7-4df3-ae19-cef11159a7a6-lock-0000000240 (client #1)
  |- _c_b5bdc75f-d2c9-4432-9d58-1f7fe699e125-lock-0000000241 (client #2)

The autoRequeue flag is set to true for both clients.

Let's trigger a leader election by restarting the ZooKeeper leader.

When this happens, both clients will lose the connection to the ZooKeeper ensemble and will
try to re-acquire the LeaderSelector's mutex. Eventually (after the negotiated session timeout)
the ephemeral zNodes under /leaderPath will be deleted.

The problem occurs when ephemeral zNode deletions interleave with mutex acquisition.
Client #1 can observe that both zNodes (240 and 241) have already been deleted; /leaderPath has
no children, so it acquires the mutex successfully.

On the other hand, client #2 can observe that both zNodes still exist, so it starts to watch
zNode #240 (LockInternals.internalLockLoop():315). In a short period of time the watcher will
be notified about the zNode's deletion, so client #2 reenters LockInternals.internalLockLoop().

What is really strange is that the getSortedChildren() call in LockInternals:284 can still return
zNode #241, so client #2 succeeds in acquiring the mutex (LockInternals:287).

The result is two clients, both believing they are the leader, while /leaderPath contains only
one client's zNode.
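
The race can be illustrated with a toy model (the names below are hypothetical, not Curator internals): each client evaluates leadership against its own snapshot of the children, so two mutually inconsistent snapshots can both pass the "my node sorts first" check at the same time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy illustration of the reported race (hypothetical names, not
// Curator internals): each client checks whether its own node sorts
// first in whatever snapshot of the children it happens to observe.
public class StaleViewRace {
    static boolean seesItselfFirst(List<String> snapshot, String ourNode) {
        List<String> sorted = new ArrayList<>(snapshot);
        Collections.sort(sorted);
        return !sorted.isEmpty() && sorted.get(0).equals(ourNode);
    }

    // True when both clients, judging from their own snapshots,
    // conclude that they hold the lock: split brain.
    public static boolean splitBrain(List<String> view1, String node1,
                                     List<String> view2, String node2) {
        return seesItselfFirst(view1, node1) && seesItselfFirst(view2, node2);
    }
}
```

In the scenario above, client #1 saw an empty /leaderPath and created a fresh node (say lock-0000000242, its whole snapshot), while client #2's stale snapshot still contained only lock-0000000241; both checks pass, and both clients become leader.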

Did you encounter similar problems before? Do you have any ideas on how to prevent such race
conditions? I can think of a solution: The leader should watch its zNode under /leaderPath
and interrupt leadership when the zNode gets deleted.
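
The suggested safeguard might be sketched like this (a self-contained simulation with hypothetical names, not Curator's API): the leader holds leadership only while its own lock node exists, and a deletion callback, which a real implementation would wire to a ZooKeeper watch on that node, revokes leadership immediately.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the proposed safeguard (hypothetical names, not Curator
// API): leadership is revoked the moment the watch on the leader's own
// zNode reports that the node has been deleted.
public class SelfWatchingLeader {
    private final AtomicBoolean leader = new AtomicBoolean(false);

    public void takeLeadership() {
        leader.set(true);
    }

    // Invoked when the watch on our own zNode fires for a deletion.
    public void onOwnNodeDeleted() {
        leader.set(false);
    }

    public boolean isLeader() {
        return leader.get();
    }
}
```

One plausible wiring in real Curator code would be setting a watch on the leader's own lock node (e.g. via checkExists().usingWatcher(...)) and calling the revocation path from the watcher; the details would depend on the recipe's internals.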

Thank you,