curator-user mailing list archives

From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: Sometimes leader election ends up in two leaders
Date Thu, 22 May 2014 12:39:48 GMT
> What guarantees that zNode 241 will be deleted prior to the (successful) attempt of client
> #2 to reacquire the mutex using zNode 241?
Because that’s how the lock works. As long as 241 exists, no other client will consider
itself as having the mutex. 

> reacquire the mutex using zNode 241?
This is not what happens. The client will try to acquire using a _different_ znode. Are you
thinking that 241 is re-used? It’s not. 
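
To make that concrete, each acquire attempt does roughly the following (a simplified sketch, not the actual LockInternals source; the path and the already-started "client" are just illustrative):

// Simplified sketch (not the actual Curator source). Every acquire attempt
// creates a brand-new protected, ephemeral-sequential node, so an old
// sequence number such as 241 is never re-used.
String ourPath = client.create()
        .withProtection()                            // adds the _c_<uuid>- prefix you see in the node names
        .withMode(CreateMode.EPHEMERAL_SEQUENTIAL)   // ZooKeeper assigns the next sequence number
        .forPath("/leaderPath/lock-");
// ourPath now looks like /leaderPath/_c_<uuid>-lock-0000000242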

-JZ


From: stibi sulyan.tibor@gmail.com
Reply: stibi sulyan.tibor@gmail.com
Date: May 22, 2014 at 7:26:57 AM
To: Jordan Zimmerman jordan@jordanzimmerman.com, user@curator.apache.org user@curator.apache.org
Subject:  Re: Sometimes leader election ends up in two leaders  

Hi!

Thanks for the quick response.
About this step:

— Time N + D2 — 
The ZooKeeper quorum is repaired and the nodes start a doWork() loop again. At this point,
there can be 2, 3, or 4 nodes, depending on the timing. 
lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)
lock-0000000242
lock-0000000243
Neither of the instances will achieve leadership until the nodes 240/241 are deleted.

What guarantees that zNode 241 will be deleted prior to the (successful) attempt of client
#2 to reacquire the mutex using zNode 241?
AFAIK node deletion is a background operation, and a retry policy controls how often a deletion
attempt is made (even for guaranteed deletes). With unlucky timing, the deletion of zNode 241 can
happen only after the mutex has already been acquired. In that case the mutex was never released
by the original leader, yet once the zNodes are finally deleted the other client gets elected as
leader as well.
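
Just to show what I mean (an illustrative sketch only; the connect string, retry values and lockNodePath are made up):

// Illustrative sketch only - the connect string, retry values and lockNodePath
// are made up. A guaranteed delete that fails (e.g. because the connection is
// down) is retried in the background according to the client's retry policy,
// so it can complete at an arbitrary later time.
CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",
        new ExponentialBackoffRetry(1000, 3));
client.start();

client.delete().guaranteed().forPath(lockNodePath);   // e.g. the full path of lock-0000000241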

Thanks,
Tibor



On Thu, May 15, 2014 at 3:37 AM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:
I don’t think the situation you describe can happen. Let’s walk through this:

— Time N — 
We have a single, correct leader and 2 nodes:
lock-0000000240
lock-0000000241

— Time N + D1 — 
The ZooKeeper leader instance is restarted. Shortly thereafter, both Curator clients exit their
doWork() loops and mark their nodes for deletion. Because the connection has failed, though, the
2 nodes still exist:
lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)

— Time N + D2 — 
The ZooKeeper quorum is repaired and the nodes start a doWork() loop again. At this point,
there can be 2, 3, or 4 nodes, depending on the timing. 
lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)
lock-0000000242
lock-0000000243
Neither of the instances will achieve leadership until the nodes 240/241 are deleted.
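
In essence the check amounts to this (a simplified sketch with illustrative names, not the real LockInternals code):

// Simplified sketch with illustrative names (not the real LockInternals code).
// A client only considers itself leader when its own node sorts first among
// the children of /leaderPath.
List<String> children = client.getChildren().forPath("/leaderPath");
Collections.sort(children, bySequenceNumber);      // bySequenceNumber: compare the trailing 10-digit suffix
boolean hasTheLock = children.get(0).equals(ourNodeName);
// While lock-...240 or lock-...241 still exist they sort ahead of 242/243,
// so neither of the new nodes can win until the stale ones are deleted.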

Of course, there may be something else that’s causing you to see 2 leaders. A while back
I discovered that rolling config changes can do it (http://zookeeper-user.578899.n2.nabble.com/Rolling-config-change-considered-harmful-td7578761.html).
Or, there’s something else going on in Curator. 

-Jordan


From: stibi sulyan.tibor@gmail.com
Reply: user@curator.apache.org user@curator.apache.org
Date: May 14, 2014 at 11:39:48 AM
To: user@curator.apache.org user@curator.apache.org
Subject:  Sometimes leader election ends up in two leaders

Hi!

I'm using Curator's Leader Election recipe (2.4.2) and found a very hard-to-reproduce issue
which could lead to a situation where both clients become leader.

Let's say 2 clients are competing for leadership, client #1 is currently the leader and zookeeper
maintains the following structure under the leaderPath:

/leaderPath
  |- _c_a8524f0b-3bd7-4df3-ae19-cef11159a7a6-lock-0000000240 (client #1)
  |- _c_b5bdc75f-d2c9-4432-9d58-1f7fe699e125-lock-0000000241 (client #2)

The autoRequeue flag is set to true for both clients.
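
For reference, my setup boils down to this (trimmed-down sketch; the connect string, retry values and the listener are simplified, error handling is omitted):

// Trimmed-down sketch of my setup; connect string, retry values and the
// listener (a LeaderSelectorListener) are simplified.
CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181",
        new ExponentialBackoffRetry(1000, 3));
client.start();

LeaderSelector selector = new LeaderSelector(client, "/leaderPath", listener);
selector.autoRequeue();   // requeue automatically after takeLeadership() returns
selector.start();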

Let's trigger a leader election by restarting the ZooKeeper leader.

When this happens, both clients will lose the connection to the ZooKeeper ensemble and will
try to re-acquire the LeaderSelector's mutex. Eventually (after the negotiated session timeout)
the ephemeral zNodes under /leaderPath will be deleted.

The problem occurs when ephemeral zNode deletions interleave with mutex acquisition.
  
Client #1 can observe that both zNodes (240 and 241) are already deleted: /leaderPath has no
children, so it acquires the mutex successfully.

On the other hand, client #2 can observe that both zNodes still exist, so it starts to watch
zNode #240 (LockInternals.internalLockLoop():315). Shortly afterwards the watcher is notified of
that zNode's deletion, so client #2 re-enters LockInternals.internalLockLoop().

What is really strange is that the getSortedChildren() call in LockInternals:284 can still return
zNode #241, so client #2 succeeds in acquiring the mutex (LockInternals:287).
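
My (simplified) reading of the loop that client #2 runs is something like this; the identifiers and the waitForDeletion helper are mine, only the line numbers refer to the 2.4.2 source:

// My simplified reading of the loop client #2 runs (identifiers and the
// waitForDeletion helper are mine, only the line numbers refer to the source).
while (!haveTheLock) {
    List<String> children = getSortedChildren();       // LockInternals:284 - can still return 241
    int ourIndex = children.indexOf(ourNodeName);
    if (ourIndex == 0) {
        haveTheLock = true;                            // LockInternals:287 - acquires while 241 is still visible
    } else {
        String previous = children.get(ourIndex - 1);  // e.g. lock-0000000240
        // set a watch on the previous node and block until it is deleted (LockInternals:315),
        // then loop around and re-read the children
        waitForDeletion("/leaderPath/" + previous);
    }
}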

The result is two clients, both believing they are the leader, while /leaderPath contains only the
single zNode for client #1.

Have you encountered similar problems before? Do you have any ideas on how to prevent such race
conditions? One solution I can think of: the leader should watch its own zNode under /leaderPath
and interrupt leadership when that zNode gets deleted.
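
Roughly along these lines (only a rough idea, the identifiers are made up):

// Only a rough idea, the identifiers are made up: the leader watches its own
// lock node and gives up leadership if that node disappears underneath it.
final CountDownLatch leadershipLost = new CountDownLatch(1);

Watcher selfWatcher = new Watcher() {
    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            leadershipLost.countDown();          // tell takeLeadership() to return
        }
    }
};
client.checkExists().usingWatcher(selfWatcher).forPath(ownLockNodePath);

// inside takeLeadership(): keep working until the latch fires, then return,
// which relinquishes leadership
leadershipLost.await();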

Thank you,
Tibor


