curator-user mailing list archives

From: Matt Brown <Matthew.Br...@citrix.com>
Subject: Re: Confused about the LeaderLatch - what should happen on ConnectionState.SUSPENDED and ConnectionState.LOST ?
Date: Wed, 19 Mar 2014 13:00:11 GMT
> My assumption and desired behaviour is that the user should suspend operations - which
> implies to me that its leadership status is uncertain. (I am holding off all persistent operations
> for example).
> But -I think- this also implies that no-one else can become leader yet - we either have
> the old-leader still be leader, and no one else, or then the old-leader disappeared and we
> are in effect leaderless for some time.

I think the second part of this is incorrect: if client 1 has lost its ZooKeeper connection,
it doesn't imply that other clients have also lost theirs.

So it would be correct for the former leader, which now has a suspended connection, to cease its
leader activities, but other clients that are still connected to the ensemble may have become
the leader due to the suspension of client 1's connection.

If client 1 still acted as if it might be the leader when its connection became suspended,
then you could have two leaders: client 1 and whichever client with a healthy ZK connection
grabbed the latch.

From the perspective of the zookeeper ensemble, it can't know if your client is suffering
from a "short connection break" or if it has died altogether – so the client's leader role
should be treated as lost in either case.
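
The rule above (stop acting as leader on both SUSPENDED and LOST) can be sketched roughly as follows. This is only an illustration under stated assumptions: ConnState is a local stand-in for Curator's ConnectionState so the snippet compiles on its own, and LeaderGuard is a hypothetical helper, not a Curator class.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Local stand-in for org.apache.curator.framework.state.ConnectionState,
// used only so this sketch compiles without Curator on the classpath.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical helper (not part of Curator): tracks whether this process
// may currently perform leader-only work.
class LeaderGuard {
    private final AtomicBoolean actingAsLeader = new AtomicBoolean(false);

    // Call when the latch reports that this client has acquired leadership.
    void onElected() {
        actingAsLeader.set(true);
    }

    // Call from a connection state listener.
    void onStateChange(ConnState newState) {
        switch (newState) {
            case SUSPENDED:
            case LOST:
                // Another client with a healthy connection may hold the
                // latch by now, so give up the leader role in both cases.
                actingAsLeader.set(false);
                break;
            default:
                // CONNECTED / RECONNECTED do not restore leadership by
                // themselves; the client has to be elected again.
                break;
        }
    }

    boolean mayActAsLeader() {
        return actingAsLeader.get();
    }
}
```

In this sketch a RECONNECTED event deliberately leaves the flag false, so leader work resumes only after onElected() is called again.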

From: Robert Kamphuis <Robert.Kamphuis@supercell.com>
Reply-To: "user@curator.apache.org" <user@curator.apache.org>
Date: Wednesday, March 19, 2014 at 6:18 AM
To: "user@curator.apache.org" <user@curator.apache.org>
Cc: Robert Kamphuis <Robert.Kamphuis@supercell.com>
Subject: Confused about the LeaderLatch - what should happen on ConnectionState.SUSPENDED and ConnectionState.LOST ?


Hi,

I have been working on changing our application to work with ZooKeeper and Curator for a
while now, and am occasionally getting wrong behaviour out of my system.
The symptom I'm getting is that two servers conclude that they are the leader of a
particular task/leaderlatch at the same time, breaking everything in my application.
This does not happen too often - but often enough and it is bad enough for my application.
I can get it to occur pretty consistently by restarting the servers in our 5-server
ZooKeeper ensemble in turn,
while having multiple servers queuing up for the same leader latch.

My key question is the following:
- WHAT should a user of a leaderLatch do when the connectionState goes to suspended?

My assumption and desired behaviour is that the user should suspend operations - which implies
to me that its leadership status is uncertain. (I am holding off all persistent operations
for example).
But -I think- this also implies that no-one else can become leader yet - we either have the
old-leader still be leader, and no one else, or then the old-leader disappeared and we are
in effect leaderless for some time.
This will then be followed by
a) a reconnect - in which case the old leader can continue its stuff (and optionally double
check its leadership status) or
b) a lost - in which case the old leader lost its leadership and should release all its power
etc and try again or do something else. Someone else likely became leader in my application
by then.
The a) or b) is controlled by the SessionTimeout negotiated between the curator/zookeeper
client and zookeeper ensemble.
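
The sequence described above (pause on suspend, then either resume after a reconnect or release on a lost) could be sketched as a small state machine. This models only the desired behaviour from this message, not what LeaderLatch does; LinkEvent, WorkState and PausingLeader are hypothetical names, and the stillLeader check stands in for the "double check its leadership status" step.

```java
import java.util.function.BooleanSupplier;

// Hypothetical event and state names for this sketch only.
enum LinkEvent { SUSPENDED, RECONNECTED, LOST }
enum WorkState { ACTIVE, PAUSED, RELEASED }

// Hypothetical helper: holds off persistent operations while the
// connection is suspended and releases leadership when it is lost.
class PausingLeader {
    private final BooleanSupplier stillLeader; // assumed leadership re-check
    private WorkState state = WorkState.ACTIVE;

    PausingLeader(BooleanSupplier stillLeader) {
        this.stillLeader = stillLeader;
    }

    synchronized WorkState onEvent(LinkEvent event) {
        if (state == WorkState.RELEASED) {
            return state; // leadership already given up; needs a new election
        }
        switch (event) {
            case SUSPENDED:
                state = WorkState.PAUSED; // leadership status uncertain
                break;
            case RECONNECTED:
                // case a): continue only if the re-check confirms leadership
                state = stillLeader.getAsBoolean() ? WorkState.ACTIVE : WorkState.RELEASED;
                break;
            case LOST:
                state = WorkState.RELEASED; // case b): session timed out
                break;
        }
        return state;
    }
}
```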

Is my thinking correct here?
And if so, why does Curator's LeaderLatch.handleStateChange(ConnectionState newState)
handle both in the same way: setLeadership(false)?

In my application, a leadership change is a pretty big event, due to the amount of work the
code does, and I really want leadership to persist across short connection breaks - e.g. when one
of the zookeeper servers crashes. Leadership should only be swapped on a session timeout -
e.g. a broken application node, or a long network break between the server and the zookeeper servers.
I am thinking of using 90 seconds as the session timeout (so as to survive e.g. longer GC pauses and
similar without a leadership change) - maybe even longer.

Is this a bug in leader latch, or should I use something other than leader latch, or implement
my desired behaviour in a new recipe?

kind regards,
Robert Kamphuis

PS. using ZooKeeper 3.4.5 and Curator 2.4.0

