Mailing-List: contact user-help@curator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@curator.apache.org
Received-SPF: pass (nike.apache.org: domain of Matthew.Brown@citrix.com
 designates 66.165.176.63 as permitted sender)
From: Matt Brown <Matthew.Brown@citrix.com>
To: "user@curator.apache.org" <user@curator.apache.org>, Robert Kamphuis
	<Robert.Kamphuis@supercell.com>
Subject: Re: Confused about the LeaderLatch - what should happen on
 ConnectionState.SUSPENDED and ConnectionState.LOST ?
Thread-Topic: Confused about the LeaderLatch - what should happen on
 ConnectionState.SUSPENDED and ConnectionState.LOST ?
Thread-Index: AQHPQ02nBaPebgwBQkqWz/L4p+Avx5rokfsA
Date: Wed, 19 Mar 2014 13:00:11 +0000
Message-ID: <CF4F0BAB.4F4B%matthew.brown@citrix.com>
References: <2BF68D8D-7BFB-42C7-AC62-8079539BFBCF@supercell.com>
 <300227C1-7D23-48D7-8B4A-F86FBC51DDA5@supercell.com>
In-Reply-To: <300227C1-7D23-48D7-8B4A-F86FBC51DDA5@supercell.com>
Accept-Language: en-US
Content-Language: en-US
user-agent: Microsoft-MacOutlook/14.3.9.131030
Content-Type: multipart/alternative;
	boundary="_000_CF4F0BAB4F4Bmatthewbrowncitrixcom_"
MIME-Version: 1.0

--_000_CF4F0BAB4F4Bmatthewbrowncitrixcom_
Content-Type: text/plain; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable

> My assumption and desired behaviour is that the user should suspend
> operations - which implies to me that its leadership status is
> uncertain. (I am holding off all persistent operations for example).
> But -I think- this also implies that no-one else can become leader yet -
> we either have the old-leader still be leader, and no one else, or then
> the old-leader disappeared and we are in effect leaderless for some time.

I think the second part of this is incorrect: if client 1 has lost its
ZooKeeper connection, that doesn't imply that other clients have also
lost theirs.

So it would be correct for the former leader, which now has a suspended
connection, to cease its leader activities. But other clients that are
still connected to the ensemble may have become the leader due to the
suspension of client 1's connection.

If client 1 kept acting as if it might still be the leader when its
connection becomes suspended, then you would have two leaders: client 1
and whichever client with a healthy ZK connection grabbed the latch.
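
As a rough illustration of that scenario, here is a toy model in plain
Java. It does not use Curator at all: Client, onSuspended and
leadersAfterPartition are made-up names, and the model simply assumes
the ensemble has handed the latch to client 2 while client 1 is cut off.

```java
class SplitBrainSketch {
    // Hypothetical client that tracks only its own belief about leadership.
    static final class Client {
        final boolean stepsDownOnSuspend;
        boolean believesLeader;

        Client(boolean stepsDownOnSuspend) {
            this.stepsDownOnSuspend = stepsDownOnSuspend;
        }

        // Called when this client's connection to the ensemble is suspended.
        void onSuspended() {
            if (stepsDownOnSuspend) {
                believesLeader = false;
            }
        }
    }

    // Counts how many clients believe they are leader after client 1 is
    // cut off and client 2 is granted the latch.
    static int leadersAfterPartition(boolean stepsDownOnSuspend) {
        Client c1 = new Client(stepsDownOnSuspend);
        Client c2 = new Client(stepsDownOnSuspend);

        c1.believesLeader = true; // client 1 holds the latch
        c1.onSuspended();         // client 1 loses its connection
        c2.believesLeader = true; // assumption: the latch passes to client 2

        int count = 0;
        for (Client c : new Client[] {c1, c2}) {
            if (c.believesLeader) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("ignore SUSPENDED:       " + leadersAfterPartition(false));
        System.out.println("step down on SUSPENDED: " + leadersAfterPartition(true));
    }
}
```

If the clients ignore the suspension, the method counts two leaders; if
they step down on suspension, it counts one.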

From the perspective of the ZooKeeper ensemble, it can't know if your
client is suffering from a "short connection break" or if it has died
altogether, so the client's leader role should be treated as lost in
either case.
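
That is the same conservative rule the question attributes to
LeaderLatch.handleStateChange. A minimal sketch of the rule follows. It
assumes a local ConnState enum standing in for Curator's ConnectionState
and a hypothetical ConservativeLeader class; neither is Curator code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Local stand-in for Curator's ConnectionState, declared here so the
// sketch compiles without Curator on the classpath.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical holder of "am I the leader?" that gives up leadership as
// soon as the connection is in doubt.
class ConservativeLeader {
    private final AtomicBoolean leader = new AtomicBoolean(false);

    // Called when the latch has actually been acquired.
    void grantLeadership() {
        leader.set(true);
    }

    // SUSPENDED and LOST are treated identically: stop acting as leader.
    void onStateChange(ConnState newState) {
        switch (newState) {
            case SUSPENDED:
            case LOST:
                leader.set(false);
                break;
            default:
                // CONNECTED / RECONNECTED: leadership is not restored here;
                // it has to be re-acquired through the latch.
                break;
        }
    }

    boolean isLeader() {
        return leader.get();
    }
}
```

Note that in this sketch a RECONNECTED event does not restore leadership
by itself; the client has to win the latch again.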

From: Robert Kamphuis <Robert.Kamphuis@supercell.com>
Reply-To: "user@curator.apache.org" <user@curator.apache.org>
Date: Wednesday, March 19, 2014 at 6:18 AM
To: "user@curator.apache.org" <user@curator.apache.org>
Cc: Robert Kamphuis <Robert.Kamphuis@supercell.com>
Subject: Confused about the LeaderLatch - what should happen on
 ConnectionState.SUSPENDED and ConnectionState.LOST ?


Hi,

I have been working on changing our application to work with ZooKeeper
and Curator for some while now, and am occasionally getting wrong
behaviour out of my system.
The symptom I'm getting is that two servers conclude that they are the
leader of a particular task/LeaderLatch at the same time, breaking
everything in my application.
This does not happen too often - but often enough, and it is bad enough
for my application. I can get it to occur pretty consistently by
restarting the servers in our 5-server ZooKeeper ensemble one at a time,
while multiple servers are queuing up for the same leader latch.

My key question is the following:
- WHAT should a user of a LeaderLatch do when the ConnectionState goes to
  SUSPENDED?

My assumption and desired behaviour is that the user should suspend
operations - which implies to me that its leadership status is
uncertain. (I am holding off all persistent operations for example).
But -I think- this also implies that no-one else can become leader yet -
we either have the old-leader still be leader, and no one else, or then
the old-leader disappeared and we are in effect leaderless for some time.
This will then be followed by
a) a reconnect - in which case the old leader can continue its stuff
   (and optionally double check its leadership status) or
b) a lost - in which case the old leader lost its leadership and should
   release all its power etc and try again or do something else. Someone
   else likely became leader in my application by then.
Whether a) or b) happens is controlled by the SessionTimeout negotiated
between the Curator/ZooKeeper client and the ZooKeeper ensemble.

Is my thinking correct here?
And if so, why does Curator's
LeaderLatch.handleStateChange(ConnectionState newState) handle both in
the same way: setLeadership(false)?

In my application, a leadership change is a pretty big event, due to the
amount of work the code does, and I really want leadership to survive
short connection breaks - e.g. when one of the ZooKeeper servers crashes.
Leadership should only be swapped on a session timeout - e.g. a broken
application node, or a long network break between the server and the
ZooKeeper servers. I am thinking of using 90 seconds as the session
timeout (so as to survive e.g. longer GC pauses and similar without a
leadership change) - maybe even longer.

Is this a bug in LeaderLatch, or should I use something other than
LeaderLatch, or implement my desired behaviour in a new recipe?

kind regards,
Robert Kamphuis

PS. Using ZooKeeper 3.4.5 and Curator 2.4.0


--_000_CF4F0BAB4F4Bmatthewbrowncitrixcom_--