curator-user mailing list archives

From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: Leader Latch recovery after suspended state
Date Mon, 10 Mar 2014 14:52:01 GMT
Please provide an implementation/fix and submit a pull request on GitHub.

I also have a related question, not only about re-using the znode: IMHO, it would be great
if LeaderLatch could survive a temporary ConnectionLossException (i.e., one due to a transient
network issue).
Technically, this is already the case. When a RECONNECTED is received, LeaderLatch will attempt
to regain leadership. The problem is that when there is a network partition, there is no way
to guarantee that you are still the leader. If there is a quorum in another segment of the cluster,
a new leader might be elected there.
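
To illustrate the point above, here is a minimal sketch of how an application might stop trusting its leadership whenever the connection state changes, and only resume after re-checking. It does not use Curator: the ConnState enum is a local stand-in that mirrors the names of Curator's connection states, and LeadershipGuard, confirmLeadership and mayActAsLeader are hypothetical names made up for this example.

```java
// Toy model only; not Curator code. All names here are hypothetical.
final class LeadershipGuard {
    // Local stand-in mirroring the names of Curator's connection states.
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private boolean leader;    // result of the last election check
    private boolean verified;  // true only if that check happened after the last state change

    // Called after the application has re-checked the election result.
    void confirmLeadership(boolean isLeader) {
        leader = isLeader;
        verified = true;
    }

    void onStateChange(ConnState state) {
        switch (state) {
            case SUSPENDED:
            case LOST:
            case RECONNECTED:
                // During a partition another segment may have elected a new
                // leader, so the old result cannot be trusted until re-checked.
                verified = false;
                break;
            case CONNECTED:
            default:
                break;
        }
    }

    boolean mayActAsLeader() {
        return leader && verified;
    }

    public static void main(String[] args) {
        LeadershipGuard guard = new LeadershipGuard();
        guard.confirmLeadership(true);
        guard.onStateChange(ConnState.SUSPENDED);
        System.out.println(guard.mayActAsLeader()); // false
        guard.onStateChange(ConnState.RECONNECTED);
        guard.confirmLeadership(true);
        System.out.println(guard.mayActAsLeader()); // true
    }
}
```

The design choice is to treat RECONNECTED the same as SUSPENDED until the election result has been read again, since reconnecting by itself does not prove that leadership was kept.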

-JZ

From: chao chu chuchao333@gmail.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 10, 2014 at 9:39:50 AM
To: user@curator.incubator.apache.org user@curator.incubator.apache.org
Subject:  Re: Leader Latch recovery after suspended state  

Hi,

Just want to see if there is any progress on this?

I also have a related question, not only about re-using the znode: IMHO, it would be great
if LeaderLatch could survive a temporary ConnectionLossException (i.e., one due to a transient
network issue).

I guess that in most cases the context switch caused by a leader re-election is quite expensive,
so we might not want to trigger it just because of some transient issue. If the current leader can
reconnect within the session timeout, it should still hold the leadership, and no leader change
should happen in between. The rationale is similar to the difference between ConnectionLossException
(which is recoverable) and SessionExpiredException (which is not recoverable).

What are your thoughts on this? Thanks a lot!

Regards,


On Wed, Aug 21, 2013 at 2:05 AM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:
Yes, I was suggesting how to patch Curator.

On Aug 20, 2013, at 10:59 AM, Calvin Jia <jia.calvin@gmail.com> wrote:

So this is currently not supported in the Curator library, but the library (specifically
LeaderLatch's reset method) is the correct/logical place to add this feature if I want it?


On Tue, Aug 20, 2013 at 10:34 AM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:
On reset() it could check to see if its node still exists. It would make the code a lot more
complicated though.
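
A rough sketch of the difference, using a toy in-memory model rather than Curator itself. LatchPathModel and its methods are hypothetical names. The model assumes that the participant holding the lowest sequence number is the leader and that the current reset drops the old node before creating a new one.

```java
import java.util.TreeMap;

// Toy in-memory model of a latch path; not Curator code.
final class LatchPathModel {
    // sequence number -> participant id, ordered like ephemeral-sequential znodes
    private final TreeMap<Long, String> nodes = new TreeMap<>();
    private long nextSequence = 0;

    long create(String participant) {
        long sequence = nextSequence++;
        nodes.put(sequence, participant);
        return sequence;
    }

    // Assumed election rule: the owner of the lowest sequence number leads.
    String leader() {
        return nodes.isEmpty() ? null : nodes.firstEntry().getValue();
    }

    // Behaviour described in the thread: always create a fresh node on reset
    // (the removal of the old node is an assumption of this model).
    long resetAlwaysRecreate(long oldSequence, String participant) {
        nodes.remove(oldSequence);
        return create(participant);
    }

    // Suggested variant: keep the old node if it still exists and is ours.
    long resetReuseIfPresent(long oldSequence, String participant) {
        if (participant.equals(nodes.get(oldSequence))) {
            return oldSequence;
        }
        return create(participant);
    }

    public static void main(String[] args) {
        LatchPathModel recreate = new LatchPathModel();
        long a1 = recreate.create("A");
        recreate.create("B");
        recreate.resetAlwaysRecreate(a1, "A");
        System.out.println(recreate.leader()); // B

        LatchPathModel reuse = new LatchPathModel();
        long a2 = reuse.create("A");
        reuse.create("B");
        reuse.resetReuseIfPresent(a2, "A");
        System.out.println(reuse.leader()); // A
    }
}
```

In a real patch the existence check would be an asynchronous call to ZooKeeper and would have to handle the node disappearing between the check and its use, which is where the extra complexity comes from.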

-JZ

On Aug 20, 2013, at 10:25 AM, Calvin Jia <jia.calvin@gmail.com> wrote:

A leader latch enters the SUSPENDED state after failing to receive a response from the first
ZK machine it heartbeats to (this takes two thirds of the timeout). For the last third, it tries
to contact another ZK machine. If it is successful, it enters the RECONNECTED state.
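
As a worked example of the split described above, here is a small sketch that divides a session timeout into the two-thirds window before suspension and the remaining third available for reaching another server. SessionTimeoutBudget and its method names are hypothetical, and the 30 second timeout is just an example value.

```java
import java.time.Duration;

// Arithmetic sketch only; these names are not part of Curator or ZooKeeper.
final class SessionTimeoutBudget {
    // Time without a response before the connection is treated as suspended.
    static Duration suspendAfter(Duration sessionTimeout) {
        return sessionTimeout.multipliedBy(2).dividedBy(3);
    }

    // Time left to reach another server before the session timeout elapses.
    static Duration reconnectWindow(Duration sessionTimeout) {
        return sessionTimeout.minus(suspendAfter(sessionTimeout));
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(30); // example value
        System.out.println(suspendAfter(timeout).getSeconds());    // 20
        System.out.println(reconnectWindow(timeout).getSeconds()); // 10
    }
}
```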

However, on reconnect, even though the original node it created in ZK is still there,
it will create another ephemeral-sequential node (the reset method is called). This means
it will relinquish leadership if there is another machine with a latch on the same path.

Is there any way to reconnect and reuse the original ZK node?

Thanks!






--
ChuChao