curator-dev mailing list archives

From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: How to obtain stable leader election over unstable ZK connections
Date Thu, 20 Aug 2015 17:14:56 GMT
So, should we change CURATOR-246 into a larger issue for supporting a true SESSION_LOST state
as well as pluggable error handling? If so, it would be nice to have this for 3.0.

-Jordan



On August 20, 2015 at 11:19:21 AM, Scott Blum (dragonsinth@gmail.com) wrote:

Assuming that clocks are usually not too out of step, Curator should be able to infer when
the server would have terminated the existing session based on the clock.  A little bit of
thought would need to be put into resolving the race condition when you reconnect right as
you were about to time out, in order to present a unified view of the state change, but that
doesn't seem infeasible.  This seems like exactly the kind of problem Curator should be solving.
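[Editor's sketch] The clock-based inference described above could look roughly like the following. This is a minimal illustration, not Curator code: the class `SessionExpiryEstimator` and its methods are hypothetical names, and it assumes the local clock is a fair proxy for the server's and that the timeout is counted from the moment the client noticed the disconnect (the server actually counts from the last heartbeat it received, so a real implementation would need a safety margin). Making every method synchronized is one simple way to give callers a single consistent answer when a reconnect races with the timeout.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Curator): guesses from a local clock
 * whether the server has probably expired the ZooKeeper session.
 */
final class SessionExpiryEstimator {
    private final long sessionTimeoutMs;
    private final LongSupplier clockMs;   // injected so the logic can be tested
    private long disconnectedAtMs = -1;   // -1 means "currently connected"

    SessionExpiryEstimator(long sessionTimeoutMs, LongSupplier clockMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.clockMs = clockMs;
    }

    /** Remember when the connection dropped; repeated calls keep the first time. */
    synchronized void onDisconnected() {
        if (disconnectedAtMs < 0) {
            disconnectedAtMs = clockMs.getAsLong();
        }
    }

    /**
     * Record a reconnect. Returns true if the old session is presumed to be
     * alive, false if the timeout had already elapsed on the local clock.
     */
    synchronized boolean onReconnected() {
        boolean presumedAlive = !isPresumedExpired();
        disconnectedAtMs = -1;
        return presumedAlive;
    }

    /** True only while disconnected and the full timeout has passed locally. */
    synchronized boolean isPresumedExpired() {
        return disconnectedAtMs >= 0
                && clockMs.getAsLong() - disconnectedAtMs >= sessionTimeoutMs;
    }
}
```

A caller would invoke `onDisconnected()` on SUSPENDED and `onReconnected()` on RECONNECTED, and could raise a session-lost style event whenever `isPresumedExpired()` becomes true or `onReconnected()` returns false.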

On Thu, Aug 20, 2015 at 11:21 AM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:
Yeah, in hindsight LOST isn’t useful which is why all the recipes refer to SUSPENDED. Having
a session-expired state is complicated in Curator as Curator sometimes re-creates the connection
without a ZK-generated event. So, a SESSION_LOST state would have to be inferred.

-Jordan



On August 20, 2015 at 10:13:19 AM, Scott Blum (dragonsinth@gmail.com) wrote:

Ahh... that is confusing, and seems dubiously useful.  I think 99% of the time I'd rather
get an event that represents that the session is definitely lost.

On Thu, Aug 20, 2015 at 10:53 AM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:
> Maybe I'm confused, but I thought that's what ConnectionState SUSPENDED vs.
> LOST was all about?

It’s a big source of confusion with Curator. LOST does _not_ mean the session was lost.
It means Curator has given up after retries, etc. Because Curator re-creates ZK handles internally
the notion of a “session” is more complicated than using raw ZooKeeper.



-Jordan





On August 20, 2015 at 9:50:56 AM, Scott Blum (dragonsinth@gmail.com) wrote:

Maybe I'm confused, but I thought that's what ConnectionState SUSPENDED vs.
LOST was all about?

Maybe the recipes just need to be tweaked a bit?

I always assumed ephemeral nodes would be gone on LOST but not gone if you
get a SUSPENDED followed by RECONNECTED.

The one thing I've always wondered about is what happens to Watchers on
SUSPENDED: do they all need to be re-applied, or will they still fire later
as long as you don't get LOST?

On Thu, Aug 20, 2015 at 10:41 AM, Jordan Zimmerman <
jordan@jordanzimmerman.com> wrote:

> I wonder if we can add error handling policies to Curator. Currently, the
> policy of all recipes is hard-coded to treat SUSPENDED as a type of lost
> session. We could change this to be injected like the retry policy. To
> solve this particular issue we’d also need to introduce a SESSION_LOST
> state of some type. This is complicated as Curator re-creates connections
> internally.
>
> Thoughts?
>
> -Jordan
>
>
>
> On August 20, 2015 at 2:10:52 AM, Dong Lei (donglei@microsoft.com) wrote:
>
> Hi curator-devs:
>
> We use Spark in standalone mode, in which Spark leverages Curator to manage
> ZK connections and elect a leader. Our ZooKeeper may not be very stable, and
> we sometimes get "session suspended and reconnected". The problem is that
> this kind of disconnect and reconnect triggers leader election quite
> often, and Spark's reaction to a leadership switch can be very costly.
>
> So I'm wondering whether it's possible to tolerate such failure cases
> if we can reconnect soon and the session is actually kept after the
> reconnection.
> Does such a requirement make sense to you?
>
> Any advice will be appreciated.
>
>
> Thanks
> Dong Lei
>
>
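[Editor's sketch] The proposal quoted above, injecting error handling the way a retry policy is injected, might be shaped roughly as follows. Everything here is hypothetical and self-contained: `ThreadState` only mirrors the state names used in this thread (SESSION_LOST did not exist at the time of writing), and `ErrorPolicy`, `ErrorPolicies`, and `LeaderSketch` are illustrative names, not Curator classes.

```java
/** State names as discussed in the thread; SESSION_LOST is the proposed, hypothetical one. */
enum ThreadState { CONNECTED, SUSPENDED, RECONNECTED, LOST, SESSION_LOST }

/** Hypothetical pluggable policy: should a recipe treat this state as losing the session? */
interface ErrorPolicy {
    boolean treatsAsLostSession(ThreadState newState);
}

final class ErrorPolicies {
    private ErrorPolicies() {}

    /** Behavior the thread describes as hard-coded today: SUSPENDED already counts as a loss. */
    static final ErrorPolicy SUSPENDED_IS_LOSS =
            s -> s == ThreadState.SUSPENDED
                    || s == ThreadState.LOST
                    || s == ThreadState.SESSION_LOST;

    /** Tolerant variant (an assumption): only an actual session loss counts. */
    static final ErrorPolicy SESSION_LOSS_ONLY =
            s -> s == ThreadState.SESSION_LOST;
}

/** Toy stand-in for a leader recipe that consults the injected policy. */
final class LeaderSketch {
    private final ErrorPolicy policy;
    private boolean leader = true; // pretend we already hold leadership

    LeaderSketch(ErrorPolicy policy) {
        this.policy = policy;
    }

    /** Give up leadership only when the policy says the session is gone. */
    void stateChanged(ThreadState newState) {
        if (policy.treatsAsLostSession(newState)) {
            leader = false;
        }
    }

    boolean hasLeadership() {
        return leader;
    }
}
```

With `SESSION_LOSS_ONLY`, a SUSPENDED followed by RECONNECTED leaves leadership untouched, which is the behavior Dong Lei asks for; with `SUSPENDED_IS_LOSS`, the same sequence drops leadership on the first event.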


