curator-dev mailing list archives

From Scott Blum <dragonsi...@gmail.com>
Subject Re: How to obtain stable leader election over unstable ZK connections
Date Thu, 20 Aug 2015 16:19:20 GMT
Assuming that clocks are usually not too out of step, Curator should be
able to infer when the server would have terminated the existing session
based on the clock.  A little bit of thought would need to be put into
resolving the race condition when you reconnect right as you were about to
time out, in order to present a unified view of the state change, but that
doesn't seem infeasible.  This seems like exactly the kind of problem
Curator should be solving.
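
To make the idea concrete, here is a rough sketch of the clock-based inference. It is not Curator code: SessionExpiryEstimator and its methods are hypothetical names, and it assumes the session timeout starts counting at the moment SUSPENDED is observed locally, which is a simplification of what the server actually does. The reconnect check and the state reset happen under one lock, so a reconnect that races with the timeout produces a single verdict.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper (not part of Curator): estimates from the local clock
 * whether the server has probably expired the session while we were SUSPENDED.
 */
final class SessionExpiryEstimator {
    private final long sessionTimeoutNanos;
    private boolean suspended = false;
    private long suspendedAtNanos = 0L;

    SessionExpiryEstimator(long sessionTimeoutMs) {
        this.sessionTimeoutNanos = TimeUnit.MILLISECONDS.toNanos(sessionTimeoutMs);
    }

    /** Record the first SUSPENDED of an outage; repeated calls keep the original start time. */
    synchronized void onSuspended(long nowNanos) {
        if (!suspended) {
            suspended = true;
            suspendedAtNanos = nowNanos;
        }
    }

    /** True if we are suspended and the timeout has elapsed on the local clock. */
    synchronized boolean isPresumedExpired(long nowNanos) {
        return suspended && nowNanos - suspendedAtNanos >= sessionTimeoutNanos;
    }

    /**
     * Handle a reconnect. Returns true if the old session is presumed to have
     * survived. The deadline check and the reset share one lock, so a reconnect
     * racing with the timeout yields exactly one verdict.
     */
    synchronized boolean onReconnected(long nowNanos) {
        boolean survived = !isPresumedExpired(nowNanos);
        suspended = false;
        return survived;
    }
}
```

A caller would pass System.nanoTime() for nowNanos, since a monotonic clock avoids jumps from wall-clock adjustments on the client side.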

On Thu, Aug 20, 2015 at 11:21 AM, Jordan Zimmerman <
jordan@jordanzimmerman.com> wrote:

> Yeah, in hindsight LOST isn’t useful, which is why all the recipes refer to
> SUSPENDED. Having a session-expired state is complicated in Curator as
> Curator sometimes re-creates the connection without a ZK generated event.
> So a session-lost state would have to be inferred.
>
> -Jordan
>
>
>
> On August 20, 2015 at 10:13:19 AM, Scott Blum (dragonsinth@gmail.com)
> wrote:
>
> Ahh... that is confusing, and seems dubiously useful.  I think 99% of the
> time I'd rather get an event that represents that the session is definitely
> lost.
>
> On Thu, Aug 20, 2015 at 10:53 AM, Jordan Zimmerman <
> jordan@jordanzimmerman.com> wrote:
>
>> Maybe I'm confused, but I thought that's what ConnectionState SUSPENDED
>> vs. LOST was all about?
>>
>> It’s a big source of confusion with Curator. LOST does _not_ mean the
>> session was lost. It means Curator has given up after retries, etc. Because
>> Curator re-creates ZK handles internally the notion of a “session” is more
>> complicated than using raw ZooKeeper.
>>
>>
>> -Jordan
>>
>>
>>
>>
>> On August 20, 2015 at 9:50:56 AM, Scott Blum (dragonsinth@gmail.com)
>> wrote:
>>
>> Maybe I'm confused, but I thought that's what ConnectionState SUSPENDED
>> vs. LOST was all about?
>>
>> Maybe the recipes just need to be tweaked a bit?
>>
>> I always assumed ephemeral nodes would be gone on LOST but not gone if
>> you get a SUSPENDED followed by RECONNECTED.
>>
>> The one question I've always wondered is what happens to Watchers on
>> SUSPENDED: do they all need to be re-applied, or will they still fire
>> later as long as you don't get LOST?
>>
>> On Thu, Aug 20, 2015 at 10:41 AM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>>
>> > I wonder if we can add error handling policies to Curator. Currently,
>> > the policy of all recipes is hard-coded to treat SUSPENDED as a type of
>> > lost session. We could change this to be injected like the retry policy.
>> > To solve this particular issue we’d also need to introduce a SESSION_LOST
>> > state of some type. This is complicated as Curator re-creates
>> > connections internally.
>> >
>> > Thoughts?
>> >
>> > -Jordan
>> >
>> >
>> >
>> > On August 20, 2015 at 2:10:52 AM, Dong Lei (donglei@microsoft.com)
>> > wrote:
>> >
>> > Hi curator-devs:
>> >
>> > We use Spark in standalone mode, in which Spark leverages Curator to
>> > manage ZK connections and elect a leader. Our ZooKeeper may not be very
>> > stable, and we sometimes get "session suspended and reconnected". The
>> > problem is that this kind of disconnect and reconnect triggers leader
>> > election quite often, and Spark's reaction to a leadership switch can be
>> > very costly.
>> >
>> > So I'm wondering whether it's possible to tolerate such failures if we
>> > can reconnect soon and the session is actually kept after the
>> > reconnection. Or does such a requirement make sense to you?
>> >
>> > Any advice will be appreciated.
>> >
>> >
>> > Thanks
>> > Dong Lei
>> >
>> >
>>
>>
>
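
Jordan's suggestion in the quoted thread, injecting the error handling policy the way the retry policy is injected, could look roughly like the sketch below. Everything in it is hypothetical: ConnState, ErrorPolicy, ErrorPolicies and LeaderSketch are illustration-only names rather than Curator types, and SESSION_LOST is the proposed state, not an existing one.

```java
/** Hypothetical states; SESSION_LOST is the state proposed in the thread. */
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST, SESSION_LOST }

/** Hypothetical injectable policy: which states should a recipe treat as an error? */
interface ErrorPolicy {
    boolean isErrorState(ConnState state);
}

final class ErrorPolicies {
    private ErrorPolicies() {}

    /** Behaviour described in the thread as hard-coded: SUSPENDED already counts as an error. */
    static final ErrorPolicy SUSPEND_IS_ERROR = s ->
            s == ConnState.SUSPENDED || s == ConnState.LOST || s == ConnState.SESSION_LOST;

    /** Tolerant alternative: ride out SUSPENDED and react only once the connection or session is given up. */
    static final ErrorPolicy SESSION_ONLY = s ->
            s == ConnState.LOST || s == ConnState.SESSION_LOST;
}

/** Toy stand-in for a leader election recipe that consults the injected policy. */
final class LeaderSketch {
    private final ErrorPolicy policy;
    private boolean leader = true; // assume we start out as leader

    LeaderSketch(ErrorPolicy policy) {
        this.policy = policy;
    }

    /** Give up leadership only when the policy classifies the new state as an error. */
    void stateChanged(ConnState newState) {
        if (policy.isErrorState(newState)) {
            leader = false;
        }
    }

    boolean isLeader() {
        return leader;
    }
}
```

With SESSION_ONLY, a SUSPENDED followed by RECONNECTED leaves leadership untouched, which is the behaviour Dong Lei asks about. That is only safe if some state reliably signals that the session is really gone.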
