curator-dev mailing list archives

From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: How to obtain stable leader election over unstable ZK connections
Date Fri, 21 Aug 2015 02:36:05 GMT
Done - I’d like to do the SESSION_LOST connection state. Who wants to do the pluggable error
handler? We can all work together on applying it to the recipes. This will be a lot of work.

-Jordan
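
To make the pluggable error handler idea concrete, here is a minimal sketch of how a recipe could consult an injected policy instead of hard-coding SUSPENDED as an error. Everything in it is an assumption for illustration: the `State` enum, the `ConnectionErrorPolicy` interface, the two policies, and `ToyLeader` are hypothetical names defined locally, not Curator classes.

```java
class ErrorPolicySketch {
    // Local stand-in for the states named in this thread (not Curator's ConnectionState).
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST, SESSION_LOST }

    // Hypothetical pluggable policy: decides which states a recipe treats as errors.
    interface ConnectionErrorPolicy {
        boolean isErrorState(State state);
    }

    // Mirrors the hard-coded behavior described in the thread: SUSPENDED counts as an error.
    static final ConnectionErrorPolicy STANDARD =
        s -> s == State.SUSPENDED || s == State.LOST || s == State.SESSION_LOST;

    // Tolerant alternative: only a (hypothetical) true session loss counts as an error.
    static final ConnectionErrorPolicy SESSION_ONLY =
        s -> s == State.SESSION_LOST;

    // Toy recipe that starts as leader and gives up leadership on an error state.
    static class ToyLeader {
        private final ConnectionErrorPolicy policy;
        private boolean leader = true;

        ToyLeader(ConnectionErrorPolicy policy) {
            this.policy = policy;
        }

        void stateChanged(State newState) {
            if (policy.isErrorState(newState)) {
                leader = false; // relinquish leadership
            }
        }

        boolean isLeader() {
            return leader;
        }
    }

    public static void main(String[] args) {
        ToyLeader strict = new ToyLeader(STANDARD);
        ToyLeader tolerant = new ToyLeader(SESSION_ONLY);

        // A brief disconnect followed by a reconnect on the same session.
        for (State s : new State[] { State.SUSPENDED, State.RECONNECTED }) {
            strict.stateChanged(s);
            tolerant.stateChanged(s);
        }

        System.out.println("strict still leader:   " + strict.isLeader());   // false
        System.out.println("tolerant still leader: " + tolerant.isLeader()); // true
    }
}
```

With the tolerant policy, a SUSPENDED followed by RECONNECTED does not cost leadership. The price is that the recipe keeps acting as leader while disconnected, so it is only safe for recipes that can tolerate that window.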



On August 20, 2015 at 9:09:37 PM, Cameron McKenzie (mckenzie.cam@gmail.com) wrote:

I think that could be a good way forward.  

It will require some careful thought about which situations it is OK for a  
recipe to continue operating in the SUSPENDED state.  

I have implemented something similar for stuff at work.  

On Fri, Aug 21, 2015 at 3:14 AM, Jordan Zimmerman <  
jordan@jordanzimmerman.com> wrote:  

> So, should we change CURATOR-246 into a larger issue for supporting a true  
> SESSION_LOST state as well as pluggable error handling? If so, it would be  
> nice to have this for 3.0  
>  
> -Jordan  
>  
>  
>  
> On August 20, 2015 at 11:19:21 AM, Scott Blum (dragonsinth@gmail.com)  
> wrote:  
>  
> Assuming that clocks are usually not too out of step, Curator should be  
> able to infer when the server would have terminated the existing session  
> based on the clock. A little bit of thought would need to be put into  
> resolving the race condition when you reconnect right as you were about to  
> time out, in order to present a unified view of the state change, but that  
> doesn't seem infeasible. This seems like exactly the kind of problem  
> Curator should be solving.  
>  
> On Thu, Aug 20, 2015 at 11:21 AM, Jordan Zimmerman <  
> jordan@jordanzimmerman.com> wrote:  
> Yeah, in hindsight LOST isn’t useful which is why all the recipes refer to  
> SUSPENDED. Having a session-expired state is complicated in Curator as  
> Curator sometimes re-creates the connection without a ZK generated event.  
> So, the SESSION lost would have to be inferred.  
>  
> -Jordan  
>  
>  
>  
> On August 20, 2015 at 10:13:19 AM, Scott Blum (dragonsinth@gmail.com)  
> wrote:  
>  
> Ahh... that is confusing, and seems dubiously useful. I think 99% of the  
> time I'd rather get an event that represents that the session is definitely  
> lost.  
>  
> On Thu, Aug 20, 2015 at 10:53 AM, Jordan Zimmerman <  
> jordan@jordanzimmerman.com> wrote:  
> > Maybe I'm confused, but I thought that's what ConnectionState SUSPENDED
> > vs. LOST was all about?
>
> It’s a big source of confusion with Curator. LOST does _not_ mean the  
> session was lost. It means Curator has given up after retries, etc. Because  
> Curator re-creates ZK handles internally the notion of a “session” is more  
> complicated than using raw ZooKeeper.  
>  
>  
>  
> -Jordan  
>  
>  
>  
>  
>  
> On August 20, 2015 at 9:50:56 AM, Scott Blum (dragonsinth@gmail.com)  
> wrote:  
>  
> Maybe I'm confused, but I thought that's what ConnectionState SUSPENDED vs.  
> LOST was all about?  
>  
> Maybe the recipes just need to be tweaked a bit?  
>  
> I always assumed ephemeral nodes would be gone on LOST but not gone if you
> get a SUSPENDED followed by RECONNECTED.
>
> The one thing I've always wondered is what happens to Watchers on
> SUSPENDED: do they all need to be re-applied, or will they still fire later
> as long as you don't get LOST?
>  
> On Thu, Aug 20, 2015 at 10:41 AM, Jordan Zimmerman <  
> jordan@jordanzimmerman.com> wrote:  
>  
> > I wonder if we can add error handling policies to Curator. Currently, the  
> > policy of all recipes is hard-coded to treat SUSPENDED as a type of lost  
> > session. We could change this to be injected like the retry policy. To  
> > solve this particular issue we’d also need to introduce a SESSION_LOST  
> > state of some type. This is complicated as Curator re-creates connections  
> > internally.  
> >  
> > Thoughts?  
> >  
> > -Jordan  
> >  
> >  
> >  
> > On August 20, 2015 at 2:10:52 AM, Dong Lei (donglei@microsoft.com) wrote:
> >  
> > Hi curator-devs:  
> >  
> > We use Spark in standalone mode, in which Spark leverages Curator to manage
> > ZK connections and elect a leader. Our ZooKeeper is not very stable, and
> > we sometimes get "session suspended and reconnected". The problem is that
> > this kind of disconnect and reconnect triggers leader election quite
> > often, and Spark's reaction to a leadership switch can be very costly.
> >  
> > So I'm wondering whether it's possible to tolerate such failure cases
> > if we can reconnect soon and the session is actually kept after the
> > reconnection.
> > Does such a requirement make sense to you?
> >  
> > Any advice will be appreciated.  
> >  
> >  
> > Thanks  
> > Dong Lei  
> >  
> >  
>  
>  
>  
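
The clock-based inference suggested earlier in the thread can be sketched roughly as follows. This is an illustration under stated assumptions, not Curator code: `SessionExpiryEstimator` and its methods are hypothetical names, the clock is injected so the logic can be exercised deterministically, and the timer starts when the client notices the disconnect, which only approximates when the server will actually expire the session.

```java
import java.util.function.LongSupplier;

class SessionExpiryEstimator {
    private final long sessionTimeoutMs;
    private final LongSupplier clockMs;   // injected clock, in milliseconds
    private long suspendedAtMs = -1;      // -1 means "currently connected"

    SessionExpiryEstimator(long sessionTimeoutMs, LongSupplier clockMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.clockMs = clockMs;
    }

    // Call when the connection is first noticed to be down.
    synchronized void onSuspended() {
        if (suspendedAtMs < 0) {
            suspendedAtMs = clockMs.getAsLong();
        }
    }

    // Call on reconnect. Returns true if the session is presumed to have survived.
    // The check and the reset happen under one lock, so a reconnect that races
    // with the timeout yields a single, consistent answer.
    synchronized boolean onReconnected() {
        boolean lost = isSessionPresumedLost();
        suspendedAtMs = -1;
        return !lost;
    }

    // True once the disconnect has lasted at least the session timeout.
    synchronized boolean isSessionPresumedLost() {
        return suspendedAtMs >= 0
            && clockMs.getAsLong() - suspendedAtMs >= sessionTimeoutMs;
    }

    public static void main(String[] args) {
        long[] now = { 0 };  // fake clock for the demo
        SessionExpiryEstimator estimator =
            new SessionExpiryEstimator(10_000, () -> now[0]);

        estimator.onSuspended();
        now[0] = 4_000;
        System.out.println(estimator.isSessionPresumedLost()); // false
        now[0] = 10_000;
        System.out.println(estimator.isSessionPresumedLost()); // true
    }
}
```

In real use the injected clock would likely be a monotonic source such as `System.nanoTime()` converted to milliseconds, so that wall-clock adjustments on the client do not skew the estimate.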
