zookeeper-user mailing list archives

From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: locking/leader election and dealing with session loss
Date Thu, 16 Jul 2015 18:45:34 GMT
And a new Curator Tech Note to match:



On July 16, 2015 at 12:54:29 PM, Ivan Kelly (ivank@apache.org) wrote:

I've seen 40s+.  

Also, if combined with a network partition, the gc pause only needs 1/3 of  
session timeout for the same effect to occur.  
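
One way to read the 1/3 figure, as a rough sketch rather than a statement of exactly what Ivan meant: it assumes the ZooKeeper Java client gives up on a silent connection after about 2/3 of the session timeout, so a partition can consume up to 2/3 of the budget before the client reacts, and a pause covering the remaining 1/3 is enough for the session to expire on the server while the client still believes it holds the lock. The function name and the 30s timeout are illustrative, not from the thread.

```python
def pause_needed_ms(session_timeout_ms: int, partitioned: bool) -> int:
    """Smallest client stall that lets the session expire unnoticed.

    Assumption (not from the thread): the client only notices a dead
    connection after roughly 2/3 of the session timeout has elapsed.
    """
    if not partitioned:
        # With a healthy network, the stall alone must outlast the session.
        return session_timeout_ms
    # A partition can silently use up to 2/3 of the timeout first.
    silent_ms = session_timeout_ms * 2 // 3
    return session_timeout_ms - silent_ms


timeout_ms = 30_000  # illustrative session timeout
print(pause_needed_ms(timeout_ms, partitioned=False))
print(pause_needed_ms(timeout_ms, partitioned=True))
```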

On Thu, 16 Jul 2015 15:58 Camille Fournier <camille@apache.org> wrote:  

> They can and have happened in prod to people. I started talking about it  
> after hearing enough people complain about just this situation on Twitter.  
> If you are relying on very large jvm memory footprints a 30s gc pause can  
> and should be expected. In general I think most people don't need to worry  
> about this most of the time but it's one of those things that happens and  
> the developers are almost always shocked. I'm a fan of being clear about  
> edge cases, even rare ones, so that devs can make the right tradeoffs for  
> their env.  
> Of course there are a myriad theoretical possibilities. But I don’t  
> believe any of what you’ve mentioned will happen in production. For any  
> reasonable case, you can be guaranteed that no two processes will consider  
> themselves lock holders at the same instant in time.  
> -Jordan  
> On July 16, 2015 at 7:58:06 AM, Ivan Kelly (ivank@apache.org) wrote:  
> On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:  
> > Are you really seeing 30s gc pauses in production? If so, then of course  
> > this could happen. However, if your application can tolerate a 30s pause  
> > (which is hard to believe) then your session timeout is too low. The point  
> > of the session timeout is to have enough coverage. So, if your app has 30  
> > seconds allowable pauses your session timeout would have to be much longer.  
> >  
> GC is just an example. There are other ways the same scenario could happen.  
> The machine could swap out the process due to load. Someone could do  
> something stupid in the zookeeper event thread and the session expired  
> event is delayed. The state update could have hit the ip stack during  
> network partition, and the process then got wedged. The state update packet  
> could have hit the network and been routed via the moon. The clock could  
> break.  
> If you are relying on a timer on the zk client to maintain a guarantee,  
> then you really aren't giving any guarantee because the zk client doesn't  
> have control over all the things that could go wrong.  
> -Ivan  
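
Ivan's point, that a client-side timer cannot give the guarantee by itself, is often handled by having the protected resource check a monotonically increasing token. That technique is not proposed anywhere in this thread; the following is a minimal sketch under that assumption. `FencedStore` and the token values are hypothetical; in practice a token might be derived from something like the lock znode's sequence number.

```python
class FencedStore:
    """Hypothetical resource that rejects writes carrying a stale token."""

    def __init__(self):
        self.highest_token = -1
        self.value = None

    def write(self, token, value):
        # Refuse any writer whose token is older than one already seen.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()

# Holder A acquired the lock first (token 1), then stalled, e.g. in a long
# GC pause, and its session expired. Holder B then acquired the lock (token 2).
b_ok = store.write(2, "from B")

# A wakes up still believing it holds the lock and tries to write.
a_ok = store.write(1, "from A")

print(b_ok, a_ok, store.value)
```

Here the check lives in the resource rather than in either client, so it does not matter how long A was paused or whether A ever saw its session-expired event.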
