zookeeper-user mailing list archives

From Ivan Kelly <iv...@apache.org>
Subject Re: locking/leader election and dealing with session loss
Date Thu, 16 Jul 2015 17:54:12 GMT
I've seen 40s+.

Also, if combined with a network partition, the gc pause only needs 1/3 of
session timeout for the same effect to occur.
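The 1/3 figure can be worked through with a little arithmetic. This is only a sketch of one reading of the point above: it assumes the client gives up on a silent connection after 2/3 of the session timeout (an assumption here, not something stated in this thread), so a stall that begins just before that moment only has to cover the remaining 1/3 for the session to expire on the server before the client reacts. The function name and the millisecond units are illustrative.

```python
def min_pause_for_overlap_ms(session_timeout_ms, detect_num=2, detect_den=3):
    """Shortest stall that hides a partition until the session has expired.

    Assumes the partition starts at t=0, the client would notice the dead
    connection at detect_num/detect_den of the session timeout, and the
    server expires the session at the full timeout.
    """
    detect_at_ms = session_timeout_ms * detect_num // detect_den
    return session_timeout_ms - detect_at_ms


# With a 30s session timeout the client would notice at 20s,
# so a stall of about 10s starting just before then is enough.
print(min_pause_for_overlap_ms(30_000))  # 10000
```

Under these assumptions a 10s pause, far shorter than the 30s discussed in the thread, is enough when it coincides with a partition.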

On Thu, 16 Jul 2015 15:58 Camille Fournier <camille@apache.org> wrote:

> They can and have happened in prod to people. I started talking about it
> after hearing enough people complain about just this situation on twitter.
> If you are relying on very large JVM memory footprints, a 30s GC pause can
> and should be expected. In general I think most people don't need to worry
> about this most of the time but it's one of those things that happens and
> the developers are almost always shocked. I'm a fan of being clear about
> edge cases, even rare ones, so that devs can make the right tradeoffs for
> their env.
> Of course there are a myriad theoretical possibilities. But I don’t
> believe any of what you’ve mentioned will happen in production. For any
> reasonable case, you can be guaranteed that no two processes will consider
> themselves lock holders at the same instant in time.
>
> -Jordan
>
>
> On July 16, 2015 at 7:58:06 AM, Ivan Kelly (ivank@apache.org) wrote:
>
> On Thu, Jul 16, 2015 at 1:38 PM Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:
>
> > Are you really seeing 30s gc pauses in production? If so, then of course
> > this could happen. However, if your application can tolerate a 30s pause
> > (which is hard to believe) then your session timeout is too low. The point
> > of the session timeout is to have enough coverage. So, if your app has 30
> > seconds allowable pauses your session timeout would have to be much longer.
> >
> GC is just an example. There are other ways the same scenario could happen.
> The machine could swap out the process due to load. Someone could do
> something stupid in the zookeeper event thread and the session expired
> event is delayed. The state update could have hit the ip stack during
> network partition, and the process then got wedged. The state update packet
> could have hit the network and been routed via the moon. The clock could
> break.
>
> If you are relying on a timer on the zk client to maintain a guarantee,
> then you really aren't giving any guarantee because the zk client doesn't
> have control over all the things that could go wrong.
>
> -Ivan
>
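The scenario argued over in this thread can be reproduced with a toy simulation. Everything in it is hypothetical: `LockService` and `Client` are stand-ins written for this sketch and are not ZooKeeper or Curator APIs, time is a plain integer passed in by hand, and heartbeats are left out so the lock simply lapses after the timeout. It shows only that a client-side timer cannot help while the process is not being scheduled.

```python
class LockService:
    """Toy stand-in for the server side: a lock tied to a session deadline."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.holder = None
        self.deadline = 0

    def acquire(self, client, now):
        # Grant the lock if it is free or the holder's session has expired.
        if self.holder is None or now >= self.deadline:
            self.holder = client
            self.deadline = now + self.session_timeout
            return True
        return False


class Client:
    """Toy client that tracks its own belief about holding the lock."""

    def __init__(self, name):
        self.name = name
        self.believes_holder = False
        self.local_deadline = 0

    def try_acquire(self, service, now):
        if service.acquire(self.name, now):
            self.believes_holder = True
            self.local_deadline = now + service.session_timeout
        return self.believes_holder

    def check_timer(self, now):
        # Client-side timer: it only runs when the process is scheduled.
        if now >= self.local_deadline:
            self.believes_holder = False


service = LockService(session_timeout=30)
a, b = Client("A"), Client("B")

a.try_acquire(service, now=0)   # A holds the lock
# A is stalled from t=1 to t=45 (GC, swap, wedged thread), so
# check_timer never runs for A during that window.
b.try_acquire(service, now=31)  # A's session has lapsed; B gets the lock
# At t=45 A resumes mid-operation, before its timer callback has fired.
both_believe = a.believes_holder and b.believes_holder
print(both_believe)  # True
```

In this toy model the overlap ends only once `a.check_timer(45)` gets a chance to run, which is exactly the step the stalled process cannot be relied on to perform in time.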
