zookeeper-user mailing list archives

From Yuriy Lopotun <yuriy.lopo...@gmail.com>
Subject Re: Zookeeper-Zoodiscovery auto reconnect issue
Date Wed, 15 Apr 2015 20:49:56 GMT
Thanks for your reply.
I agree that zooKeeper.getState().isAlive() is a good way to check the
state.

But notice that after sending the Disconnected event (inside the while
loop), the send thread proceeds almost immediately to the next loop iteration.
So "while (zooKeeper.state.isAlive())" has a high chance of still
evaluating to true at that moment, because Zoodiscovery is concurrently
triggering a chain of method invocations:
ZooKeeper.close() -> ClientCnxn.close() -> disconnect() ->
sendThread.close() -> zooKeeper.state = States.CLOSED
which is likely to take longer to execute than a single condition
evaluation.
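
To make the interleaving concrete, here is a minimal single-threaded replay of it. This is only a model under stated assumptions: CloseRaceSketch, its State enum, and the log strings are hypothetical stand-ins rather than ZooKeeper classes, and the real close() chain and send-thread loop are reduced to one step each.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseRaceSketch {
    // Simplified stand-in for ZooKeeper.States; only the states needed here.
    enum State {
        CONNECTING, CONNECTED, CLOSED;
        boolean isAlive() { return this != CLOSED; }
    }

    private State state = State.CONNECTED;
    private final List<String> log = new ArrayList<>();

    // Models one evaluation of the send-thread loop condition plus its body.
    private void sendThreadIteration() {
        if (state.isAlive()) {
            log.add("startConnect");
            state = State.CONNECTING;
        } else {
            log.add("loop exits");
        }
    }

    // Models the last step of the close() chain, where the state is set.
    private void finishClose() {
        state = State.CLOSED;
        log.add("state=CLOSED");
    }

    /** Replays the ordering in which the loop check wins the race. */
    public static List<String> replay() {
        CloseRaceSketch s = new CloseRaceSketch();
        s.log.add("Disconnected event queued");
        s.log.add("watcher calls close()"); // chain started, state not yet CLOSED
        s.sendThreadIteration();            // condition still true: reconnect begins
        s.finishClose();                    // close() now lands on the new attempt
        s.sendThreadIteration();            // only now does the loop stop
        return s.log;
    }

    public static void main(String[] args) {
        replay().forEach(System.out::println);
    }
}
```

In this replay "startConnect" is logged before "state=CLOSED", which is the extra reconnect attempt I am describing.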

So ZooKeeper will invoke startConnect() at least once, which triggers a
reconnect. Meanwhile ZooDiscovery, as I mentioned, has already called
ZooKeeper.close(), which will then try to close the newly opened ZooKeeper
connection.
I'm trying to find a way to avoid this situation...
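
One direction might be to make the decision on the Zoodiscovery side, so that close() is never called while the client may still reconnect on its own. Below is a rough sketch of such a decision function. It assumes that an Expired session event always needs a new handle and that a Disconnected event with a live handle can be left to the client. ReconnectPolicySketch, Action and onStateEvent are hypothetical names, and the KeeperState enum is a local copy of a few names from Watcher.Event.KeeperState so that the snippet compiles without the ZooKeeper jar.

```java
public class ReconnectPolicySketch {
    // Local mirror of a few Watcher.Event.KeeperState names (assumption:
    // only these three matter for the decision).
    enum KeeperState { Disconnected, SyncConnected, Expired }

    // Hypothetical actions the discovery watcher could take.
    enum Action { WAIT_FOR_CLIENT, RECREATE_HANDLE, NONE }

    /**
     * Decides what the watcher should do for a connection state event.
     * handleAlive is meant to be the result of
     * zooKeeper.getState().isAlive() at the time the event is processed.
     */
    static Action onStateEvent(KeeperState state, boolean handleAlive) {
        switch (state) {
            case Expired:
                // The session is gone; the old handle cannot be reused.
                return Action.RECREATE_HANDLE;
            case Disconnected:
                // A live handle is expected to reconnect by itself,
                // so closing it here would cause the double reconnect.
                return handleAlive ? Action.WAIT_FOR_CLIENT
                                   : Action.RECREATE_HANDLE;
            default:
                return Action.NONE;
        }
    }

    public static void main(String[] args) {
        System.out.println(onStateEvent(KeeperState.Disconnected, true));
        System.out.println(onStateEvent(KeeperState.Disconnected, false));
        System.out.println(onStateEvent(KeeperState.Expired, false));
    }
}
```

Since the watcher would no longer call close() on a plain Disconnected event, the race with the loop condition should not arise in that path, though I have not verified this against Zoodiscovery itself.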

Yuriy

2015-04-15 16:15 GMT-04:00 Camille Fournier <camille@apache.org>:

> So we have the notion of state that you can check.
> zooKeeper.getState().isAlive() will tell you if the client is actually
> alive or not.
>
> Looking through the code I'm not 100% sure why we are sending the
> Disconnected state change after the while loop, or if the code ever would,
> since the state should not be alive at that point (or else it wouldn't have
> left the while loop).
>
> In general though it sounds like a bug in the discovery side as you said. A
> check for the state liveness (are we closed/auth_failed or just
> disconnected) should fix this, I think.
>
> C
>
> On Wed, Apr 15, 2015 at 1:46 PM, Yuriy Lopotun <yuriy.lopotun@gmail.com>
> wrote:
>
> > Hi guys,
> >
> >
> > In our client-server OSGI application we are using ECF Zoodiscovery
> > provider for remote services discovery which uses Zookeeper (v.3.3.3) under
> > the hood. When testing the application resiliency, we noticed that when
> > unplugging/plugging back the network cable, the client in some cases
> > doesn’t get back remote OSGI services from the server.
> >
> >
> > I started debugging this use case and found out that in case of session
> > timeout both Zookeeper internally and Zoodiscovery try reconnecting
> > simultaneously:
> >
> > 1) Zookeeper internally:
> >
> > in ClientCnxn.SendThread.run() in case of SessionTimeoutException it closes
> > socket connection in cleanup(), sends the disconnect event to watchers and
> > reconnects in startConnect().
> >
> > 2) Zoodiscovery:
> >
> > Watcher receives the disconnect event from Zookeeper and closes/reopens a
> > new connection by:
> >
> > // discard the current stale reader
> > this.readKeeper.close();
> > // try reconnecting
> > this.readKeeper = new ZooKeeper(this.ip, 3000, this);
> >
> >
> >
> > This results in a connect-disconnect-connect operation (since Zoodiscovery
> > closes the just reopened by Zookeeper connection and creates a new one)
> > instead of just one connect. Moreover, this also sometimes results in an
> > inconsistent client state – connection finally gets re-established, but the
> > client doesn’t ask the server for the remote services.
> >
> >
> > I think that the issue in this case is on the Zoodiscovery’s side – it
> > should not trigger hard disconnect/reconnect in cases when Zookeeper does
> > it internally. However, I’m not sure how it could distinguish these cases,
> > because Zookeeper sends an identical disconnect event regardless of whether
> > or not it’s going to re-connect internally:
> >
> > eventThread.queueEvent(new WatchedEvent(
> >                        Event.EventType.None,
> >                        Event.KeeperState.Disconnected,
> >                        null));
> >
> > This call appears both in the catch block within the
> > ClientCnxn.SendThread while loop and just after the loop.
> >
> >
> > So, I wanted to ask for your suggestion of how to better handle the
> > disconnect cases to avoid double reconnects and initiate hard reconnect
> > from Zoodiscovery only when Zookeeper doesn’t do it internally.
> >
> >
> > Thanks,
> >
> > Yuriy
> >
>
