Ah, maybe I didn't understand your suggestion correctly.
If you meant that zooKeeper.state.isAlive() should be checked on the Zoodiscovery side before triggering a reconnect, then this should indeed fix the issue.

Thanks,
Yuriy

2015-04-15 16:49 GMT-04:00 Yuriy Lopotun :

> Thanks for your reply.
> I agree that zooKeeper.getState().isAlive() is a good way to check the
> state.
>
> But notice that after sending the Disconnected event (inside the while
> loop) it would almost immediately proceed to the next loop iteration.
> So, "while (zooKeeper.state.isAlive())" at this moment has a high chance
> to still evaluate to true, because Zoodiscovery would at the same time
> trigger a chain of method invocations:
> ZooKeeper.close() -> ClientCnxn.close() -> disconnect() ->
> sendThread.close() -> zooKeeper.state = States.CLOSED
> which has a high chance to take more time to execute than a condition
> evaluation.
>
> So, ZooKeeper will invoke startConnect() at least 1 time, which will
> trigger a re-connect. At the same time ZooDiscovery, as I mentioned,
> triggered ZooKeeper.close(), which will try to close the new ZooKeeper
> connection.
> I'm trying to find a way to avoid this situation...
>
> Yuriy
>
> 2015-04-15 16:15 GMT-04:00 Camille Fournier :
>
>> So we have the notion of state that you can check.
>> zooKeeper.getState().isAlive() will tell you if the client is actually
>> alive or not.
>>
>> Looking through the code I'm not 100% sure why we are sending the
>> Disconnected state change after the while loop, or if the code ever would,
>> since the state should not be alive at that point (or else it wouldn't
>> have left the while loop).
>>
>> In general though it sounds like a bug in the discovery side as you said.
>> A check for the state liveness (are we closed/auth_failed or just
>> disconnected) should fix this, I think.
>>
>> C
>>
>> On Wed, Apr 15, 2015 at 1:46 PM, Yuriy Lopotun wrote:
>>
>> > Hi guys,
>> >
>> > In our client-server OSGI application we are using ECF Zoodiscovery
>> > provider for remote services discovery which uses Zookeeper (v.3.3.3)
>> > under the hood. When testing the application resiliency, we noticed
>> > that when unplugging/plugging back the network cable, the client in
>> > some cases doesn’t get back remote OSGI services from the server.
>> >
>> > I started debugging this use case and found out that in case of session
>> > timeout both Zookeeper internally and Zoodiscovery try reconnecting
>> > simultaneously:
>> >
>> > 1) Zookeeper internally:
>> >
>> > in ClientCnxn.SendThread.run() in case of SessionTimeoutException it
>> > closes socket connection in cleanup(), sends the disconnect event to
>> > watchers and reconnects in startConnect().
>> >
>> > 2) Zoodiscovery:
>> >
>> > Watcher receives the disconnect event from Zookeeper and closes/reopens
>> > a new connection by:
>> >
>> > // discard the current stale reader
>> > this.readKeeper.close();
>> > // try reconnecting
>> > this.readKeeper = new ZooKeeper(this.ip, 3000, this);
>> >
>> > This results in a connect-disconnect-connect operation (since
>> > Zoodiscovery closes the connection that Zookeeper has just reopened and
>> > creates a new one) instead of just one connect. Moreover, this also
>> > sometimes results in an inconsistent client state: the connection
>> > finally gets re-established, but the client doesn’t ask the server for
>> > the remote services.
>> >
>> > I think that the issue in this case is on the Zoodiscovery’s side: it
>> > should not trigger hard disconnect/reconnect in cases when Zookeeper
>> > does it internally.
>> > However, I’m not sure how it could distinguish these cases, because
>> > Zookeeper sends an identical disconnect event regardless of whether
>> > or not it’s going to re-connect internally:
>> >
>> > eventThread.queueEvent(new WatchedEvent(
>> >         Event.EventType.None,
>> >         Event.KeeperState.Disconnected,
>> >         null));
>> >
>> > This call appears both in the ClientCnxn.SendThread catch block within
>> > the while loop and just after it.
>> >
>> > So, I wanted to ask for your suggestion of how to better handle the
>> > disconnect cases to avoid double reconnects and initiate hard reconnect
>> > from Zoodiscovery only when Zookeeper doesn’t do it internally.
>> >
>> > Thanks,
>> >
>> > Yuriy
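
For reference, the liveness check suggested in this thread could look roughly like the sketch below. To keep it compilable without the ZooKeeper jar, `ClientState` is only a stand-in that mirrors `ZooKeeper.States` as it exists in the 3.3.x line, and `needsHardReconnect` is a hypothetical helper name, not part of ZooKeeper or ECF. In a real Watcher the argument would presumably come from `zooKeeper.getState()` at the moment the Disconnected event is processed.

```java
// Stand-in for org.apache.zookeeper.ZooKeeper.States (3.3.x), redeclared
// here only so the sketch compiles on its own. This is an assumption of
// the example, not the real class.
enum ClientState {
    CONNECTING, ASSOCIATING, CONNECTED, CLOSED, AUTH_FAILED;

    // Mirrors States.isAlive(): every state except CLOSED and AUTH_FAILED.
    boolean isAlive() {
        return this != CLOSED && this != AUTH_FAILED;
    }
}

final class ReconnectSketch {
    // Hypothetical helper for the discovery-side Watcher.
    // On a Disconnected event, an alive client is expected to reconnect
    // internally, so a hard close/reopen is only needed when the client
    // is no longer alive.
    static boolean needsHardReconnect(ClientState stateAtDisconnect) {
        return !stateAtDisconnect.isAlive();
    }

    public static void main(String[] args) {
        // Print the decision for every state the stand-in enum defines.
        for (ClientState s : ClientState.values()) {
            System.out.println(s + " -> hard reconnect: " + needsHardReconnect(s));
        }
    }
}
```

With a check like this, a Disconnected event received while the client is still alive would be left to ZooKeeper's own reconnect loop, and Zoodiscovery would close and recreate the handle only for CLOSED or AUTH_FAILED.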