zookeeper-user mailing list archives

From Sampath Perera <samp...@adroitlogic.com>
Subject ZooKeeper quorum re-connection
Date Wed, 08 Jun 2011 18:35:57 GMT

First of all, I must say I appreciate ZooKeeper: I was able to get going with
it pretty fast and implemented clustering (coordination of the nodes in the
cluster) for our product (UltraESB) just by going through the documentation
and a few searches of the mailing list.

Now I am trying to run a sample setup with a ZooKeeper quorum of 3 nodes.
I have set up the quorum locally on localhost, giving each instance different
election ports and client ports, and the quorum seems to be working fine.
Then I started 3 UltraESB server nodes pointing to the quorum, and I noticed
that each UltraESB node connected to one particular ZooKeeper node. To test
reliability, I stopped one ZooKeeper instance, so that 2 out of 3 ZK nodes
were still alive and the quorum should keep working.
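
Whether a client can move to another server depends on the connect string it was given: the ZooKeeper client only tries the servers listed there. A minimal sketch of building such a string for the setup described above follows. The client ports 2181 to 2183 and the class name are assumptions for illustration, not values taken from this setup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: build a connect string that lists every member of a local quorum.
// The ports used in main() are assumed values, not the poster's actual ports.
class ConnectStringSketch {

    // Joins host:port pairs with commas, the format the ZooKeeper client accepts.
    static String connectString(String host, List<Integer> clientPorts) {
        return clientPorts.stream()
                .map(port -> host + ":" + port)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Each ESB node would pass the full list, not just "its" server.
        System.out.println(connectString("localhost", Arrays.asList(2181, 2182, 2183)));
    }
}
```

If each ESB node were configured with only one of the three addresses, it would have no other server to fall back to when that instance stops.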

What I noticed is that whenever I stop a ZK node, the ESB server attached to
that node gets a watched event with the Disconnected keeper state. (I have
registered a handler that stops the ESB cluster manager upon receiving this
event, since it means the ZK connection was lost.) After that, I do not see
the ZK client trying to re-create the session with another node in the quorum.

Could it be due to some problem in the way I have implemented the watched
event processing? Or do we need to manually re-connect to the quorum once we
receive a Disconnected event?
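
For reference, the event handling in question can be sketched as below. This is an illustration under assumptions, not the UltraESB code: `KeeperState` is a local stand-in that mirrors three state names from `org.apache.zookeeper.Watcher.Event.KeeperState` so the example runs without the ZooKeeper jar, and the `Action` values are hypothetical. It reflects the documented client behavior that a Disconnected handle keeps trying the servers in its connect string on its own, while an Expired session needs a new client instance.

```java
// Sketch only: maps a connection state to what the application might do.
// KeeperState is a stand-in enum; Action and its values are hypothetical.
class KeeperStateSketch {

    enum KeeperState { SyncConnected, Disconnected, Expired }

    enum Action { RESUME, PAUSE_AND_WAIT, OPEN_NEW_SESSION }

    static Action actionFor(KeeperState state) {
        switch (state) {
            case SyncConnected:
                // Connected (or reconnected) within the session timeout.
                return Action.RESUME;
            case Disconnected:
                // The same handle retries other servers; session is still alive.
                return Action.PAUSE_AND_WAIT;
            case Expired:
                // Session and its ephemeral nodes are gone; a new handle is needed.
                return Action.OPEN_NEW_SESSION;
            default:
                throw new IllegalArgumentException("unhandled state: " + state);
        }
    }

    public static void main(String[] args) {
        for (KeeperState state : KeeperState.values()) {
            System.out.println(state + " -> " + actionFor(state));
        }
    }
}
```

Under this reading, stopping the cluster manager on Disconnected is a choice the application makes; the session and its ephemeral nodes are only lost if the client fails to reach another server before the session timeout.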

Further, I am using ephemeral nodes, so I want to keep the same session. I
tried to re-create the ZK session by creating a new ZK instance from the ESB
(client) side, passing the previous session id and session password. This
caused the other 2 ESB servers to receive Disconnected events too, although
the ZK quorum itself kept running fine with the 2 nodes it still had up.
Those 2 nodes then got into an infinite loop of disconnects followed by my
attempts to re-create the ZK session, and soon the system reported a "Too
many open files" error, probably because it ran out of file descriptors for
open sockets (I am on Unix).
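
If each retry creates a new client instance without closing the previous one and without any delay, open sockets can accumulate, which would be consistent with the error described; that is a guess, not a diagnosis. The sketch below shows a bounded reconnect with backoff that closes the old handle first. `Handle` and `HandleFactory` are hypothetical stand-ins; in real code the factory would presumably call the `ZooKeeper(String, int, Watcher, long, byte[])` constructor, which accepts a session id and password.

```java
import java.io.IOException;

// Sketch: reconnect with a cap on attempts and a growing delay between them.
// Handle and HandleFactory are hypothetical stand-ins for a ZooKeeper client.
class ReconnectSketch {

    interface Handle extends AutoCloseable {
        @Override
        void close();
    }

    interface HandleFactory {
        Handle open() throws IOException;
    }

    // Delay doubles with each attempt and is capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, maxMillis);
    }

    static Handle reconnect(Handle old, HandleFactory factory,
                            int maxAttempts, long baseMillis, long maxMillis) {
        if (old != null) {
            old.close(); // release the old socket before opening a new one
        }
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return factory.open();
            } catch (IOException e) {
                try {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while reconnecting", ie);
                }
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        int[] opens = {0};
        // Fails twice, then succeeds, to exercise the retry path.
        HandleFactory flaky = () -> {
            if (++opens[0] < 3) {
                throw new IOException("server not reachable");
            }
            return () -> System.out.println("closed");
        };
        Handle handle = reconnect(null, flaky, 5, 10, 100);
        System.out.println("connected after " + opens[0] + " attempts");
        handle.close();
    }
}
```

The cap on attempts and the delay keep a failing reconnect from turning into a tight loop, and closing the old handle keeps at most one connection open per client.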

Any help in understanding this quorum re-connection would be really
appreciated. Is there any documentation for this? If there is, please
bear with me and point me to it.

