zookeeper-user mailing list archives

From Jim Keeney <...@fitterweb.com>
Subject Ensemble fails when one node loses connectivity
Date Fri, 02 Mar 2018 01:43:36 GMT
I'm using ZooKeeper with Solr to create a cluster and I have come across
what seems like unexpected behavior. The cluster is set up on AWS using
OpsWorks. I am using a 3-node ZooKeeper ensemble. The ZooKeeper config on
all three nodes is:

clientPort=2181
dataDir=/var/opt/zookeeper/data
tickTime=2000
autopurge.purgeInterval=24
initLimit=100
syncLimit=5
server.1=172.31.86.130:2888:3888
server.2=172.31.16.234:2888:3888
server.3=172.31.73.122:2888:3888
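
One way to see which role each server in a config like this currently reports is ZooKeeper's `srvr` four-letter command, sent to the client port. The following is only a minimal sketch, assuming Python 3 and that `srvr` is permitted on the servers (newer releases restrict four-letter commands through a whitelist). The helper names `parse_servers`, `extract_mode`, and `query_mode` are hypothetical and not part of ZooKeeper.

```python
import socket


def parse_servers(config_text):
    """Return {server_id: (host, quorum_port, election_port)} from zoo.cfg text."""
    servers = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line.startswith("server.") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        server_id = int(key.split(".", 1)[1])
        host, quorum_port, election_port = value.split(":")[:3]
        servers[server_id] = (host, int(quorum_port), int(election_port))
    return servers


def extract_mode(reply):
    """Pull the value of the 'Mode:' line out of a srvr reply, if present."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_mode(host, client_port=2181, timeout=3.0):
    """Send 'srvr' to one server; return its Mode, or None if it cannot be read."""
    try:
        with socket.create_connection((host, client_port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closes the connection after replying
                    break
                chunks.append(data)
    except OSError:
        return None
    return extract_mode(b"".join(chunks).decode("utf-8", "replace"))
```

Calling `query_mode(host)` for each host returned by `parse_servers` would be expected to show one leader and the rest followers in a healthy ensemble; a `None` result means the server did not answer or is not currently serving requests.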


Here is the issue:

If one node in the ensemble fails or is shut down, the ensemble carries on.
However, when the node is restarted, its attempts to connect to the other
members of the cluster are rejected. The only way that I have found to
restore the ensemble is to restart all of the nodes within a short time
span of each other.
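
To narrow down where such a rejection happens, it can help to check from the restarted node whether the quorum and election ports of the other members accept TCP connections at all. This is only a minimal sketch, assuming Python 3 is available on the node; `port_open` and `check_peers` are hypothetical helpers, and 2888 and 3888 are the ports from the server.N lines above.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_peers(peers, ports=(2888, 3888), timeout=3.0):
    """Return {(host, port): reachable} for every peer and port given."""
    return {
        (host, port): port_open(host, port, timeout)
        for host in peers
        for port in ports
    }
```

For example, `check_peers(["172.31.86.130", "172.31.16.234"])` run on the third node reports each peer port as reachable or not. Note that the quorum port (2888) normally only listens on the current leader, so a `False` there for a follower is expected. If the election port (3888) is reachable on the peers, the rejection is probably happening inside ZooKeeper rather than at the network level, and the server logs are the next place to look.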

If I do that, they are able to discover each other, carry out a proper leader
election, and restore order.

Once they are restored everything is fine, but if one of the nodes goes down
we are faced with the same problem.

How do I ensure that if a node goes down, it can restart and rejoin the
ensemble without having to manually restart all the other nodes?

Any help appreciated.

Thanks.

Jim K.




-- 
Jim Keeney
President, FitterWeb
E: jim@fitterweb.com
M: 703-568-5887

*FitterWeb Consulting*
*Are you lean and agile enough? *
