zookeeper-user mailing list archives

From: "Washko, Daniel" <dwas...@gannett.com>
Subject: Zookeeper data loss scenarios
Date: Thu, 05 Jan 2017 22:27:44 GMT

I am trying to get to the bottom of what causes Solr Cloud configurations stored in a
Zookeeper ensemble to be lost. We have been running 4 Solr clouds in our data centers for about
5 years now with no problems. About 2 years ago we started adding more clouds, specifically
in AWS. During those two years, we have had instances where the Solr configurations stored
in Zookeeper have just disappeared. About a year ago we added some new Solr clouds to our
own data centers and experienced two instances of the Solr configurations disappearing in Zookeeper.
The difference between our original Solr clouds and the ones we have spun up in
the past two years is that the newer ones use Exhibitor for Zookeeper ensemble management.

We have not been able to find anything in the logs indicating why this problem happens, and we
have not been able to replicate the problem reliably. The closest I have come is when adding
new Zookeepers to an ensemble and performing a rolling restart via Exhibitor: there have been
a few instances where pretty much everything stored in Zookeeper has been deleted, everything
except the Zookeeper information itself. We have asked around on Exhibitor support channels
and done a lot of searching, but have come up empty-handed, both in finding a solution and in
finding other people who have had this issue.

What I suspect is happening is this: during a rolling restart, if the node that becomes
the leader is a new node that has not yet had the data replicated to it, then the other nodes
that join this leader see that the leader lacks the data they have stored and conclude that
they should delete it. In the cases where we are not adding new nodes, I suspect that there
might be an issue causing the Zookeeper node to fail, or to appear failed to Exhibitor. A rolling
restart occurs to remove this node. When Exhibitor registers that the Zookeeper node is available
again, it initiates another rolling restart to bring the node back in. For some reason the data
is corrupted or lost on that node, and this is the node that becomes the leader. The remaining
nodes that join this leader then dump their data to match the leader.

Does this scenario sound plausible? If a new node that has had no data replicated to it
is added to a Zookeeper ensemble and the Zookeepers are restarted with the new node
becoming the leader, could this prompt the data stored in Zookeeper to be deleted?
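
One way to gather evidence for this scenario would be to record each server's role, last zxid
and znode count before and after a rolling restart. The following is only a minimal sketch, not
a tested tool: it assumes the "srvr" four-letter command is reachable on the client port of
every server, and the host names at the bottom are placeholders. A follower that is slightly
behind on a busy ensemble is normal, so only a large gap, or a leader with a tiny node count,
would be of interest.

```python
import socket


def query_srvr(host, port=2181, timeout=5.0):
    """Send the 'srvr' four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_srvr(text):
    """Extract mode, last zxid and znode count from 'srvr' output."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Mode":
            info["mode"] = value
        elif key == "Zxid":
            info["zxid"] = int(value, 16)  # reported as hex, e.g. 0x100000002
        elif key == "Node count":
            info["node_count"] = int(value)
    return info


def suspect_servers(stats):
    """Return hosts whose zxid is behind the highest zxid seen in the ensemble."""
    known = {h: s for h, s in stats.items() if "zxid" in s}
    if not known:
        return []
    newest = max(s["zxid"] for s in known.values())
    return sorted(h for h, s in known.items() if s["zxid"] < newest)


if __name__ == "__main__":
    # Placeholder host names; replace with the real ensemble members.
    hosts = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
    stats = {h: parse_srvr(query_srvr(h)) for h in hosts}
    for host, info in sorted(stats.items()):
        print(host, info)
    print("behind the newest zxid:", suspect_servers(stats))
```

Running this just before Exhibitor starts a rolling restart and again once the ensemble has
re-formed would show whether the server reporting "Mode: leader" afterwards is one that had a
low zxid or node count beforehand.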

Daniel S Washko
Solutions Architect

Phone: 757 667 1463

