zookeeper-user mailing list archives

From Michael Han <h...@cloudera.com>
Subject Re: Zookeeper data loss scenarios
Date Fri, 06 Jan 2017 01:14:33 GMT
I suspect that you might have hit ZOOKEEPER-2325
<https://issues.apache.org/jira/browse/ZOOKEEPER-2325> / ZOOKEEPER-261
<https://issues.apache.org/jira/browse/ZOOKEEPER-261>, which could possibly
cause data loss. Consider this case: we have servers A, B, and C, but for some
reason A and B got replaced by Exhibitor with empty data directories. Then C
goes down (or C is slower to respond), so either A or B gets elected as leader.
When C then reaches out to the leader, it truncates its own data. This is an
extreme case (complete data loss), but it sounds possible.
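
To make the A/B/C case concrete, here is a toy model in Python. It is only a sketch of the scenario as described above, not ZooKeeper's actual election or sync code: the Server class, elect_leader, and sync_with_leader are hypothetical names, the log length stands in for the last zxid, and the election rule is simplified to "highest (zxid, server id) among the servers that take part".

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    sid: int
    # Committed transactions, oldest first (toy stand-in for the txn log).
    log: list = field(default_factory=list)

    @property
    def last_zxid(self) -> int:
        # Toy stand-in: the number of transactions plays the role of the zxid.
        return len(self.log)


def elect_leader(voters):
    # Simplified rule (assumption): highest (last_zxid, sid) among voters wins.
    return max(voters, key=lambda s: (s.last_zxid, s.sid))


def sync_with_leader(follower, leader):
    if follower.last_zxid > leader.last_zxid:
        # Follower is ahead of the leader: it drops the extra transactions.
        del follower.log[leader.last_zxid:]
    else:
        # Follower is behind: it copies what it is missing from the leader.
        follower.log.extend(leader.log[follower.last_zxid:])


a = Server("A", 1)  # replaced with an empty data directory
b = Server("B", 2)  # replaced with an empty data directory
c = Server("C", 3, log=["create /configs", "set /configs/solr"])

leader = elect_leader([a, b])  # C is down or slow, so only A and B vote
sync_with_leader(c, leader)    # C comes back and follows the empty leader

print(leader.name, c.log)
```

In this model, if C had taken part in the election it would have won and A and B would have copied its data instead; the loss only happens because the two empty servers form a quorum on their own.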

Do we have Exhibitor logs showing what Exhibitor did? As you mentioned, things
were running fine prior to Exhibitor, so it could be something Exhibitor did
that caused this, such as reinitializing a server or purging a data directory.
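
Alongside the Exhibitor logs, it may help to record each server's last zxid before and after a rolling restart. A minimal sketch, assuming the text comes from the `srvr` four-letter command (for example `echo srvr | nc <host> 2181`, with your own host and client port); parse_srvr is a hypothetical helper that only pulls out the Zxid and Mode lines.

```python
def parse_srvr(text):
    """Extract the last zxid and the server mode from `srvr` output."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        key, value = key.strip(), value.strip()
        if key == "Zxid":
            info["zxid"] = int(value, 16)  # reported in hex, e.g. 0x100000002
        elif key == "Mode":
            info["mode"] = value  # leader, follower, or standalone
    return info
```

A server that reports a zxid of 0x0 while the others report much higher values has probably lost or never received the data, and would be worth investigating before it is restarted into the ensemble.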

On Thu, Jan 5, 2017 at 2:27 PM, Washko, Daniel <dwashko@gannett.com> wrote:

> I am trying to get to the bottom of the cause for loss of configurations
> for Solr cloud stored in a Zookeeper ensemble. We have been running 4 Solr
> clouds in our data centers for about 5 years now with no problems. About 2
> years ago we started adding more clouds specifically in AWS.  During those
> two years, we have had instances where the Solr configurations stored in
> Zookeeper have just disappeared. About a year ago we added some new Solr
> clouds to our own datacenters and experienced two instances of the Solr
> configurations disappearing in Zookeeper. The difference between our
> original Solr Cloud instances and the ones we have spun up in the past two
> years is that we are using Exhibitor for Zookeeper Ensemble management.
> We have not been able to find anything in the logs indicating why this
> problem happens. We have not been able to replicate the problem reliably.
> The closest I have come is when adding new Zookeepers to an ensemble and
> performing a rolling restart via Exhibitor, there have been a few instances
> where pretty much everything stored in Zookeeper has been deleted.
> Everything except the Zookeeper information itself. We have asked around on
> Exhibitor support channels and done a lot of searching but have come up
> empty-handed with regard to a solution or to finding other people who have
> had this issue.
> What I suspect is happening is that when rolling restarts happen, if the
> node that becomes the leader is a new node that has not had the data
> replicated to it, when new nodes join to this leader, they see the leader
> is without the data they have stored and thus they should delete said data.
> In the cases where we are not adding new nodes, I suspect that there might
> be an issue causing the zookeeper node to fail or appear failed to Exhibitor.
> A rolling restart occurs to remove this node. When Exhibitor registers that
> the zookeeper is available again, it initiates a rolling restart to bring the
> node back in. For some reason the data is corrupted or lost on that node
> and this is the node that becomes the leader. The remaining nodes that join
> this leader then dump their data to match the leader.
> Does this scenario sound plausible? If a newly added node that does not
> have data replicated to it is added to a zookeeper ensemble and the
> zookeepers are restarted with the new node becoming the leader, could this
> prompt the data stored in Zookeeper to be deleted?
> --
> *Daniel S Washko*
> Solutions Architect
> Phone: 757 667 1463
> dwashko@gannett.com
> gannett.com <http://www.gannett.com/>

