From: Michael Han
Date: Thu, 5 Jan 2017 17:14:33 -0800
Subject: Re: Zookeeper data loss scenarios
To: user@zookeeper.apache.org

I suspect you might be hitting ZOOKEEPER-2325 / ZOOKEEPER-261, which could possibly cause data loss. Consider this case: we have servers A, B, and C, but for some reason A and B get replaced by Exhibitor with empty data directories. Then C is down (or C responds more slowly), so either A or B gets elected leader, and when C reaches out to the leader it truncates its own data to match. This is an extreme case (complete data loss), but it sounds possible. Do we have Exhibitor logs showing what Exhibitor did? As you mentioned, things ran fine prior to Exhibitor, so it could be something Exhibitor did that caused this, such as reinitializing a server or purging its data directory. (A rough check along these lines is sketched at the end of this message.)

On Thu, Jan 5, 2017 at 2:27 PM, Washko, Daniel wrote:

> I am trying to get to the bottom of the cause for the loss of configurations
> for Solr Cloud stored in a Zookeeper ensemble. We have been running 4 Solr
> clouds in our data centers for about 5 years now with no problems. About 2
> years ago we started adding more clouds, specifically in AWS. During those
> two years, we have had instances where the Solr configurations stored in
> Zookeeper have just disappeared. About a year ago we added some new Solr
> clouds to our own datacenters and experienced two instances of the Solr
> configurations disappearing in Zookeeper. The difference between our
> original Solr Cloud instances and the ones we have spun up in the past two
> years is that we are using Exhibitor for Zookeeper ensemble management.
>
> We have not been able to find anything in the logs indicating why this
> problem happens. We have not been able to replicate the problem reliably.
> The closest I have come is when adding new Zookeepers to an ensemble and
> performing a rolling restart via Exhibitor, there have been a few instances
> where pretty much everything stored in Zookeeper has been deleted.
> Everything except the Zookeeper information itself. We have asked around on
> Exhibitor support channels and done a lot of searching, but have come up
> empty handed as far as a solution or finding other people who have
> had this issue.
>
> What I suspect is happening is that when rolling restarts happen, if the
> node that becomes the leader is a new node that has not had the data
> replicated to it, then when other nodes join this leader, they see the leader
> is without the data they have stored and conclude they should delete said data.
> In the cases where we are not adding new nodes, I suspect that there might be
> an issue causing a zookeeper node to fail or appear failed to Exhibitor.
> A rolling restart occurs to remove this node. When Exhibitor registers that
> the zookeeper is available again, Exhibitor initiates a rolling restart to
> bring the node back in. For some reason the data is corrupted or lost on
> that node, and this is the node that becomes the leader.
> The remaining nodes that join this leader then dump their data to match
> the leader.
>
> Does this scenario sound plausible? If a newly added node that does not
> have data replicated to it is added to a zookeeper ensemble, and the
> zookeepers are restarted with the new node becoming the leader, could this
> prompt the data stored in Zookeeper to be deleted?
>
> --
> Daniel S Washko
> Solutions Architect
>
> Phone: 757 667 1463
> dwashko@gannett.com
> gannett.com

--
Cheers
Michael.
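
One way to catch the scenario described above before it bites is to ask each ZooKeeper server individually what it serves for the Solr config root right before (and after) a rolling restart. Below is a minimal sketch using the standard ZooKeeper Java client. The server addresses and the /configs path are assumptions, not something taken from this thread, so adjust them to your ensemble. Connecting to each server directly, rather than through the full connection string, is deliberate: a replica serving an empty or stale view shows up plainly.

import org.apache.zookeeper.ZooKeeper;

import java.util.Arrays;
import java.util.List;

// Rough per-server sanity check: talk to EACH ZooKeeper server directly and
// report what it serves for the Solr config root. A server showing a missing
// or empty /configs just before a rolling restart is a candidate for the
// "empty leader" scenario described above. Addresses and path are assumptions.
public class PreRestartCheck {
    public static void main(String[] args) throws Exception {
        List<String> servers = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181");
        String configRoot = "/configs";   // where Solr Cloud keeps configsets

        for (String server : servers) {
            ZooKeeper zk = new ZooKeeper(server, 15000, event -> { });
            try {
                if (zk.exists(configRoot, false) == null) {
                    System.out.println(server + ": " + configRoot + " is MISSING");
                } else {
                    int count = zk.getChildren(configRoot, false).size();
                    System.out.println(server + ": " + count + " configset(s) under " + configRoot);
                }
            } catch (Exception e) {
                // A server that is down or still syncing typically surfaces here
                // as a connection-loss error; that is useful information too.
                System.out.println(server + ": not answering (" + e + ")");
            } finally {
                zk.close();
            }
        }
    }
}

Comparing the per-server output is the point: if all servers agree on the configset count before the restart and one comes back empty afterwards, you have narrowed the window in which the data vanished.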
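
Given the rolling-restart scenario Daniel describes, it may also be worth taking a cheap snapshot of the Solr config subtree before any Exhibitor-driven restart, so a wiped ensemble can at least be repopulated. The sketch below recursively dumps a znode tree to local files; again, the connection string, the /configs root, and the output directory are assumptions, and this is a rough illustration rather than a production backup tool.

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical snapshot of a znode subtree to local files, meant to run
// before an Exhibitor-driven rolling restart. Each znode becomes a directory
// containing a "data" file, so parent data and child nodes do not collide.
public class ZkTreeDump {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });
        try {
            dump(zk, "/configs", Paths.get("zk-backup"));
        } finally {
            zk.close();
        }
    }

    static void dump(ZooKeeper zk, String znode, Path outDir) throws Exception {
        Path nodeDir = znode.equals("/") ? outDir : outDir.resolve(znode.substring(1));
        Files.createDirectories(nodeDir);

        byte[] data = zk.getData(znode, false, new Stat());
        Files.write(nodeDir.resolve("data"), data == null ? new byte[0] : data);

        for (String child : zk.getChildren(znode, false)) {
            String childPath = znode.equals("/") ? "/" + child : znode + "/" + child;
            dump(zk, childPath, outDir);
        }
    }
}

This does not address the root cause, but it turns "the configurations disappeared" from a reconstruction exercise into a restore.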