From: Michael Han
Date: Thu, 5 Jan 2017 17:14:33 -0800
Subject: Re: Zookeeper data loss scenarios
To: user@zookeeper.apache.org

I suspect you might be hitting ZOOKEEPER-2325 / ZOOKEEPER-261, which could possibly cause data loss. Consider this case: we have servers A, B, and C, but for some reason A and B get replaced by Exhibitor with empty data directories. Then C is down (or C responds more slowly), so either A or B gets elected leader, and when C reaches out to the leader it truncates its own data to match. This is an extreme case (complete data loss), but it sounds possible. Do we have Exhibitor logs showing what Exhibitor did? As you mentioned, things ran fine prior to Exhibitor, so it could be something Exhibitor did that caused this, such as reinitializing a server or purging its data directory. (A rough check along these lines is sketched at the end of this message.)

On Thu, Jan 5, 2017 at 2:27 PM, Washko, Daniel wrote:

> I am trying to get to the bottom of the cause for the loss of configurations
> for Solr Cloud stored in a Zookeeper ensemble. We have been running 4 Solr
> clouds in our data centers for about 5 years now with no problems. About 2
> years ago we started adding more clouds, specifically in AWS. During those
> two years, we have had instances where the Solr configurations stored in
> Zookeeper have just disappeared. About a year ago we added some new Solr
> clouds to our own datacenters and experienced two instances of the Solr
> configurations disappearing in Zookeeper. The difference between our
> original Solr Cloud instances and the ones we have spun up in the past two
> years is that we are using Exhibitor for Zookeeper ensemble management.
>
> We have not been able to find anything in the logs indicating why this
> problem happens. We have not been able to replicate the problem reliably.
> The closest I have come is when adding new Zookeepers to an ensemble and
> performing a rolling restart via Exhibitor, there have been a few instances
> where pretty much everything stored in Zookeeper has been deleted.
> Everything except the Zookeeper information itself. We have asked around on
> Exhibitor support channels and done a lot of searching, but have come up
> empty handed as far as a solution or finding other people who have
> had this issue.
>
> What I suspect is happening is that when rolling restarts happen, if the
> node that becomes the leader is a new node that has not had the data
> replicated to it, then when other nodes join this leader, they see the leader
> is without the data they have stored and conclude they should delete said data.
> In the cases where we are not adding new nodes, I suspect that there might be
> an issue causing a zookeeper node to fail or appear failed to Exhibitor.
> A rolling restart occurs to remove this node. When Exhibitor registers that
> the zookeeper is available again, Exhibitor initiates a rolling restart to
> bring the node back in. For some reason the data is corrupted or lost on
> that node, and this is the node that becomes the leader.
> The remaining nodes that join this leader then dump their data to match
> the leader.
>
> Does this scenario sound plausible? If a newly added node that does not
> have data replicated to it is added to a zookeeper ensemble, and the
> zookeepers are restarted with the new node becoming the leader, could this
> prompt the data stored in Zookeeper to be deleted?
>
> --
> Daniel S Washko
> Solutions Architect
>
> Phone: 757 667 1463
> dwashko@gannett.com
> gannett.com

--
Cheers
Michael.
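
One way to catch the scenario described above before it bites is to ask each ZooKeeper server individually what it serves for the Solr config root right before (and after) a rolling restart. Below is a minimal sketch using the standard ZooKeeper Java client. The server addresses and the /configs path are assumptions, not something taken from this thread, so adjust them to your ensemble. Connecting to each server directly, rather than through the full connection string, is deliberate: a replica serving an empty or stale view shows up plainly.

import org.apache.zookeeper.ZooKeeper;

import java.util.Arrays;
import java.util.List;

// Rough per-server sanity check: talk to EACH ZooKeeper server directly and
// report what it serves for the Solr config root. A server showing a missing
// or empty /configs just before a rolling restart is a candidate for the
// "empty leader" scenario described above. Addresses and path are assumptions.
public class PreRestartCheck {
    public static void main(String[] args) throws Exception {
        List<String> servers = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181");
        String configRoot = "/configs";   // where Solr Cloud keeps configsets

        for (String server : servers) {
            ZooKeeper zk = new ZooKeeper(server, 15000, event -> { });
            try {
                if (zk.exists(configRoot, false) == null) {
                    System.out.println(server + ": " + configRoot + " is MISSING");
                } else {
                    int count = zk.getChildren(configRoot, false).size();
                    System.out.println(server + ": " + count + " configset(s) under " + configRoot);
                }
            } catch (Exception e) {
                // A server that is down or still syncing typically surfaces here
                // as a connection-loss error; that is useful information too.
                System.out.println(server + ": not answering (" + e + ")");
            } finally {
                zk.close();
            }
        }
    }
}

Comparing the per-server output is the point: if all servers agree on the configset count before the restart and one comes back empty afterwards, you have narrowed the window in which the data vanished.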
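
Given the rolling-restart scenario Daniel describes, it may also be worth taking a cheap snapshot of the Solr config subtree before any Exhibitor-driven restart, so a wiped ensemble can at least be repopulated. The sketch below recursively dumps a znode tree to local files; again, the connection string, the /configs root, and the output directory are assumptions, and this is a rough illustration rather than a production backup tool.

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical snapshot of a znode subtree to local files, meant to run
// before an Exhibitor-driven rolling restart. Each znode becomes a directory
// containing a "data" file, so parent data and child nodes do not collide.
public class ZkTreeDump {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });
        try {
            dump(zk, "/configs", Paths.get("zk-backup"));
        } finally {
            zk.close();
        }
    }

    static void dump(ZooKeeper zk, String znode, Path outDir) throws Exception {
        Path nodeDir = znode.equals("/") ? outDir : outDir.resolve(znode.substring(1));
        Files.createDirectories(nodeDir);

        byte[] data = zk.getData(znode, false, new Stat());
        Files.write(nodeDir.resolve("data"), data == null ? new byte[0] : data);

        for (String child : zk.getChildren(znode, false)) {
            String childPath = znode.equals("/") ? "/" + child : znode + "/" + child;
            dump(zk, childPath, outDir);
        }
    }
}

This does not address the root cause, but it turns "the configurations disappeared" from a reconstruction exercise into a restore.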