From: Camille Fournier
To: user@zookeeper.apache.org
Date: Wed, 5 Sep 2012 15:41:18 -0400
Subject: Re: ZooKeeper Cluster Crash resulted in not loadable database

You can try running them through org.apache.zookeeper.server.LogFormatter
and see what comes out. That's where I would start.

C

On Wed, Sep 5, 2012 at 3:43 AM, Gunnar Wagenknecht wrote:

> Hi,
>
> I'm investigating a crash of a ZooKeeper 3.3.4 cluster. It seems that
> the cause of the crash was an issue in the networking layer. All the ZK
> servers suddenly lost their connections to clients as well as to each
> other. Only a few seconds later, all ZooKeeper servers had issues
> loading their database because of the following exception.
>
> ERROR [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@224]
> Failed to increment parent cversion for: /a/b/c
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /a/b/c
>     at DataTree.incrementCversion(DataTree.java:1218)
>     at FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:222)
>     at FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
>     at ZKDatabase.loadDataBase(ZKDatabase.java:222)
>     at QuorumPeer.getLastLoggedZxid(QuorumPeer.java:493)
>     at FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:632)
>     at FastLeaderElection.lookForLeader(FastLeaderElection.java:660)
>     at QuorumPeer.run(QuorumPeer.java:622)
>
> WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@497]
> Unable to load database
>
> Note that the path "/a/b/c" was different on all servers. Thus, each
> server tried to restore a different transaction.
>
> The only way I was able to bring the cluster back online was to delete
> all the transaction logs on all servers and start with the latest snapshot.
>
> I have all the logs and snapshots available for investigation. Are there
> any tools to help an investigation? I'd like to find out how such a
> network outage could possibly cause such an inconsistent/unstable state
> in the system. I noticed a few stability fixes in 3.3.5/3.3.6. Thus, an
> upgrade is already scheduled.
>
> Any help is appreciated.
>
> -Gunnar
>
> --
> Gunnar Wagenknecht
> gunnar@wagenknecht.org
> http://wagenknecht.org/
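
For anyone following the LogFormatter suggestion above, here is a rough sketch of how the saved transaction logs could be dumped in order. It assumes the logs use ZooKeeper's usual "log.<hex zxid>" file names and that the directory passed in is the one that actually contains them (typically the version-2 subdirectory). The helper names txn_logs_in_order and log_formatter_command are made up for this example, and the classpath is a placeholder that has to point at the ZooKeeper jar plus its logging dependencies for your install.

```python
import re
import subprocess
import sys
from pathlib import Path

# Transaction logs are named "log.<hex zxid>"; snapshots ("snapshot.<hex zxid>")
# are deliberately skipped because LogFormatter reads transaction logs.
LOG_NAME = re.compile(r"^log\.([0-9a-fA-F]+)$")


def txn_logs_in_order(data_dir):
    """Return the transaction log paths in data_dir, sorted by the zxid in the name."""
    found = []
    for path in Path(data_dir).iterdir():
        match = LOG_NAME.match(path.name)
        if match:
            found.append((int(match.group(1), 16), path))
    found.sort(key=lambda pair: pair[0])
    return [path for _, path in found]


def log_formatter_command(log_path, classpath):
    """Build the java command that dumps one transaction log via LogFormatter."""
    return [
        "java",
        "-cp",
        classpath,  # assumption: zookeeper jar and its dependencies, supplied by the caller
        "org.apache.zookeeper.server.LogFormatter",
        str(log_path),
    ]


if __name__ == "__main__":
    # Usage: python dump_logs.py <dir containing log.* files> <classpath>
    log_dir, classpath = sys.argv[1], sys.argv[2]
    for log in txn_logs_in_order(log_dir):
        print("==== " + log.name + " ====")
        result = subprocess.run(
            log_formatter_command(log, classpath),
            capture_output=True,
            text=True,
        )
        print(result.stdout)
        if result.returncode != 0:
            print(result.stderr, file=sys.stderr)
```

Dumping the logs oldest-first makes it easier to search the output for the path named in each server's NoNode error and to see which transactions touched it before the restore failed.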