Message-ID: <4C06A7A7.6010609@apache.org>
Date: Wed, 02 Jun 2010 11:49:11 -0700
From: Patrick Hunt
To: zookeeper-user@hadoop.apache.org
CC: Charity Majors
Subject: Re: zookeeper crash
References: <500B7BF7-5981-4323-9E87-97BA15889341@shopkick.com>
In-Reply-To: <500B7BF7-5981-4323-9E87-97BA15889341@shopkick.com>

Hi
Charity, unfortunately this is a known issue, not specific to 3.3, that we are working to address. See this thread for some background:
http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html

I've raised the JIRA level to "blocker" to ensure we address this asap.

As Ted suggested, you can remove the datadir -- only on the affected server -- and then restart it. That should resolve the issue (the server will download a snapshot of the current db from the leader).

Patrick

On 06/02/2010 11:11 AM, Charity Majors wrote:
> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to get away from a client bug that was crashing my backend services.
>
> Unfortunately, this morning I had a server crash, and it brought down my entire cluster. I don't have the logs leading up to the crash, because -- argghffbuggle -- log4j wasn't set up correctly. But I restarted all three nodes, and nodes two and three came back up and formed a quorum.
>
> Node one, meanwhile, does this:
>
> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
> 2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election.
My id = 1, Proposed zxid = 47244640287
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is less than our epoch b
> 2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
> java.io.IOException: Error: Epoch of leader is lower
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>
> All I can find is this, http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html, which implies that this state should never happen.
>
> Any suggestions? If it happens again, I'll just have to roll everything back to 3.2.1 and live with the client crashes.
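
For reading the log above: a ZooKeeper zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are a counter within that epoch. A minimal Python sketch of that split, applied to the two zxids in the election notifications (the helper name split_zxid is made up for illustration and is not part of ZooKeeper):

```python
def split_zxid(zxid: int) -> tuple:
    """Split a 64-bit ZooKeeper zxid into (epoch, counter).

    Hypothetical helper for illustration; not a ZooKeeper API.
    """
    epoch = zxid >> 32            # high 32 bits: epoch
    counter = zxid & 0xFFFFFFFF   # low 32 bits: counter within the epoch
    return epoch, counter


if __name__ == "__main__":
    # zxids copied from the FastLeaderElection lines in the quoted log
    votes = [
        ("node 1 proposal", 47244640287),
        ("vote for node 3", 38654707048),
    ]
    for label, zxid in votes:
        epoch, counter = split_zxid(zxid)
        print(f"{label}: zxid=0x{zxid:x} epoch=0x{epoch:x} counter=0x{counter:x}")
```

Node one's zxid decodes to epoch b, and the zxid in the vote for node 3 decodes to epoch 9. Presumably the new leader then moved to epoch a (one past 9), which would match the FATAL line "Leader epoch a is less than our epoch b".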