From: Flavio Junqueira
To: "zookeeper-user@hadoop.apache.org"
Cc: Patrick Hunt
Subject: Re: zookeeper crash
Date: Wed, 2 Jun 2010 22:54:03 +0200

Hi Charity,

This is certainly not expected. It would be very useful if you could
provide us with as much information about your issue as possible. I
would suggest that you either create a new JIRA and link it to
ZOOKEEPER-335, or add to 335 directly. We'll be looking further into
why you have seen this problem and working on a fix.

Cheers,
-Flavio

On Jun 2, 2010, at 10:32 PM, Charity Majors wrote:

> Thanks. That worked for me. I'm a little confused about why it
> threw the entire cluster into an unusable state, though.
>
> I said before that we restarted all three nodes, but tracing back,
> we actually didn't. The zookeeper cluster was refusing all
> connections until we restarted node one. But once node one had been
> dropped from the cluster, the other two nodes formed a quorum and
> started responding to queries on their own.
>
> Is that expected as well? I didn't see it in ZOOKEEPER-335, so
> thought I'd mention it.
>
> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>
>> Hi Charity, unfortunately this is a known issue not specific to 3.3
>> that we are working to address. See this thread for some background:
>>
>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
>>
>> I've raised the JIRA level to "blocker" to ensure we address this
>> asap.
>>
>> As Ted suggested you can remove the datadir -- only on the affected
>> server -- and then restart it. That should resolve the issue (the
>> server will d/l a snapshot of the current db from the leader).
>>
>> Patrick
>>
>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in
>>> an attempt to get away from a client bug that was crashing my
>>> backend services.
>>>
>>> Unfortunately, this morning I had a server crash, and it brought
>>> down my entire cluster. I don't have the logs leading up to the
>>> crash, because -- argghffbuggle -- log4j wasn't set up correctly.
>>> But I restarted all three nodes, and nodes two and three came back
>>> up and formed a quorum.
>>>
>>> Node one, meanwhile, does this:
>>>
>>> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
>>> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
>>> 2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id = 1, Proposed zxid = 47244640287
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
>>> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is less than our epoch b
>>> 2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
>>> java.io.IOException: Error: Epoch of leader is lower
>>>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>> java.lang.Exception: shutdown Follower
>>>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>>>
>>> All I can find is this,
>>> http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html
>>> , which implies that this state should never happen.
>>>
>>> Any suggestions? If it happens again, I'll just have to roll
>>> everything back to 3.2.1 and live with the client crashes.
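
For anyone reading the quoted log, the decimal zxids are easier to compare once split into their two halves. A ZooKeeper zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are a counter within that epoch. The sketch below is only an illustration in Python; the helper name split_zxid is made up here and is not part of ZooKeeper. It assumes the second field of each "Notification" line is the zxid being proposed, which is how the log appears to read.

```python
def split_zxid(zxid):
    """Split a 64-bit zxid into (epoch, counter).

    split_zxid is a hypothetical helper for this note, not a ZooKeeper API.
    High 32 bits: epoch. Low 32 bits: counter within that epoch.
    """
    return zxid >> 32, zxid & 0xFFFFFFFF


# Values copied from the quoted log; the labels are descriptive only.
samples = [
    ("node 1 proposed zxid", 47244640287),
    ("zxid in notifications from nodes 2 and 3", 38654707048),
    ("suffix of the snapshot file node 1 read", 0xA00000045),
]

for label, zxid in samples:
    epoch, counter = split_zxid(zxid)
    print(f"{label}: epoch={epoch:x} counter={counter:x}")
```

Decoded this way, node one's proposed zxid falls in epoch b, which matches the "our epoch b" in the FATAL line, while the zxid carried in the notifications from the other two nodes falls in epoch 9 and the snapshot node one loaded is from epoch a.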