From: Charity Majors
To: zookeeper-user@hadoop.apache.org
Date: Wed, 2 Jun 2010 17:17:37 -0700
Subject: Re: zookeeper crash

Sure thing.

We got paged this morning because backend services were not able to write to the database. Each server discovers the DB master using zookeeper, so when zookeeper goes down, they assume they no longer know who the DB master is and stop working.

When we realized there were no problems with the database, we logged in to the zookeeper nodes. We weren't able to connect to zookeeper using zkCli.sh from any of the three nodes, so we decided to restart them all, starting with node one. However, after restarting node one, the cluster started responding normally again.

(The timestamps on the zookeeper processes on nodes two and three *are* dated today, but none of us restarted them. We checked shell histories and sudo logs, and they seem to back us up.)

We tried getting node one to come back up and join the cluster, but that's when we realized we weren't getting any logs, because log4j.properties was in the wrong location. Sorry -- I REALLY wish I had those logs for you. We put log4j back in place, and that's when we saw the spew I pasted in my first message.

I'll tack this on to ZK-335.
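In case it helps with the timeline: the discovery logic in each backend boils down to something like the sketch below (simplified; the znode path, class name, and timeout are illustrative, not our production code). When the whole ensemble stops answering, currentMaster() throws on every backend at once, and they all stop writing -- which is exactly the page we got.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DbMasterLocator implements Watcher {
    private static final String MASTER_PATH = "/db/master";  // illustrative path
    private final ZooKeeper zk;

    public DbMasterLocator(String connectString) throws Exception {
        // 10s session timeout; this object receives watch events.
        zk = new ZooKeeper(connectString, 10000, this);
    }

    // Read "host:port" of the current DB master, leaving a watch set so we
    // hear about changes. If the ensemble is unreachable, this throws and
    // the backend stops accepting writes.
    public String currentMaster() throws Exception {
        byte[] data = zk.getData(MASTER_PATH, true, null);
        return new String(data, "UTF-8");
    }

    public void process(WatchedEvent event) {
        // On NodeDataChanged or a disconnect, callers re-read the master znode.
    }
}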
On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:

> charity, do you mind going through your scenario again to give a
> timeline for the failure? i'm a bit confused as to what happened.
>
> ben
>
> On 06/02/2010 01:32 PM, Charity Majors wrote:
>> Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state, though.
>>
>> I said before that we restarted all three nodes, but tracing back, we actually didn't. The zookeeper cluster was refusing all connections until we restarted node one. But once node one had been dropped from the cluster, the other two nodes formed a quorum and started responding to queries on their own.
>>
>> Is that expected as well? I didn't see it in ZOOKEEPER-335, so thought I'd mention it.
>>
>> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>>
>>> Hi Charity, unfortunately this is a known issue, not specific to 3.3, that
>>> we are working to address. See this thread for some background:
>>>
>>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
>>>
>>> I've raised the JIRA level to "blocker" to ensure we address this asap.
>>>
>>> As Ted suggested, you can remove the datadir -- only on the affected
>>> server -- and then restart it. That should resolve the issue (the server
>>> will d/l a snapshot of the current db from the leader).
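>>> Concretely, the recovery is something like the following (a sketch, not
>>> exact commands: the dataDir is taken from the log output in your first
>>> message, and the zkServer.sh paths assume a stock install):
>>>
>>>   # On the affected server ONLY; leave the healthy quorum members alone.
>>>   bin/zkServer.sh stop
>>>   # Move the data dir aside instead of deleting it, so nothing is lost.
>>>   mv /services/zookeeper/data/zookeeper/version-2 \
>>>      /services/zookeeper/data/zookeeper/version-2.bad
>>>   bin/zkServer.sh start
>>>   # On restart the server pulls a fresh snapshot from the current leader.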
>>>
>>> Patrick
>>>
>>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>>
>>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to get away from a client bug that was crashing my backend services.
>>>>
>>>> Unfortunately, this morning I had a server crash, and it brought down my entire cluster. I don't have the logs leading up to the crash, because -- argghffbuggle -- log4j wasn't set up correctly. But I restarted all three nodes, and nodes two and three came back up and formed a quorum.
>>>>
>>>> Node one, meanwhile, does this:
>>>>
>>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
>>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
>>>> 2010-06-02 17:04:56,476 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id = 1, Proposed zxid = 47244640287
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
>>>> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is less than our epoch b
>>>> 2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
>>>> java.io.IOException: Error: Epoch of leader is lower
>>>>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>>> java.lang.Exception: shutdown Follower
>>>>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>>>>
>>>> All I can find is this, http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html, which implies that this state should never happen.
>>>>
>>>> Any suggestions? If it happens again, I'll just have to roll everything back to 3.2.1 and live with the client crashes.
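>>>>
>>>> For what it's worth, my reading of the check we're tripping (a simplified
>>>> sketch of the follower-side logic, not the actual Follower.java source) is:
>>>>
>>>> // Simplified sketch; method and variable names are illustrative.
>>>> // A zxid is 64 bits: the high 32 bits are the epoch, the low 32 a counter.
>>>> static void checkLeaderEpoch(long leaderZxid, long lastLoggedZxid)
>>>>         throws java.io.IOException {
>>>>     long leaderEpoch = leaderZxid >> 32L;
>>>>     long myEpoch = lastLoggedZxid >> 32L;
>>>>     if (leaderEpoch < myEpoch) {
>>>>         // e.g. leader epoch 0xa vs our epoch 0xb, as in the FATAL line above
>>>>         throw new java.io.IOException("Error: Epoch of leader is lower");
>>>>     }
>>>> }
>>>>
>>>> If that's right, node one has logged transactions under epoch 0xb while the
>>>> other two nodes still agree on a leader from epoch 0xa, so it refuses to follow.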