Return-Path: Delivered-To: apmail-hadoop-zookeeper-user-archive@minotaur.apache.org Received: (qmail 80266 invoked from network); 6 Jul 2010 23:58:52 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 6 Jul 2010 23:58:52 -0000 Received: (qmail 92142 invoked by uid 500); 6 Jul 2010 23:58:52 -0000 Delivered-To: apmail-hadoop-zookeeper-user-archive@hadoop.apache.org Received: (qmail 92090 invoked by uid 500); 6 Jul 2010 23:58:51 -0000 Mailing-List: contact zookeeper-user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: zookeeper-user@hadoop.apache.org Delivered-To: mailing list zookeeper-user@hadoop.apache.org Received: (qmail 92077 invoked by uid 99); 6 Jul 2010 23:58:51 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 Jul 2010 23:58:51 +0000 X-ASF-Spam-Status: No, hits=1.3 required=10.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL,URI_HEX X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of traviscrawford@gmail.com designates 209.85.212.48 as permitted sender) Received: from [209.85.212.48] (HELO mail-vw0-f48.google.com) (209.85.212.48) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 Jul 2010 23:58:45 +0000 Received: by vws10 with SMTP id 10so46856vws.35 for ; Tue, 06 Jul 2010 16:58:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:received:in-reply-to :references:date:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=sfi/Xd5Mty+9APAHKAaVV192retYBxl5RqQ+465Q7Jg=; b=F7Cxk5SPKTkY214aAcjAUnsNQgfVEoqBApInFkQpmlWmoB2BROzwaZud2wBzgzv8jV 0uhfeeVBVepEgqIjolqSlBXk96vvuTrI4w2sB9BHO0T5vIAsR9RqE/MM2WTJ9y9ZvxFX ovYY5KV5CWEyZphknO4EjB7YicTwKfAbBk0Zg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=rrlGJi2kn8ESZj7eBBJ4Mp9BVrYvCBME1UunM1rCkV77ax83Ebqe36p/kJnnU5orZW k8w6qCQ2svw/mlv5hBpmaTIGLK9Gt/fsAENapkfL48rs+8zTki8bX5Oe/P5Ix5nCkTKF 0aUOCG4FQziZjgx3hxQF/H3YX6a9VH/aozzf0= MIME-Version: 1.0 Received: by 10.229.246.135 with SMTP id ly7mr3281537qcb.269.1278460703687; Tue, 06 Jul 2010 16:58:23 -0700 (PDT) Received: by 10.229.245.15 with HTTP; Tue, 6 Jul 2010 16:58:23 -0700 (PDT) In-Reply-To: References: <500B7BF7-5981-4323-9E87-97BA15889341@shopkick.com> <4C06A7A7.6010609@apache.org> <14F28058-5BC2-4D9B-ABBA-1F354A3B963E@shopkick.com> <4C06E689.9040909@yahoo-inc.com> <0D07AD74-BFF4-4481-BCFF-F2F6F24DDA57@shopkick.com> <4C194D26.9030105@apache.org> Date: Tue, 6 Jul 2010 16:58:23 -0700 Message-ID: Subject: Re: zookeeper crash From: Travis Crawford To: zookeeper-user@hadoop.apache.org Cc: Vishal K Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org Hey all - I believe we just suffered an outage from this issue. Short version is while restarting quorum members with GC flags recommended in the Troubleshooting wiki page a follower logged messages similar two the following jiras: 2010-07-06 23:14:01,438 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch 20 is less than our epoch 21 2010-07-06 23:14:01,438 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader java.io.IOException: Error: Epoch of leader is lower https://issues.apache.org/jira/browse/ZOOKEEPER-335 https://issues.apache.org/jira/browse/ZOOKEEPER-790 Reading through the jira's its unclear if the issue is well understood at this point (as there's a patch available) or still being understood. If its still being understood let me know and I can attach the relevant log lines to the appropriate jira. Or if the patch appears good I can make a new release and help test. Let me know :) --travis On Wed, Jun 16, 2010 at 3:25 PM, Flavio Junqueira wrote= : > I would recommend opening a separate jira issue. I'm not convinced the > issues are the same, so I'd rather keep them separate and link the issues= if > it is the case. > > -Flavio > > On Jun 17, 2010, at 12:16 AM, Patrick Hunt wrote: > >> We are unable to reproduce this issue. If you can provide the server >> logs (all servers) and attach them to the jira it would be very helpful. >> Some detail on the approx time of the issue so we can correlate to the >> logs would help too (summary of what you did/do to cause it, etc... >> anything that might help us nail this one down). >> >> https://issues.apache.org/jira/browse/ZOOKEEPER-335 >> >> Some detail on ZK version, OS, Java version, HW info, etc... would also >> be of use to us. >> >> Patrick >> >> On 06/16/2010 02:49 PM, Vishal K wrote: >>> >>> Hi, >>> >>> We are running into this bug very often (almost 60-75% hit rate) while >>> testing our newly developed application over ZK. This is almost a block= er >>> for us. Will the fix be simplified if backward compatibility was not an >>> issue? >>> >>> Considering that this bug is rarely reported, I am wondering why we are >>> running into this problem so often. Also, on a side note, I am curious >>> why >>> the systest that comes with ZooKeeper did not detect this bug. Can anyo= ne >>> please give an overview of the problem? >>> >>> Thanks. >>> -Vishal >>> >>> >>> On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors >>> =A0wrote: >>> >>>> Sure thing. >>>> >>>> We got paged this morning because backend services were not able to >>>> write >>>> to the database. =A0Each server discovers the DB master using zookeepe= r, >>>> so >>>> when zookeeper goes down, they assume they no longer know who the DB >>>> master >>>> is and stop working. >>>> >>>> When we realized there were no problems with the database, we logged i= n >>>> to >>>> the zookeeper nodes. =A0We weren't able to connect to zookeeper using >>>> zkCli.sh >>>> from any of the three nodes, so we decided to restart them all, starti= ng >>>> with node one. =A0However, after restarting node one, the cluster star= ted >>>> responding normally again. >>>> >>>> (The timestamps on the zookeeper processes on nodes two and three *are= * >>>> dated today, but none of us restarted them. =A0We checked shell histor= ies >>>> and >>>> sudo logs, and they seem to back us up.) >>>> >>>> We tried getting node one to come back up and join the cluster, but >>>> that's >>>> when we realized we weren't getting any logs, because log4j.properties >>>> was >>>> in the wrong location. =A0Sorry -- I REALLY wish I had those logs for = you. >>>> =A0We >>>> put log4j back in place, and that's when we saw the spew I pasted in m= y >>>> first message. >>>> >>>> I'll tack this on to ZK-335. >>>> >>>> >>>> >>>> On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote: >>>> >>>>> charity, do you mind going through your scenario again to give a >>>>> timeline for the failure? i'm a bit confused as to what happened. >>>>> >>>>> ben >>>>> >>>>> On 06/02/2010 01:32 PM, Charity Majors wrote: >>>>>> >>>>>> Thanks. =A0That worked for me. =A0I'm a little confused about why it= threw >>>> >>>> the entire cluster into an unusable state, though. >>>>>> >>>>>> I said before that we restarted all three nodes, but tracing back, w= e >>>> >>>> actually didn't. =A0The zookeeper cluster was refusing all connections >>>> until >>>> we restarted node one. =A0But once node one had been dropped from the >>>> cluster, >>>> the other two nodes formed a quorum and started responding to queries = on >>>> their own. >>>>>> >>>>>> Is that expected as well? =A0I didn't see it in ZOOKEEPER-335, so >>>>>> thought >>>> >>>> I'd mention it. >>>>>> >>>>>> >>>>>> >>>>>> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote: >>>>>> >>>>>> >>>>>>> Hi Charity, unfortunately this is a known issue not specific to 3.3 >>>> >>>> that >>>>>>> >>>>>>> we are working to address. See this thread for some background: >>>>>>> >>>>>>> >>>> >>>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761= .html >>>>>>> >>>>>>> I've raised the JIRA level to "blocker" to ensure we address this >>>>>>> asap. >>>>>>> >>>>>>> As Ted suggested you can remove the datadir -- only on the effected >>>>>>> server -- and then restart it. That should resolve the issue (the >>>> >>>> server >>>>>>> >>>>>>> will d/l a snapshot of the current db from the leader). >>>>>>> >>>>>>> Patrick >>>>>>> >>>>>>> On 06/02/2010 11:11 AM, Charity Majors wrote: >>>>>>> >>>>>>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in = an >>>> >>>> attempt to get away from a client bug that was crashing my backend >>>> services. >>>>>>>> >>>>>>>> Unfortunately, this morning I had a server crash, and it brought >>>>>>>> down >>>> >>>> my entire cluster. =A0I don't have the logs leading up to the crash, >>>> because >>>> -- argghffbuggle -- log4j wasn't set up correctly. =A0But I restarted = all >>>> three nodes, and odes two and three came back up and formed a quorum. >>>>>>>> >>>>>>>> Node one, meanwhile, does this: >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,446 - INFO >>>> >>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,446 - INFO >>>> >>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot >>>> /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045 >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,476 - INFO >>>> >>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New >>>> election. >>>> My id =3D =A01, Proposed zxid =3D 47244640287 >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,486 - INFO >>>> >>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - >>>> Notification: >>>> 1, 47244640287, 4, 1, LOOKING, LOOKING, 1 >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,486 - INFO >>>> >>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - >>>> Notification: >>>> 3, 38654707048, 3, 1, LOOKING, LEADING, 3 >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,486 - INFO >>>> >>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - >>>> Notification: >>>> 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2 >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,486 - INFO >>>> >>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,486 - INFO >>>> >>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created se= rver >>>> with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 >>>> datadir >>>> /services/zookeeper/data/zookeeper/version-2 snapdir >>>> /services/zookeeper/data/zookeeper/version-2 >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,486 - FATAL >>>> >>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is les= s >>>> than our epoch b >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,486 - WARN >>>> >>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when >>>> following >>>> the leader >>>>>>>> >>>>>>>> java.io.IOException: Error: Epoch of leader is lower >>>>>>>> =A0 =A0 =A0 at >>>> >>>> >>>> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java= :73) >>>>>>>> >>>>>>>> =A0 =A0 =A0 at >>>> >>>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644) >>>>>>>> >>>>>>>> 2010-06-02 17:04:56,486 - INFO >>>> >>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called >>>>>>>> >>>>>>>> java.lang.Exception: shutdown Follower >>>>>>>> =A0 =A0 =A0 at >>>> >>>> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166= ) >>>>>>>> >>>>>>>> =A0 =A0 =A0 at >>>> >>>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> All I can find is this, >>>> >>>> >>>> http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg004= 49.html, >>>> which implies that this state should never happen. >>>>>>>> >>>>>>>> Any suggestions? =A0If it happens again, I'll just have to roll >>>> >>>> everything back to 3.2.1 and live with the client crashes. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>>> >>> > >