Mailing-List: contact zookeeper-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: zookeeper-user@hadoop.apache.org
Cc: Patrick Hunt <phunt@apache.org>
Message-Id: <DF0E7C25-866F-45CB-B7E6-29402C4F759B@yahoo-inc.com>
From: Flavio Junqueira <fpj@yahoo-inc.com>
To: "zookeeper-user@hadoop.apache.org" <zookeeper-user@hadoop.apache.org>
In-Reply-To: <14F28058-5BC2-4D9B-ABBA-1F354A3B963E@shopkick.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v936)
Subject: Re: zookeeper crash
Date: Wed, 2 Jun 2010 22:54:03 +0200
References: <500B7BF7-5981-4323-9E87-97BA15889341@shopkick.com>
 <4C06A7A7.6010609@apache.org>
 <14F28058-5BC2-4D9B-ABBA-1F354A3B963E@shopkick.com>

Hi Charity, this is certainly not expected. It would be very useful if  
you could provide us with as much information about your issue as  
possible. I would suggest that you either create a new JIRA and link  
it to ZOOKEEPER-335, or add to 335 directly.

We'll be looking further into why you have seen this problem and  
working on a fix.

Cheers,
-Flavio

On Jun 2, 2010, at 10:32 PM, Charity Majors wrote:

> Thanks.  That worked for me.  I'm a little confused about why it  
> threw the entire cluster into an unusable state, though.
>
> I said before that we restarted all three nodes, but tracing back,  
> we actually didn't.  The zookeeper cluster was refusing all  
> connections until we restarted node one.  But once node one had been  
> dropped from the cluster, the other two nodes formed a quorum and  
> started responding to queries on their own.
>
> Is that expected as well?  I didn't see it in ZOOKEEPER-335, so  
> thought I'd mention it.
>
>
>
> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>
>> Hi Charity, unfortunately this is a known issue, not specific to 3.3,
>> that we are working to address. See this thread for some background:
>>
>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
>>
>> I've raised the JIRA level to "blocker" to ensure we address this asap.
>>
>> As Ted suggested, you can remove the datadir -- only on the affected
>> server -- and then restart it. That should resolve the issue (the
>> server will d/l a snapshot of the current db from the leader).
>>
>> Patrick
>>
>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in  
>>> an attempt to get away from a client bug that was crashing my  
>>> backend services.
>>>
>>> Unfortunately, this morning I had a server crash, and it brought  
>>> down my entire cluster.  I don't have the logs leading up to the  
>>> crash, because -- argghffbuggle -- log4j wasn't set up correctly.   
>>> But I restarted all three nodes, and nodes two and three came back  
>>> up and formed a quorum.
>>>
>>> Node one, meanwhile, does this:
>>>
>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
>>> 2010-06-02 17:04:56,476 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id =  1, Proposed zxid = 47244640287
>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
>>> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is less than our epoch b
>>> 2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
>>> java.io.IOException: Error: Epoch of leader is lower
>>>       at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>>>       at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>> java.lang.Exception: shutdown Follower
>>>       at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>       at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>>>
>>>
>>>
>>> All I can find is this,
>>> http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html,
>>> which implies that this state should never happen.
>>>
>>> Any suggestions?  If it happens again, I'll just have to roll  
>>> everything back to 3.2.1 and live with the client crashes.
>>>
>>>
>>>
>>>
>
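
A note for anyone reading the election log quoted above: the zxids appear there as decimal numbers, while the FATAL line and the snapshot file name use hex. A ZooKeeper zxid is a 64-bit value with the epoch in the high 32 bits and a counter in the low 32 bits, so the two forms can be compared after splitting. The following is only a minimal sketch for reading the log; split_zxid is a hypothetical helper name, not part of ZooKeeper, and the input values are copied from the log above.

```python
def split_zxid(zxid):
    """Split a 64-bit zxid into (epoch, counter).

    Hypothetical helper for log reading only; not a ZooKeeper API.
    """
    return zxid >> 32, zxid & 0xFFFFFFFF


# Values taken verbatim from the quoted log.
samples = [
    ("node 1 proposed zxid", 47244640287),
    ("zxid in votes for server 3", 38654707048),
    ("snapshot.a00000045", int("a00000045", 16)),
]

for label, zxid in samples:
    epoch, counter = split_zxid(zxid)
    print(f"{label}: epoch=0x{epoch:x} counter=0x{counter:x}")
```

Node one's proposed zxid decodes to epoch b, which matches "our epoch b" in the FATAL line, while the zxid carried in the votes for server 3 decodes to epoch 9. The sketch does not explain how node one got ahead of the rest of the cluster; it only makes the numbers in the log comparable.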