hadoop-zookeeper-user mailing list archives

From: Flavio Junqueira <...@yahoo-inc.com>
Subject: Re: zookeeper crash
Date: Wed, 16 Jun 2010 22:25:03 GMT
I would recommend opening a separate JIRA issue. I'm not convinced the
issues are the same, so I'd rather keep them separate and link them if
that turns out to be the case.

-Flavio

On Jun 17, 2010, at 12:16 AM, Patrick Hunt wrote:

> We are unable to reproduce this issue. If you can provide the server
> logs (all servers) and attach them to the jira it would be very helpful.
> Some detail on the approximate time of the issue so we can correlate to
> the logs would help too (a summary of what you did/do to cause it, etc...
> anything that might help us nail this one down).
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-335
>
> Some detail on ZK version, OS, Java version, HW info, etc... would also
> be of use to us.
>
> Patrick
>
> On 06/16/2010 02:49 PM, Vishal K wrote:
>> Hi,
>>
>> We are running into this bug very often (almost a 60-75% hit rate) while
>> testing our newly developed application over ZK. This is almost a blocker
>> for us. Will the fix be simplified if backward compatibility were not an
>> issue?
>>
>> Considering that this bug is rarely reported, I am wondering why we are
>> running into this problem so often. Also, on a side note, I am curious
>> why the systest that comes with ZooKeeper did not detect this bug. Can
>> anyone please give an overview of the problem?
>>
>> Thanks.
>> -Vishal
>>
>>
>> On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors <charity@shopkick.com> wrote:
>>
>>> Sure thing.
>>>
>>> We got paged this morning because backend services were not able to
>>> write to the database.  Each server discovers the DB master using
>>> zookeeper, so when zookeeper goes down, they assume they no longer know
>>> who the DB master is and stop working.
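>>>
>>> (For context, the lookup is essentially the sketch below -- simplified,
>>> not our real code, and the /db/master path is made up:
>>>
>>>     import org.apache.zookeeper.ZooKeeper;
>>>
>>>     public class MasterLookup {
>>>         // Returns the current DB master, or null if ZK is unreachable.
>>>         static String currentMaster(ZooKeeper zk) {
>>>             try {
>>>                 return new String(zk.getData("/db/master", false, null));
>>>             } catch (Exception e) {
>>>                 return null;  // callers treat null as "stop writing"
>>>             }
>>>         }
>>>     }
>>>
>>> So a zookeeper outage immediately stalls every backend writer.)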
>>>
>>> When we realized there were no problems with the database, we logged in
>>> to the zookeeper nodes.  We weren't able to connect to zookeeper using
>>> zkCli.sh from any of the three nodes, so we decided to restart them all,
>>> starting with node one.  However, after restarting node one, the cluster
>>> started responding normally again.
>>>
>>> (The timestamps on the zookeeper processes on nodes two and three *are*
>>> dated today, but none of us restarted them.  We checked shell histories
>>> and sudo logs, and they seem to back us up.)
>>>
>>> We tried getting node one to come back up and join the cluster, but
>>> that's when we realized we weren't getting any logs, because
>>> log4j.properties was in the wrong location.  Sorry -- I REALLY wish I
>>> had those logs for you.  We put log4j back in place, and that's when we
>>> saw the spew I pasted in my first message.
>>>
>>> I'll tack this on to ZK-335.
>>>
>>>
>>>
>>> On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:
>>>
>>>> charity, do you mind going through your scenario again to give a
>>>> timeline for the failure? i'm a bit confused as to what happened.
>>>>
>>>> ben
>>>>
>>>> On 06/02/2010 01:32 PM, Charity Majors wrote:
>>>>> Thanks.  That worked for me.  I'm a little confused about why it
>>>>> threw the entire cluster into an unusable state, though.
>>>>>
>>>>> I said before that we restarted all three nodes, but tracing back, we
>>>>> actually didn't.  The zookeeper cluster was refusing all connections
>>>>> until we restarted node one.  But once node one had been dropped from
>>>>> the cluster, the other two nodes formed a quorum and started
>>>>> responding to queries on their own.
>>>>>
>>>>> Is that expected as well?  I didn't see it in ZOOKEEPER-335, so
>>>>> thought I'd mention it.
>>>>>
>>>>>
>>>>>
>>>>> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>>>>>
>>>>>
>>>>>> Hi Charity, unfortunately this is a known issue, not specific to
>>>>>> 3.3, that we are working to address. See this thread for some
>>>>>> background:
>>>>>>
>>>>>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
>>>>>>
>>>>>> I've raised the JIRA level to "blocker" to ensure we address this
>>>>>> asap.
>>>>>>
>>>>>> As Ted suggested you can remove the datadir -- only on the affected
>>>>>> server -- and then restart it. That should resolve the issue (the
>>>>>> server will d/l a snapshot of the current db from the leader).
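>>>>>>
>>>>>> The cleanup amounts to something like this rough sketch (it assumes
>>>>>> the datadir path from the log below -- adjust to your layout, and
>>>>>> stop the server before running it):
>>>>>>
>>>>>>     import java.io.File;
>>>>>>
>>>>>>     public class WipeDataDir {
>>>>>>         public static void main(String[] args) {
>>>>>>             // Snapshots and txn logs live under <datadir>/version-2;
>>>>>>             // the server re-syncs them from the leader on restart.
>>>>>>             File dir = new File("/services/zookeeper/data/zookeeper/version-2");
>>>>>>             File[] files = dir.listFiles();
>>>>>>             if (files != null) {
>>>>>>                 for (File f : files) {
>>>>>>                     f.delete();
>>>>>>                 }
>>>>>>             }
>>>>>>         }
>>>>>>     }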
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>>>>>
>>>>>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in
>>>>>>> an attempt to get away from a client bug that was crashing my
>>>>>>> backend services.
>>>>>>>
>>>>>>> Unfortunately, this morning I had a server crash, and it brought
>>>>>>> down my entire cluster.  I don't have the logs leading up to the
>>>>>>> crash, because -- argghffbuggle -- log4j wasn't set up correctly.
>>>>>>> But I restarted all three nodes, and nodes two and three came back
>>>>>>> up and formed a quorum.
>>>>>>>
>>>>>>> Node one, meanwhile, does this:
>>>>>>>
>>>>>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
>>>>>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
>>>>>>> 2010-06-02 17:04:56,476 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id =  1, Proposed zxid = 47244640287
>>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
>>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
>>>>>>> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is less than our epoch b
>>>>>>> 2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
>>>>>>> java.io.IOException: Error: Epoch of leader is lower
>>>>>>>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>>>>>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>>>>>> java.lang.Exception: shutdown Follower
>>>>>>>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>>>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> All I can find is this,
>>>>>>> http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html,
>>>>>>> which implies that this state should never happen.
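>>>>>>>
>>>>>>> As a sanity check on those numbers: the high 32 bits of a zxid are
>>>>>>> the leader epoch and the low 32 bits are a counter, so the epochs
>>>>>>> can be decoded straight from the log above. This little snippet
>>>>>>> (mine, not from the ZooKeeper source) shows node one sitting on a
>>>>>>> later epoch than the rest of the ensemble:
>>>>>>>
>>>>>>>     public class EpochCheck {
>>>>>>>         // High 32 bits of a zxid = leader epoch, low 32 = counter.
>>>>>>>         static long epoch(long zxid) { return zxid >>> 32; }
>>>>>>>
>>>>>>>         public static void main(String[] args) {
>>>>>>>             long ours   = 47244640287L;  // node one's proposed zxid
>>>>>>>             long leader = 38654707048L;  // zxid in the leader's notification
>>>>>>>             // Prints ours=0xb leader=0x9: node one claims a later
>>>>>>>             // epoch than the quorum, which matches the FATAL above.
>>>>>>>             System.out.printf("ours=0x%x leader=0x%x%n",
>>>>>>>                               epoch(ours), epoch(leader));
>>>>>>>         }
>>>>>>>     }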
>>>>>>>
>>>>>>> Any suggestions?  If it happens again, I'll just have to roll
>>>>>>> everything back to 3.2.1 and live with the client crashes.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>>
>>

