hadoop-zookeeper-user mailing list archives

From: Patrick Hunt <ph...@apache.org>
Subject: Re: zookeeper crash
Date: Wed, 16 Jun 2010 22:16:06 GMT
We are unable to reproduce this issue. If you can provide the server 
logs (all servers) and attach them to the jira it would be very helpful. 
Some detail on the approx time of the issue so we can correlate to the 
logs would help too (summary of what you did/do to cause it, etc... 
anything that might help us nail this one down).

https://issues.apache.org/jira/browse/ZOOKEEPER-335

Some detail on ZK version, OS, Java version, HW info, etc... would also 
be of use to us.

Patrick

On 06/16/2010 02:49 PM, Vishal K wrote:
> Hi,
>
> We are running into this bug very often (almost 60-75% hit rate) while
> testing our newly developed application over ZK. This is almost a blocker
> for us. Would the fix be simpler if backward compatibility were not an
> issue?
>
> Considering that this bug is rarely reported, I am wondering why we are
> running into this problem so often. Also, on a side note, I am curious why
> the systest that comes with ZooKeeper did not detect this bug. Can anyone
> please give an overview of the problem?
>
> Thanks.
> -Vishal
>
>
> On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors <charity@shopkick.com> wrote:
>
>> Sure thing.
>>
>> We got paged this morning because backend services were not able to write
>> to the database.  Each server discovers the DB master using zookeeper, so
>> when zookeeper goes down, they assume they no longer know who the DB master
>> is and stop working.
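
That discovery pattern, sketched against the standard ZooKeeper Java client --
note the /db/master znode path, session timeout, and ensemble string below are
illustrative guesses, not Charity's actual setup:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class DbMasterLocator implements Watcher {

        private final ZooKeeper zk;

        public DbMasterLocator(String ensemble) throws Exception {
            // One session against the ensemble, e.g. "zk1:2181,zk2:2181,zk3:2181".
            // If the ensemble loses quorum, the session cannot be served and
            // reads fail -- which is why the backend services stopped writing.
            zk = new ZooKeeper(ensemble, 30000, this);
        }

        // Read the current DB master's address from a well-known znode and
        // leave a watch so we hear about the next failover.
        public String currentMaster() throws KeeperException, InterruptedException {
            byte[] data = zk.getData("/db/master", this, null); // hypothetical znode
            return new String(data);
        }

        @Override
        public void process(WatchedEvent event) {
            // On NodeDataChanged, re-read the master; on Disconnected, treat
            // the master as unknown and stop writing until reconnected.
        }
    }

When quorum is lost, "master unknown, stop writing" is the only safe behavior
for a client built this way, which matches what Charity describes.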
>>
>> When we realized there were no problems with the database, we logged in to
>> the zookeeper nodes.  We weren't able to connect to zookeeper using zkCli.sh
>> from any of the three nodes, so we decided to restart them all, starting
>> with node one.  However, after restarting node one, the cluster started
>> responding normally again.
>>
>> (The timestamps on the zookeeper processes on nodes two and three *are*
>> dated today, but none of us restarted them.  We checked shell histories and
>> sudo logs, and they seem to back us up.)
>>
>> We tried getting node one to come back up and join the cluster, but that's
>> when we realized we weren't getting any logs, because log4j.properties was
>> in the wrong location.  Sorry -- I REALLY wish I had those logs for you.  We
>> put log4j back in place, and that's when we saw the spew I pasted in my
>> first message.
>>
>> I'll tack this on to ZK-335.
>>
>> On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:
>>
>>> charity, do you mind going through your scenario again to give a
>>> timeline for the failure? i'm a bit confused as to what happened.
>>>
>>> ben
>>>
>>> On 06/02/2010 01:32 PM, Charity Majors wrote:
>>>> Thanks.  That worked for me.  I'm a little confused about why it threw
>>>> the entire cluster into an unusable state, though.
>>>>
>>>> I said before that we restarted all three nodes, but tracing back, we
>>>> actually didn't.  The zookeeper cluster was refusing all connections until
>>>> we restarted node one.  But once node one had been dropped from the cluster,
>>>> the other two nodes formed a quorum and started responding to queries on
>>>> their own.
>>>>
>>>> Is that expected as well?  I didn't see it in ZOOKEEPER-335, so thought
>>>> I'd mention it.
>>>>
>>>> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>>>>
>>>>> Hi Charity, unfortunately this is a known issue not specific to 3.3
>>>>> that we are working to address. See this thread for some background:
>>>>>
>>>>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
>>>>>
>>>>> I've raised the JIRA level to "blocker" to ensure we address this asap.
>>>>>
>>>>> As Ted suggested, you can remove the datadir -- only on the affected
>>>>> server -- and then restart it. That should resolve the issue (the
>>>>> server will d/l a snapshot of the current db from the leader).
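
For anyone scripting that recovery, it boils down to the sketch below
(ordinarily just an rm -rf from the shell). The path is the dataDir from the
log later in this thread; stop the affected server first, and run this on
that one server only:

    import java.io.File;

    public class WipeZkDataDir {

        // Depth-first delete so directories are empty before removal.
        static void deleteRecursively(File f) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File child : children) {
                    deleteRecursively(child);
                }
            }
            if (!f.delete()) {
                System.err.println("could not delete " + f);
            }
        }

        public static void main(String[] args) {
            // On restart the server rejoins the quorum and pulls a fresh
            // snapshot from the leader instead of loading its own bad state.
            deleteRecursively(new File("/services/zookeeper/data/zookeeper/version-2"));
        }
    }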
>>>>>
>>>>> Patrick
>>>>>
>>>>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>>>>
>>>>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an
>>>>>> attempt to get away from a client bug that was crashing my backend
>>>>>> services.
>>>>>>
>>>>>> Unfortunately, this morning I had a server crash, and it brought down
>>>>>> my entire cluster.  I don't have the logs leading up to the crash,
>>>>>> because -- argghffbuggle -- log4j wasn't set up correctly.  But I
>>>>>> restarted all three nodes, and nodes two and three came back up and
>>>>>> formed a quorum.
>>>>>>
>>>>>> Node one, meanwhile, does this:
>>>>>>
>>>>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
>>>>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
>>>>>> 2010-06-02 17:04:56,476 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id =  1, Proposed zxid = 47244640287
>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
>>>>>> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is less than our epoch b
>>>>>> 2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
>>>>>> java.io.IOException: Error: Epoch of leader is lower
>>>>>>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>>>>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>>>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>>>>> java.lang.Exception: shutdown Follower
>>>>>>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
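
The a and b in that FATAL line are hex epochs, and the epoch is simply the
high 32 bits of the zxid. Decoding the two zxids from the election
notifications above makes the disagreement visible; this snippet only
illustrates the arithmetic, it is not the actual Follower check:

    public class ZxidEpochs {

        // ZooKeeper packs the leader epoch into the high 32 bits of the zxid;
        // the low 32 bits are a counter within that epoch.
        static long epochOf(long zxid) {
            return zxid >>> 32;
        }

        public static void main(String[] args) {
            long nodeOne = 47244640287L;  // node one's proposed zxid -> epoch 0xb
            long leader  = 38654707048L;  // zxid in the leader's notification -> epoch 0x9

            System.out.printf("node one epoch = %x, leader epoch = %x%n",
                    epochOf(nodeOne), epochOf(leader));

            // A follower must never follow a leader from an earlier epoch --
            // hence the IOException "Error: Epoch of leader is lower" above.
            if (epochOf(leader) < epochOf(nodeOne)) {
                System.out.println("refusing to follow a stale leader");
            }
        }
    }

So node one came back believing it had reached epoch 0xb while the quorum was
still advertising older epochs; refusing to follow is the correct follower
behavior, and how the epochs diverged in the first place is what
ZOOKEEPER-335 is tracking.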
>>>>>>
>>>>>> All I can find is this,
>>>>>> http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html,
>>>>>> which implies that this state should never happen.
>>>>>>
>>>>>> Any suggestions?  If it happens again, I'll just have to roll
>>>>>> everything back to 3.2.1 and live with the client crashes.
