hadoop-zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: zookeeper crash
Date Wed, 02 Jun 2010 18:49:11 GMT
Hi Charity, unfortunately this is a known issue not specific to 3.3 that 
we are working to address. See this thread for some background:

http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html

I've raised the JIRA's priority to "blocker" to ensure we address this ASAP.

As Ted suggested, you can remove the datadir (only on the affected
server) and then restart it. That should resolve the issue: the server
will download a snapshot of the current database from the leader.
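One cautious way to carry out that step, sketched in Python under assumptions not stated in this thread: instead of deleting anything, it moves the version-2 directory aside so the old snapshots and transaction logs stay available for inspection, and it leaves the myid file in the data directory alone so the server keeps its identity. It assumes the layout visible in node one's log (dataDir /services/zookeeper/data/zookeeper, with snapshots and logs under version-2), that the affected server is already stopped, and that the script runs as the user ZooKeeper runs as. The helper name move_aside_txn_data is hypothetical.

```python
import os
import shutil
import time


def move_aside_txn_data(data_dir: str) -> str:
    """Move <data_dir>/version-2 to a timestamped backup and recreate it empty.

    Hypothetical helper. Assumes the ZooKeeper server using data_dir is
    stopped. The myid file directly under data_dir is not touched, so the
    server keeps its identity; with no local snapshot it has to sync from
    the current leader when restarted.
    """
    src = os.path.join(data_dir, "version-2")
    if not os.path.isdir(src):
        raise FileNotFoundError(f"no version-2 directory under {data_dir}")

    dst = f"{src}.bak-{time.strftime('%Y%m%d-%H%M%S')}"
    if os.path.exists(dst):
        # Refuse to overwrite (or nest into) an earlier backup.
        raise FileExistsError(dst)

    shutil.move(src, dst)  # keep the old snapshots/logs for later inspection
    os.makedirs(src)       # empty directory for the restarted server
    return dst
```

For node one this would be called as move_aside_txn_data("/services/zookeeper/data/zookeeper") before restarting that server only; the other nodes' data directories are not touched.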

Patrick

On 06/02/2010 11:11 AM, Charity Majors wrote:
> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to get away from a client bug that was crashing my backend services.
>
> Unfortunately, this morning I had a server crash, and it brought down my entire cluster. I don't have the logs leading up to the crash, because (argghffbuggle) log4j wasn't set up correctly. But I restarted all three nodes, and nodes two and three came back up and formed a quorum.
>
> Node one, meanwhile, does this:
>
> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
> 2010-06-02 17:04:56,476 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id =  1, Proposed zxid = 47244640287
> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is less than our epoch b
> 2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
> java.io.IOException: Error: Epoch of leader is lower
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>
>
>
> All I can find is this, http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html, which implies that this state should never happen.
>
> Any suggestions? If it happens again, I'll just have to roll everything back to 3.2.1 and live with the client crashes.
>
>
>
>
