Hi Charity, this is certainly not expected. It would be very useful if
you could provide us with as much information about your issue as
possible. I would suggest that either you create a new JIRA and link
it to ZOOKEEPER-335, or that you add to 335 directly.
We'll be looking further into why you have seen this problem and
working on a fix.
Cheers,
-Flavio
On Jun 2, 2010, at 10:32 PM, Charity Majors wrote:
> Thanks. That worked for me. I'm a little confused about why it
> threw the entire cluster into an unusable state, though.
>
> I said before that we restarted all three nodes, but tracing back,
> we actually didn't. The zookeeper cluster was refusing all
> connections until we restarted node one. But once node one had been
> dropped from the cluster, the other two nodes formed a quorum and
> started responding to queries on their own.
>
> Is that expected as well? I didn't see it in ZOOKEEPER-335, so
> thought I'd mention it.
>
>
>
> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>
>> Hi Charity, unfortunately this is a known issue not specific to 3.3
>> that we are working to address. See this thread for some background:
>>
>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
>>
>> I've raised the JIRA level to "blocker" to ensure we address this asap.
>>
>> As Ted suggested, you can remove the datadir -- only on the affected
>> server -- and then restart it. That should resolve the issue (the
>> server will d/l a snapshot of the current db from the leader).
>>
>> Patrick
>>
>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in
>>> an attempt to get away from a client bug that was crashing my
>>> backend services.
>>>
>>> Unfortunately, this morning I had a server crash, and it brought
>>> down my entire cluster. I don't have the logs leading up to the
>>> crash, because -- argghffbuggle -- log4j wasn't set up correctly.
>>> But I restarted all three nodes, and nodes two and three came back
>>> up and formed a quorum.
>>>
>>> Node one, meanwhile, does this:
>>>
>>> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
>>> 2010-06-02 17:04:56,446 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
>>> 2010-06-02 17:04:56,476 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id = 1, Proposed zxid = 47244640287
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
>>> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is less than our epoch b
>>> 2010-06-02 17:04:56,486 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
>>> java.io.IOException: Error: Epoch of leader is lower
>>>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>>> 2010-06-02 17:04:56,486 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>> java.lang.Exception: shutdown Follower
>>>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>>>
>>>
>>>
>>> All I can find is this:
>>> http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html
>>> which implies that this state should never happen.
>>>
>>> Any suggestions? If it happens again, I'll just have to roll
>>> everything back to 3.2.1 and live with the client crashes.
>>>
>>>
>>>
>>>
>
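A note on the numbers in the quoted log: a ZooKeeper zxid is a 64-bit value whose high 32 bits hold the leader epoch and whose low 32 bits hold a counter within that epoch. The sketch below decodes the two zxids that appear in the election notifications. The helper name `split_zxid` is hypothetical and is not part of ZooKeeper; this only illustrates the arithmetic, not how the server itself detects the mismatch.

```python
from typing import Tuple


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter).

    Hypothetical helper for illustration; not a ZooKeeper API.
    """
    epoch = zxid >> 32            # high 32 bits: leader epoch
    counter = zxid & 0xFFFFFFFF   # low 32 bits: counter within the epoch
    return epoch, counter


# zxid proposed by node one ("Proposed zxid = 47244640287")
node_one = split_zxid(47244640287)
# zxid carried in the notifications from nodes two and three
others = split_zxid(38654707048)

print(f"node one:    epoch {node_one[0]:x}, counter {node_one[1]}")
print(f"other nodes: epoch {others[0]:x}, counter {others[1]}")
```

Node one's zxid decodes to epoch b, while the zxid reported by the other two nodes decodes to epoch 9. That looks consistent with the "Leader epoch a is less than our epoch b" message if the leader's new epoch is one past the epoch of its last zxid, though that reading is an assumption and is not confirmed anywhere in this thread.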