hadoop-zookeeper-user mailing list archives

From Travis Crawford <traviscrawf...@gmail.com>
Subject Re: zookeeper crash
Date Tue, 06 Jul 2010 23:58:23 GMT
Hey all -

I believe we just suffered an outage from this issue. Short version: while
restarting quorum members with the GC flags recommended on the
Troubleshooting wiki page, a follower logged messages similar to the
following jiras:

2010-07-06 23:14:01,438 - FATAL
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch 20 is
less than our epoch 21
2010-07-06 23:14:01,438 - WARN
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when
following the leader
java.io.IOException: Error: Epoch of leader is lower

    https://issues.apache.org/jira/browse/ZOOKEEPER-335
    https://issues.apache.org/jira/browse/ZOOKEEPER-790
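For anyone trying to make sense of the error message: my understanding
(worth double-checking against the real Follower code, this is just a
sketch of the shape of the check) is that the epoch is the high 32 bits of
the zxid, and a follower refuses to follow a leader whose epoch is behind
its own:

```java
import java.io.IOException;

// Sketch only -- not the actual ZooKeeper source, just the shape of the
// check in Follower.followLeader() that produces the FATAL above.
public class EpochCheck {

    // The epoch lives in the high 32 bits of a zxid; the low 32 bits are
    // the transaction counter within that epoch.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    static void checkLeaderEpoch(long leaderZxid, long ourZxid) throws IOException {
        if (epochOf(leaderZxid) < epochOf(ourZxid)) {
            // The condition the followers above are tripping on.
            throw new IOException("Error: Epoch of leader is lower");
        }
    }

    public static void main(String[] args) throws IOException {
        // zxid 47244640287 (0xB0000001F) decodes to epoch 0xb, counter 0x1f,
        // matching the "our epoch b" in the log excerpt down-thread.
        System.out.println(Long.toHexString(epochOf(47244640287L))); // prints "b"
        checkLeaderEpoch(47244640287L, 47244640287L);                // same epoch: ok
    }
}
```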

Reading through the jiras, it's unclear whether the issue is well understood
at this point (since there's a patch available) or still being
investigated.

If it's still being investigated, let me know and I can attach the
relevant log lines to the appropriate jira.

Or if the patch appears good I can make a new release and help test.
Let me know :)

--travis

On Wed, Jun 16, 2010 at 3:25 PM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
> I would recommend opening a separate jira issue. I'm not convinced the
> issues are the same, so I'd rather keep them separate and link the issues if
> it is the case.
>
> -Flavio
>
> On Jun 17, 2010, at 12:16 AM, Patrick Hunt wrote:
>
>> We are unable to reproduce this issue. If you can provide the server
>> logs (all servers) and attach them to the jira it would be very helpful.
>> Some detail on the approx time of the issue so we can correlate to the
>> logs would help too (summary of what you did/do to cause it, etc...
>> anything that might help us nail this one down).
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-335
>>
>> Some detail on ZK version, OS, Java version, HW info, etc... would also
>> be of use to us.
>>
>> Patrick
>>
>> On 06/16/2010 02:49 PM, Vishal K wrote:
>>>
>>> Hi,
>>>
>>> We are running into this bug very often (almost 60-75% hit rate) while
>>> testing our newly developed application over ZK. This is almost a blocker
>>> for us. Will the fix be simplified if backward compatibility was not an
>>> issue?
>>>
>>> Considering that this bug is rarely reported, I am wondering why we are
>>> running into this problem so often. Also, on a side note, I am curious
>>> why
>>> the systest that comes with ZooKeeper did not detect this bug. Can anyone
>>> please give an overview of the problem?
>>>
>>> Thanks.
>>> -Vishal
>>>
>>>
>>> On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors<charity@shopkick.com>
>>>  wrote:
>>>
>>>> Sure thing.
>>>>
>>>> We got paged this morning because backend services were not able to
>>>> write to the database.  Each server discovers the DB master using
>>>> zookeeper, so when zookeeper goes down, they assume they no longer know
>>>> who the DB master is and stop working.
>>>>
>>>> When we realized there were no problems with the database, we logged in
>>>> to the zookeeper nodes.  We weren't able to connect to zookeeper using
>>>> zkCli.sh from any of the three nodes, so we decided to restart them all,
>>>> starting with node one.  However, after restarting node one, the cluster
>>>> started responding normally again.
>>>>
>>>> (The timestamps on the zookeeper processes on nodes two and three *are*
>>>> dated today, but none of us restarted them.  We checked shell histories
>>>> and sudo logs, and they seem to back us up.)
>>>>
>>>> We tried getting node one to come back up and join the cluster, but
>>>> that's when we realized we weren't getting any logs, because
>>>> log4j.properties was in the wrong location.  Sorry -- I REALLY wish I
>>>> had those logs for you.  We put log4j back in place, and that's when we
>>>> saw the spew I pasted in my first message.
>>>>
>>>> I'll tack this on to ZK-335.
>>>>
>>>>
>>>>
>>>> On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:
>>>>
>>>>> charity, do you mind going through your scenario again to give a
>>>>> timeline for the failure? i'm a bit confused as to what happened.
>>>>>
>>>>> ben
>>>>>
>>>>> On 06/02/2010 01:32 PM, Charity Majors wrote:
>>>>>>
>>>>>> Thanks.  That worked for me.  I'm a little confused about why it
>>>>>> threw the entire cluster into an unusable state, though.
>>>>>>
>>>>>> I said before that we restarted all three nodes, but tracing back, we
>>>>>> actually didn't.  The zookeeper cluster was refusing all connections
>>>>>> until we restarted node one.  But once node one had been dropped from
>>>>>> the cluster, the other two nodes formed a quorum and started
>>>>>> responding to queries on their own.
>>>>>>
>>>>>> Is that expected as well?  I didn't see it in ZOOKEEPER-335, so I
>>>>>> thought I'd mention it.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>>>>>>
>>>>>>
>>>>>>> Hi Charity, unfortunately this is a known issue not specific to 3.3
>>>>>>> that we are working to address. See this thread for some background:
>>>>>>>
>>>>>>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
>>>>>>>
>>>>>>> I've raised the JIRA level to "blocker" to ensure we address this
>>>>>>> asap.
>>>>>>>
>>>>>>> As Ted suggested you can remove the datadir -- only on the affected
>>>>>>> server -- and then restart it. That should resolve the issue (the
>>>>>>> server will d/l a snapshot of the current db from the leader).
>>>>>>>
>>>>>>> Patrick
>>>>>>>
>>>>>>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>>>>>>
>>>>>>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in
>>>>>>>> an attempt to get away from a client bug that was crashing my
>>>>>>>> backend services.
>>>>>>>>
>>>>>>>> Unfortunately, this morning I had a server crash, and it brought
>>>>>>>> down my entire cluster.  I don't have the logs leading up to the
>>>>>>>> crash, because -- argghffbuggle -- log4j wasn't set up correctly.
>>>>>>>> But I restarted all three nodes, and nodes two and three came back
>>>>>>>> up and formed a quorum.
>>>>>>>>
>>>>>>>> Node one, meanwhile, does this:
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,446 - INFO
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,446 - INFO
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot
>>>>>>>> /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,476 - INFO
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New
>>>>>>>> election. My id =  1, Proposed zxid = 47244640287
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] -
>>>>>>>> Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] -
>>>>>>>> Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] -
>>>>>>>> Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created
>>>>>>>> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout
>>>>>>>> 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir
>>>>>>>> /services/zookeeper/data/zookeeper/version-2
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - FATAL
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is
>>>>>>>> less than our epoch b
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - WARN
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when
>>>>>>>> following the leader
>>>>>>>>
>>>>>>>> java.io.IOException: Error: Epoch of leader is lower
>>>>>>>>       at
>>>>>>>> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>>>>>>>>       at
>>>>>>>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>>>>>>>
>>>>>>>> java.lang.Exception: shutdown Follower
>>>>>>>>       at
>>>>>>>> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>>>>>>       at
>>>>>>>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>>>>>>>>
>>>>>>>>
>>>>>>>> All I can find is this,
>>>>>>>> http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html,
>>>>>>>> which implies that this state should never happen.
>>>>>>>>
>>>>>>>> Any suggestions?  If it happens again, I'll just have to roll
>>>>>>>> everything back to 3.2.1 and live with the client crashes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>
>
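One side note on Charity's question further up-thread ("Is that expected as
well?"): yes -- two of the three servers forming a working quorum on their
own is normal, since ZooKeeper only needs a strict majority of the
ensemble. A quick sketch of the arithmetic:

```java
// Majority-quorum arithmetic: an ensemble of n servers keeps serving as
// long as a strict majority of them can still talk to each other.
public class QuorumMath {

    // Smallest strict majority of the ensemble.
    static int quorumSize(int ensemble) {
        return ensemble / 2 + 1;
    }

    // How many servers can fail before the ensemble stops serving.
    static int tolerableFailures(int ensemble) {
        return ensemble - quorumSize(ensemble);   // floor((n - 1) / 2)
    }

    public static void main(String[] args) {
        // 3 servers: quorum of 2, tolerates 1 failure -- which is why
        // nodes two and three could keep answering queries without node one.
        System.out.println(quorumSize(3) + "," + tolerableFailures(3)); // prints "2,1"
    }
}
```

That's also why a 3-node ensemble can lose any single server (including a
wedged one) and keep serving.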
