Mailing-List: contact zookeeper-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: zookeeper-user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of traviscrawford@gmail.com
 designates 209.85.212.48 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :cc:content-type:content-transfer-encoding;
        b=rrlGJi2kn8ESZj7eBBJ4Mp9BVrYvCBME1UunM1rCkV77ax83Ebqe36p/kJnnU5orZW
         k8w6qCQ2svw/mlv5hBpmaTIGLK9Gt/fsAENapkfL48rs+8zTki8bX5Oe/P5Ix5nCkTKF
         0aUOCG4FQziZjgx3hxQF/H3YX6a9VH/aozzf0=
MIME-Version: 1.0
In-Reply-To: <ADDD5845-F570-4D41-9FB2-99CDE6713D33@yahoo-inc.com>
References: <500B7BF7-5981-4323-9E87-97BA15889341@shopkick.com>
	<4C06A7A7.6010609@apache.org>
	<14F28058-5BC2-4D9B-ABBA-1F354A3B963E@shopkick.com>
	<4C06E689.9040909@yahoo-inc.com>
	<0D07AD74-BFF4-4481-BCFF-F2F6F24DDA57@shopkick.com>
	<AANLkTimRebyEU-elbAeK5cswUmzPDCt8Y4t1I0ZDBkQi@mail.gmail.com>
	<4C194D26.9030105@apache.org>
	<ADDD5845-F570-4D41-9FB2-99CDE6713D33@yahoo-inc.com>
Date: Tue, 6 Jul 2010 16:58:23 -0700
Message-ID: <AANLkTilVWcguxyy2-9U7aCFRNxnDHWKgZ5EGaIhDHRd3@mail.gmail.com>
Subject: Re: zookeeper crash
From: Travis Crawford <traviscrawford@gmail.com>
To: zookeeper-user@hadoop.apache.org
Cc: Vishal K <vishalmlst@gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Hey all -

I believe we just suffered an outage from this issue. Short version is
while restarting quorum members with GC flags recommended in the
Troubleshooting wiki page a follower logged messages similar two the
following jiras:

2010-07-06 23:14:01,438 - FATAL
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch 20 is
less than our epoch 21
2010-07-06 23:14:01,438 - WARN
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when
following the leader
java.io.IOException: Error: Epoch of leader is lower

    https://issues.apache.org/jira/browse/ZOOKEEPER-335
    https://issues.apache.org/jira/browse/ZOOKEEPER-790

Reading through the jira's its unclear if the issue is well understood
at this point (as there's a patch available) or still being
understood.

If its still being understood let me know and I can attach the
relevant log lines to the appropriate jira.

Or if the patch appears good I can make a new release and help test.
Let me know :)

--travis


On Wed, Jun 16, 2010 at 3:25 PM, Flavio Junqueira <fpj@yahoo-inc.com> wrote=
:
> I would recommend opening a separate jira issue. I'm not convinced the
> issues are the same, so I'd rather keep them separate and link the issues=
 if
> it is the case.
>
> -Flavio
>
> On Jun 17, 2010, at 12:16 AM, Patrick Hunt wrote:
>
>> We are unable to reproduce this issue. If you can provide the server
>> logs (all servers) and attach them to the jira it would be very helpful.
>> Some detail on the approx time of the issue so we can correlate to the
>> logs would help too (summary of what you did/do to cause it, etc...
>> anything that might help us nail this one down).
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-335
>>
>> Some detail on ZK version, OS, Java version, HW info, etc... would also
>> be of use to us.
>>
>> Patrick
>>
>> On 06/16/2010 02:49 PM, Vishal K wrote:
>>>
>>> Hi,
>>>
>>> We are running into this bug very often (almost 60-75% hit rate) while
>>> testing our newly developed application over ZK. This is almost a block=
er
>>> for us. Will the fix be simplified if backward compatibility was not an
>>> issue?
>>>
>>> Considering that this bug is rarely reported, I am wondering why we are
>>> running into this problem so often. Also, on a side note, I am curious
>>> why
>>> the systest that comes with ZooKeeper did not detect this bug. Can anyo=
ne
>>> please give an overview of the problem?
>>>
>>> Thanks.
>>> -Vishal
>>>
>>>
>>> On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors<charity@shopkick.com>
>>> =A0wrote:
>>>
>>>> Sure thing.
>>>>
>>>> We got paged this morning because backend services were not able to
>>>> write
>>>> to the database. =A0Each server discovers the DB master using zookeepe=
r,
>>>> so
>>>> when zookeeper goes down, they assume they no longer know who the DB
>>>> master
>>>> is and stop working.
>>>>
>>>> When we realized there were no problems with the database, we logged i=
n
>>>> to
>>>> the zookeeper nodes. =A0We weren't able to connect to zookeeper using
>>>> zkCli.sh
>>>> from any of the three nodes, so we decided to restart them all, starti=
ng
>>>> with node one. =A0However, after restarting node one, the cluster star=
ted
>>>> responding normally again.
>>>>
>>>> (The timestamps on the zookeeper processes on nodes two and three *are=
*
>>>> dated today, but none of us restarted them. =A0We checked shell histor=
ies
>>>> and
>>>> sudo logs, and they seem to back us up.)
>>>>
>>>> We tried getting node one to come back up and join the cluster, but
>>>> that's
>>>> when we realized we weren't getting any logs, because log4j.properties
>>>> was
>>>> in the wrong location. =A0Sorry -- I REALLY wish I had those logs for =
you.
>>>> =A0We
>>>> put log4j back in place, and that's when we saw the spew I pasted in m=
y
>>>> first message.
>>>>
>>>> I'll tack this on to ZK-335.
>>>>
>>>>
>>>>
>>>> On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:
>>>>
>>>>> charity, do you mind going through your scenario again to give a
>>>>> timeline for the failure? i'm a bit confused as to what happened.
>>>>>
>>>>> ben
>>>>>
>>>>> On 06/02/2010 01:32 PM, Charity Majors wrote:
>>>>>>
>>>>>> Thanks. =A0That worked for me. =A0I'm a little confused about why it=
 threw
>>>>
>>>> the entire cluster into an unusable state, though.
>>>>>>
>>>>>> I said before that we restarted all three nodes, but tracing back, w=
e
>>>>
>>>> actually didn't. =A0The zookeeper cluster was refusing all connections
>>>> until
>>>> we restarted node one. =A0But once node one had been dropped from the
>>>> cluster,
>>>> the other two nodes formed a quorum and started responding to queries =
on
>>>> their own.
>>>>>>
>>>>>> Is that expected as well? =A0I didn't see it in ZOOKEEPER-335, so
>>>>>> thought
>>>>
>>>> I'd mention it.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>>>>>>
>>>>>>
>>>>>>> Hi Charity, unfortunately this is a known issue not specific to 3.3
>>>>
>>>> that
>>>>>>>
>>>>>>> we are working to address. See this thread for some background:
>>>>>>>
>>>>>>>
>>>>
>>>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761=
.html
>>>>>>>
>>>>>>> I've raised the JIRA level to "blocker" to ensure we address this
>>>>>>> asap.
>>>>>>>
>>>>>>> As Ted suggested you can remove the datadir -- only on the effected
>>>>>>> server -- and then restart it. That should resolve the issue (the
>>>>
>>>> server
>>>>>>>
>>>>>>> will d/l a snapshot of the current db from the leader).
>>>>>>>
>>>>>>> Patrick
>>>>>>>
>>>>>>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>>>>>>
>>>>>>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in =
an
>>>>
>>>> attempt to get away from a client bug that was crashing my backend
>>>> services.
>>>>>>>>
>>>>>>>> Unfortunately, this morning I had a server crash, and it brought
>>>>>>>> down
>>>>
>>>> my entire cluster. =A0I don't have the logs leading up to the crash,
>>>> because
>>>> -- argghffbuggle -- log4j wasn't set up correctly. =A0But I restarted =
all
>>>> three nodes, and odes two and three came back up and formed a quorum.
>>>>>>>>
>>>>>>>> Node one, meanwhile, does this:
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,446 - INFO
>>>>
>>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,446 - INFO
>>>>
>>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot
>>>> /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,476 - INFO
>>>>
>>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New
>>>> election.
>>>> My id =3D =A01, Proposed zxid =3D 47244640287
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>
>>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] -
>>>> Notification:
>>>> 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>
>>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] -
>>>> Notification:
>>>> 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>
>>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] -
>>>> Notification:
>>>> 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>
>>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>
>>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created se=
rver
>>>> with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
>>>> datadir
>>>> /services/zookeeper/data/zookeeper/version-2 snapdir
>>>> /services/zookeeper/data/zookeeper/version-2
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - FATAL
>>>>
>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is les=
s
>>>> than our epoch b
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - WARN
>>>>
>>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when
>>>> following
>>>> the leader
>>>>>>>>
>>>>>>>> java.io.IOException: Error: Epoch of leader is lower
>>>>>>>> =A0 =A0 =A0 at
>>>>
>>>>
>>>> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java=
:73)
>>>>>>>>
>>>>>>>> =A0 =A0 =A0 at
>>>>
>>>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>>>>>>>>
>>>>>>>> 2010-06-02 17:04:56,486 - INFO
>>>>
>>>> =A0[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>>>>>>>
>>>>>>>> java.lang.Exception: shutdown Follower
>>>>>>>> =A0 =A0 =A0 at
>>>>
>>>> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166=
)
>>>>>>>>
>>>>>>>> =A0 =A0 =A0 at
>>>>
>>>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> All I can find is this,
>>>>
>>>>
>>>> http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg004=
49.html,
>>>> which implies that this state should never happen.
>>>>>>>>
>>>>>>>> Any suggestions? =A0If it happens again, I'll just have to roll
>>>>
>>>> everything back to 3.2.1 and live with the client crashes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>
>