From: Charity Majors
To: zookeeper-user@hadoop.apache.org
Date: Wed, 2 Jun 2010 17:17:37 -0700
Subject: Re: zookeeper crash

Sure thing.

We got paged this morning because backend services were not able to write to the database. Each server discovers the DB master using zookeeper, so when zookeeper goes down, they assume they no longer know who the DB master is and stop working.

When we realized there were no problems with the database, we logged in to the zookeeper nodes. We weren't able to connect to zookeeper using zkCli.sh from any of the three nodes, so we decided to restart them all, starting with node one. However, after restarting node one, the cluster started responding normally again.

(The timestamps on the zookeeper processes on nodes two and three *are* dated today, but none of us restarted them. We checked shell histories and sudo logs, and they seem to back us up.)

We tried getting node one to come back up and join the cluster, but that's when we realized we weren't getting any logs, because log4j.properties was in the wrong location. Sorry -- I REALLY wish I had those logs for you. We put log4j back in place, and that's when we saw the spew I pasted in my first message.

I'll tack this on to ZK-335.
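In case it helps with the timeline: the discovery logic in each backend boils down to something like the sketch below (simplified; the znode path, class name, and timeout are illustrative, not our production code). When the whole ensemble stops answering, currentMaster() throws on every backend at once, and they all stop writing -- which is exactly the page we got.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DbMasterLocator implements Watcher {
    private static final String MASTER_PATH = "/db/master";  // illustrative path
    private final ZooKeeper zk;

    public DbMasterLocator(String connectString) throws Exception {
        // 10s session timeout; this object receives watch events.
        zk = new ZooKeeper(connectString, 10000, this);
    }

    // Read "host:port" of the current DB master, leaving a watch set so we
    // hear about changes. If the ensemble is unreachable, this throws and
    // the backend stops accepting writes.
    public String currentMaster() throws Exception {
        byte[] data = zk.getData(MASTER_PATH, true, null);
        return new String(data, "UTF-8");
    }

    public void process(WatchedEvent event) {
        // On NodeDataChanged or a disconnect, callers re-read the master znode.
    }
}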
On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:

> charity, do you mind going through your scenario again to give a
> timeline for the failure? i'm a bit confused as to what happened.
>
> ben
>
> On 06/02/2010 01:32 PM, Charity Majors wrote:
>> Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state, though.
>>
>> I said before that we restarted all three nodes, but tracing back, we actually didn't. The zookeeper cluster was refusing all connections until we restarted node one. But once node one had been dropped from the cluster, the other two nodes formed a quorum and started responding to queries on their own.
>>
>> Is that expected as well? I didn't see it in ZOOKEEPER-335, so thought I'd mention it.
>>
>> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>>
>>> Hi Charity, unfortunately this is a known issue, not specific to 3.3, that
>>> we are working to address. See this thread for some background:
>>>
>>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
>>>
>>> I've raised the JIRA level to "blocker" to ensure we address this asap.
>>>
>>> As Ted suggested, you can remove the datadir -- only on the affected
>>> server -- and then restart it. That should resolve the issue (the server
>>> will d/l a snapshot of the current db from the leader).
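>>> Concretely, the recovery is something like the following (a sketch, not
>>> exact commands: the dataDir is taken from the log output in your first
>>> message, and the zkServer.sh paths assume a stock install):
>>>
>>>   # On the affected server ONLY; leave the healthy quorum members alone.
>>>   bin/zkServer.sh stop
>>>   # Move the data dir aside instead of deleting it, so nothing is lost.
>>>   mv /services/zookeeper/data/zookeeper/version-2 \
>>>      /services/zookeeper/data/zookeeper/version-2.bad
>>>   bin/zkServer.sh start
>>>   # On restart the server pulls a fresh snapshot from the current leader.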
>>>
>>> Patrick
>>>
>>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>>
>>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to get away from a client bug that was crashing my backend services.
>>>>
>>>> Unfortunately, this morning I had a server crash, and it brought down my entire cluster. I don't have the logs leading up to the crash, because -- argghffbuggle -- log4j wasn't set up correctly. But I restarted all three nodes, and nodes two and three came back up and formed a quorum.
>>>>
>>>> Node one, meanwhile, does this:
>>>>
>>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
>>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
>>>> 2010-06-02 17:04:56,476 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id = 1, Proposed zxid = 47244640287
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
>>>> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is less than our epoch b
>>>> 2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
>>>> java.io.IOException: Error: Epoch of leader is lower
>>>>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>>> java.lang.Exception: shutdown Follower
>>>>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>>>>
>>>> All I can find is this, http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html, which implies that this state should never happen.
>>>>
>>>> Any suggestions? If it happens again, I'll just have to roll everything back to 3.2.1 and live with the client crashes.
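>>>>
>>>> For what it's worth, my reading of the check we're tripping (a simplified
>>>> sketch of the follower-side logic, not the actual Follower.java source) is:
>>>>
>>>> // Simplified sketch; method and variable names are illustrative.
>>>> // A zxid is 64 bits: the high 32 bits are the epoch, the low 32 a counter.
>>>> static void checkLeaderEpoch(long leaderZxid, long lastLoggedZxid)
>>>>         throws java.io.IOException {
>>>>     long leaderEpoch = leaderZxid >> 32L;
>>>>     long myEpoch = lastLoggedZxid >> 32L;
>>>>     if (leaderEpoch < myEpoch) {
>>>>         // e.g. leader epoch 0xa vs our epoch 0xb, as in the FATAL line above
>>>>         throw new java.io.IOException("Error: Epoch of leader is lower");
>>>>     }
>>>> }
>>>>
>>>> If that's right, node one has logged transactions under epoch 0xb while the
>>>> other two nodes still agree on a leader from epoch 0xa, so it refuses to follow.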