hadoop-zookeeper-user mailing list archives

From Vishal K <vishalm...@gmail.com>
Subject Re: zookeeper crash
Date Wed, 16 Jun 2010 21:49:35 GMT
Hi,

We are running into this bug very often (roughly a 60-75% hit rate) while
testing our newly developed application on ZK. This is almost a blocker
for us. Would the fix be simpler if backward compatibility were not an
issue?

Considering that this bug is rarely reported, I am wondering why we are
running into this problem so often. Also, on a side note, I am curious why
the systest that comes with ZooKeeper did not detect this bug. Can anyone
please give an overview of the problem?

Thanks.
-Vishal


On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors <charity@shopkick.com> wrote:

> Sure thing.
>
> We got paged this morning because backend services were not able to write
> to the database.  Each server discovers the DB master using zookeeper, so
> when zookeeper goes down, they assume they no longer know who the DB master
> is and stop working.
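>
> (As a rough, hypothetical sketch of that discovery pattern -- the znode path
> /db/master and the details are invented for illustration, not our actual code --
> each backend reads the master's address from a znode, re-reads it when the znode
> changes, and stops trusting its cached value if the ZooKeeper session is lost:)
>
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> public class DbMasterDiscovery implements Watcher {
>     private static final String MASTER_ZNODE = "/db/master";  // hypothetical path
>     private final ZooKeeper zk;
>     private volatile String masterAddress;                    // null means "don't write"
>
>     public DbMasterDiscovery(String zkHosts) throws Exception {
>         zk = new ZooKeeper(zkHosts, 30000, this);
>         refresh();
>     }
>
>     // Read the master's address and re-register the watch for the next change.
>     private void refresh() throws Exception {
>         byte[] data = zk.getData(MASTER_ZNODE, this, null);
>         masterAddress = new String(data, "UTF-8");
>     }
>
>     @Override
>     public void process(WatchedEvent event) {
>         try {
>             if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
>                 refresh();
>             } else if (event.getState() == Watcher.Event.KeeperState.Disconnected
>                     || event.getState() == Watcher.Event.KeeperState.Expired) {
>                 masterAddress = null;   // lost ZooKeeper, stop trusting the cached master
>             }
>         } catch (Exception e) {
>             masterAddress = null;
>         }
>     }
>
>     public String currentMaster() {
>         return masterAddress;
>     }
> }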
>
> When we realized there were no problems with the database, we logged in to
> the zookeeper nodes.  We weren't able to connect to zookeeper using zkCli.sh
> from any of the three nodes, so we decided to restart them all, starting
> with node one.  However, after restarting node one, the cluster started
> responding normally again.
>
> (The timestamps on the zookeeper processes on nodes two and three *are*
> dated today, but none of us restarted them.  We checked shell histories and
> sudo logs, and they seem to back us up.)
>
> We tried getting node one to come back up and join the cluster, but that's
> when we realized we weren't getting any logs, because log4j.properties was
> in the wrong location.  Sorry -- I REALLY wish I had those logs for you.  We
> put log4j back in place, and that's when we saw the spew I pasted in my
> first message.
>
> I'll tack this on to ZK-335.
>
>
>
> On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:
>
> > charity, do you mind going through your scenario again to give a
> > timeline for the failure? i'm a bit confused as to what happened.
> >
> > ben
> >
> > On 06/02/2010 01:32 PM, Charity Majors wrote:
> >> Thanks.  That worked for me.  I'm a little confused about why it threw
> >> the entire cluster into an unusable state, though.
> >>
> >> I said before that we restarted all three nodes, but tracing back, we
> >> actually didn't.  The zookeeper cluster was refusing all connections until
> >> we restarted node one.  But once node one had been dropped from the cluster,
> >> the other two nodes formed a quorum and started responding to queries on
> >> their own.
> >>
> >> Is that expected as well?  I didn't see it in ZOOKEEPER-335, so thought
> >> I'd mention it.
> >>
> >>
> >>
> >> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
> >>
> >>
> >>> Hi Charity, unfortunately this is a known issue not specific to 3.3
> >>> that we are working to address. See this thread for some background:
> >>>
> >>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
> >>>
> >>> I've raised the JIRA level to "blocker" to ensure we address this asap.
> >>>
> >>> As Ted suggested you can remove the datadir -- only on the affected
> >>> server -- and then restart it. That should resolve the issue (the
> >>> server will d/l a snapshot of the current db from the leader).
> >>>
> >>> Patrick
> >>>
> >>> On 06/02/2010 11:11 AM, Charity Majors wrote:
> >>>
> >>>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an
> >>>> attempt to get away from a client bug that was crashing my backend services.
> >>>>
> >>>> Unfortunately, this morning I had a server crash, and it brought down
> >>>> my entire cluster.  I don't have the logs leading up to the crash, because
> >>>> -- argghffbuggle -- log4j wasn't set up correctly.  But I restarted all
> >>>> three nodes, and nodes two and three came back up and formed a quorum.
> >>>>
> >>>> Node one, meanwhile, does this:
> >>>>
> >>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
> >>>> 2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /services/zookeeper/data/zookeeper/version-2/snapshot.a00000045
> >>>> 2010-06-02 17:04:56,476 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id =  1, Proposed zxid = 47244640287
> >>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
> >>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, LEADING, 3
> >>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
> >>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
> >>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /services/zookeeper/data/zookeeper/version-2 snapdir /services/zookeeper/data/zookeeper/version-2
> >>>> 2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@71] - Leader epoch a is less than our epoch b
> >>>> 2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
> >>>> java.io.IOException: Error: Epoch of leader is lower
> >>>>        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
> >>>>        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
> >>>> 2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> >>>> java.lang.Exception: shutdown Follower
> >>>>        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> >>>>        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
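> >>>>
> >>>> (For reference, that FATAL line is the follower's sanity check on the
> >>>> leader's epoch, which is the high 32 bits of the 64-bit zxid. A rough,
> >>>> hypothetical sketch of that comparison -- not the actual ZooKeeper
> >>>> Follower source, just an illustration of what the check does:)
> >>>>
> >>>> import java.io.IOException;
> >>>>
> >>>> class EpochCheckSketch {
> >>>>     // In ZooKeeper, the epoch lives in the high 32 bits of the zxid.
> >>>>     static long epochOf(long zxid) {
> >>>>         return zxid >> 32;
> >>>>     }
> >>>>
> >>>>     // A follower refuses to follow a leader whose epoch is lower than
> >>>>     // the epoch recorded in its own log/snapshot.
> >>>>     static void checkLeaderEpoch(long leaderZxid, long myLastLoggedZxid) throws IOException {
> >>>>         if (epochOf(leaderZxid) < epochOf(myLastLoggedZxid)) {
> >>>>             throw new IOException("Error: Epoch of leader is lower");
> >>>>         }
> >>>>     }
> >>>>
> >>>>     public static void main(String[] args) throws IOException {
> >>>>         // Hypothetical values: a leader at epoch 0xa and a follower whose
> >>>>         // own snapshot says epoch 0xb, as in the log above, trip the check.
> >>>>         checkLeaderEpoch(0xa00000000L, 0xb0000001fL);
> >>>>     }
> >>>> }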
> >>>>
> >>>>
> >>>>
> >>>> All I can find is this,
> >>>> http://www.mail-archive.com/zookeeper-commits@hadoop.apache.org/msg00449.html,
> >>>> which implies that this state should never happen.
> >>>>
> >>>> Any suggestions?  If it happens again, I'll just have to roll
> >>>> everything back to 3.2.1 and live with the client crashes.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >
>
>
