From: Patrick Hunt
Date: Fri, 26 May 2017 11:45:01 -0700
Subject: Re: Recovering from zxid rollover
To: user@zookeeper.apache.org

On Wed, May 24, 2017 at 8:08 AM, Mike Heffner wrote:

> On Tue, May 23, 2017 at 10:21 PM, Patrick Hunt wrote:
>
> > On Tue, May 23, 2017 at 3:47 PM, Mike Heffner wrote:
> >
> > > Hi,
> > >
> > > I'm curious what the best practices are for handling zxid rollover
> > > in a ZK ensemble. We have a few five-node ZK ensembles (some 3.4.8
> > > and some 3.3.6) and they periodically roll over their zxid. We see
> > > the following in the system logs on the leader node:
> > >
> > > 2017-05-22 12:54:14,117 [myid:15] - ERROR [ProcessThread(sid:15
> > > cport:-1)::ZooKeeperCriticalThread@49] - Severe unrecoverable error,
> > > from thread : ProcessThread(sid:15 cport:-1):
> > > org.apache.zookeeper.server.RequestProcessor$RequestProcessorException:
> > > zxid lower 32 bits have rolled over, forcing re-election, and
> > > therefore new epoch start
> > >
> > > From my best understanding of the code, this exception will end up
> > > causing the leader to enter shutdown():
> > >
> > > https://github.com/apache/zookeeper/blob/09cd5db55446a4b390f82e3548b929f19e33430d/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java#L464-L464
> > >
> > > This stops the ZooKeeper instance from servicing requests, but the
> > > JVM is still actually running.
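For reference, the zxid quoted in that error is a single 64-bit number: the
leader epoch lives in the upper 32 bits and a per-epoch transaction counter
in the lower 32 bits, so the counter wraps after roughly 4.29 billion
transactions in one epoch. A minimal sketch of that layout (plain Java; the
class name is made up and this is not ZooKeeper API):

    // Minimal sketch (not ZooKeeper API): decompose a 64-bit zxid into the
    // leader epoch (upper 32 bits) and the per-epoch counter (lower 32 bits).
    public class ZxidLayout {
        public static void main(String[] args) {
            // e.g. "0xf00000001" -> epoch 15, counter 1
            String arg = args.length > 0 ? args[0] : "0xf00000001";
            long zxid = Long.parseUnsignedLong(arg.replace("0x", ""), 16);

            long epoch    = zxid >>> 32;           // upper 32 bits: leader epoch
            long counter  = zxid & 0xffffffffL;    // lower 32 bits: rolls over
            long headroom = (1L << 32) - counter;  // txns left before the error above

            System.out.printf("epoch=%d counter=%d headroom=%d%n",
                    epoch, counter, headroom);
        }
    }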
> > > What we experience is that while this ZK instance is still running,
> > > the remaining follower nodes can't re-elect a leader (at least within
> > > 15 mins) and quorum is offline. Our remediation so far has been to
> > > restart the original leader node, at which point the cluster recovers.
> > >
> > > The two questions I have are:
> > >
> > > 1. Should the remaining 4 nodes be able to re-elect a leader after
> > > zxid rollover without intervention (restarting)?
> >
> > Hi Mike.
> >
> > That is the intent. Originally the epoch would roll over and cause the
> > cluster to hang (similar to what you are reporting); the JIRA is here:
> > https://issues.apache.org/jira/browse/ZOOKEEPER-1277
> > However the patch, calling shutdown on the leader, was intended to
> > force a re-election before the epoch could roll over.
>
> Should the leader JVM actually exit during this shutdown, thereby
> allowing the init system to restart it?

IIRC it should not be necessary, but it's been some time since I looked at it.

> > > 2. If the leader enters shutdown() state after a zxid rollover, is
> > > there any scenario where it will return to started? If not, how are
> > > others handling this scenario -- maybe a healthcheck that
> > > kills/restarts an instance that is in shutdown state?
> >
> > I have run into very few people who have seen the zxid rollover, and
> > testing under real conditions is not easily done. We have unit tests,
> > but that code is just not exercised sufficiently in everyday use. Since
> > you're not seeing what's intended, please create a JIRA and include any
> > additional details you can (e.g. config, logs).
>
> Sure, I've opened one here:
> https://issues.apache.org/jira/browse/ZOOKEEPER-2791
>
> > What I heard people (well, really one user; I have personally only seen
> > this at one site) were doing prior to 1277 was monitoring the epoch
> > number, and when it got close to rolling over (within 10%, say) they
> > would force the current leader to restart by restarting the process.
> > The intent of 1277 was to effectively do this automatically.
>
> We are looking at doing something similar, maybe once a week finding the
> current leader and restarting it. From testing, this quickly re-elects a
> new leader and resets the zxid to zero, so it should avoid the rollover
> that occurs after a few weeks of uptime.

Exactly. This is pretty much the same scenario that I've seen in the past,
along with a similar workaround.

You might want to take a look at the work Benedict Jin has done here:
https://issues.apache.org/jira/browse/ZOOKEEPER-2789
Given you are seeing this so frequently, it might be something you could
collaborate on with the author of the patch? I have not looked at it in
great detail, but it may allow you to run longer w/o seeing the issue. I
have not thought through all the implications though... (including b/w
compat).

Patrick

> > Patrick
> >
> > > Cheers,
> > >
> > > Mike
>
> Mike
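For anyone who wants to automate the workaround discussed above (watch the
zxid counter and proactively restart the leader before it wraps), here is a
rough sketch that queries a server with the "srvr" four-letter word and
reports its mode plus how much of the 32-bit counter the current epoch has
used. It assumes the four-letter words are reachable on the client port; the
class name and any alerting threshold are hypothetical:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    // Rough sketch (hypothetical class name): send the "srvr" four-letter
    // word to a ZooKeeper server and report its mode and zxid counter usage.
    public class ZxidRolloverCheck {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;

            try (Socket sock = new Socket(host, port)) {
                OutputStream out = sock.getOutputStream();
                out.write("srvr".getBytes(StandardCharsets.US_ASCII));
                out.flush();

                BufferedReader in = new BufferedReader(
                        new InputStreamReader(sock.getInputStream(), StandardCharsets.US_ASCII));
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("Mode:")) {
                        // leader / follower / standalone
                        System.out.println(host + ":" + port + " " + line.trim());
                    } else if (line.startsWith("Zxid:")) {
                        long zxid = Long.parseUnsignedLong(
                                line.substring(line.indexOf("0x") + 2).trim(), 16);
                        long counter = zxid & 0xffffffffL;        // lower 32 bits
                        double used = counter / (double) (1L << 32);
                        System.out.printf("zxid=0x%x epoch=%d counter used=%.1f%%%n",
                                zxid, zxid >>> 32, used * 100);
                        // e.g. a cron job could restart the leader once 'used'
                        // passes some threshold like 90%
                    }
                }
            }
        }
    }

Run it against each member of the ensemble; the node reporting "Mode: leader"
is the one to restart once the counter gets close to wrapping.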