aurora-dev mailing list archives

From Joshua Cohen <jco...@apache.org>
Subject Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)
Date Wed, 24 Aug 2016 18:20:52 GMT
I have this enabled in a test cluster and have not noticed any issues with
it yet. I'd like to roll it out to production before we drop the old code
though.

On Wed, Aug 24, 2016 at 1:10 PM, Zameer Manji <zmanji@apache.org> wrote:

> Could we change the default and drop the old code at the same time? I don't
> see any benefit of letting that hang around.
>
> I have not tested this code yet, but I hope to do it soon.
>
> On Wed, Aug 24, 2016 at 5:19 AM, Erb, Stephan <Stephan.Erb@blue-yonder.com> wrote:
>
> > The curator backend has been working well for us so far. I believe it is
> > safe to make it the default for the next release, and to drop the old code
> > in the release after that.
> >
> >
> >
> > *From: *John Sirois <jsirois@apache.org>
> > *Reply-To: *"user@aurora.apache.org" <user@aurora.apache.org>, "
> > jsirois@apache.org" <jsirois@apache.org>
> > *Date: *Thursday 7 July 2016 at 01:13
> > *To: *Martin Hrabovčin <martin.hrabovcin@gmail.com>
> > *Cc: *"dev@aurora.apache.org" <dev@aurora.apache.org>, Jake Farrell <
> > jfarrell@apache.org>, "user@aurora.apache.org" <user@aurora.apache.org>
> > *Subject: *Re: [FEEDBACK] Transitioning Aurora leader election to Apache
> > Curator (`-zk_use_curator`)
> >
> >
> >
> > Now that 0.15.0 has been released, I thought I'd check in on any progress
> > folks have made with testing/deploying 0.14.0+ with the Aurora scheduler
> > `-zk_use_curator` flag in place.
> >
> > There has been 1 fix that will go out in the 0.16.0 release to reduce
> > logger noise on shutdown [1][2] but I have heard no negative (or positive)
> > feedback otherwise.
> >
> >
> >
> > [1] https://issues.apache.org/jira/browse/AURORA-1729
> >
> > [2] https://reviews.apache.org/r/49578/
> >
> >
> >
> > On Thu, Jun 16, 2016 at 1:18 PM, John Sirois <jsirois@apache.org> wrote:
> >
> >
> >
> >
> >
> > On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin <
> > martin.hrabovcin@gmail.com> wrote:
> >
> > How should this flag be rolled out to an existing running cluster? Can it
> > be done with a rolling update, instance by instance, or do we need to stop
> > the whole cluster and then bring all nodes back up with the new flag?
> >
> >
> >
> > I recommend taking the whole cluster down, upgrading and adding the new
> > flag, then bringing it back up.
> >
> >
> >
> > A rolling update should work, but will likely be rocky.  My analysis:
> >
> >
> >
> > The Aurora leader election consists of 2 components: the actual leader
> > election and the resulting advertisement by the leader of itself as the
> > Aurora service endpoint.  These 2 components each use ZooKeeper, and of the
> > 2 I only ensured that the advertisement was compatible with old releases
> > (old clients). The leader election portion is completely internal to the
> > Aurora scheduler instances vying for leadership and, under Curator, uses a
> > different (enhanced) ZooKeeper node scheme.  As a result, this is what
> > could happen in a slow roll:
> >
> >
> >
> > before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
> >
> > upgrade 0: new-lead, 1: old-lead, 2: old-follow
> >
> >
> >
> > Here, node 0 will see itself as leader and nodes 1 and 2 will see node 1
> > as leader. The result will be both node 0 and node 1 attempting to read the
> > Mesos distributed log.  Now the log uses its own leader election and the
> > reader must be the leader as things stand, so the Aurora-level leadership
> > "tie" will be broken by one of the 2 Aurora-level leaders failing to become
> > the Mesos distributed log leader, and that node will restart its lifecycle,
> > i.e. flap.  This will continue to be the case with the second node upgrade
> > and will not stabilize until the 3rd node is upgraded.
> >
> >
> >
> >
> >
> > 2016-06-16 5:03 GMT+02:00 Jake Farrell <jfarrell@apache.org>:
> >
> > +1, will enable on our test clusters to help verify
> >
> > -Jake
> >
> >
> > On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <jsirois@apache.org> wrote:
> >
> > > I'd like to move forward with
> > > https://issues.apache.org/jira/browse/AURORA-1669 asap, i.e. removing
> > > legacy (Twitter) commons zookeeper libraries used for Aurora leader
> > > election in favor of Apache Curator libraries. The change submitted in
> > > https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and
> > > Apache Curator based service discovery can be enabled with the Aurora
> > > scheduler flag `-zk_use_curator`.  I'd like feedback from users who enable
> > > this option.  If you have a test cluster where you can enable
> > > `-zk_use_curator` and exercise leader failure and failover, I'd be
> > > grateful. If you have moved to using this option in production with
> > > demonstrable improvements or even maintenance of status quo, I'd also be
> > > grateful for this news. If you've found regressions or new bugs, I'd love
> > > to know about those as well.
> > >
> > > Thanks in advance to all those who find time to test this out on real
> > > systems!
> > >
> >
> >
> >
> >
> >
> >
> >
> >
>
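
For anyone weighing a rolling update against a full stop, the slow-roll
scenario John describes in the quoted message can be modelled with a toy
simulation. This is only a sketch under stated assumptions: the function
names `elect` and `slow_roll` are made up for illustration, and the rule
"the lowest-numbered participant in each scheme wins" is a stand-in, not how
ZooKeeper or Curator actually picks a leader. The only thing it models is
that old and new schedulers run separate, mutually invisible elections.

```python
from typing import Dict, List


def elect(backends: List[str]) -> Dict[str, int]:
    """Return, per election scheme, the node index that scheme sees as leader.

    Toy rule (an assumption for illustration only): within a scheme, the
    lowest-numbered participant wins. Schemes cannot see each other.
    """
    leaders: Dict[str, int] = {}
    for node, backend in enumerate(backends):
        # The first node seen for a scheme becomes that scheme's leader.
        leaders.setdefault(backend, node)
    return leaders


def slow_roll(size: int = 3) -> List[Dict[str, int]]:
    """Upgrade nodes 0..size-1 one at a time and record leaders per step."""
    backends = ["old"] * size
    history = [elect(backends)]
    for node in range(size):
        backends[node] = "new"  # this node restarts with the new flag
        history.append(elect(backends))
    return history


if __name__ == "__main__":
    for step, leaders in enumerate(slow_roll()):
        # More than one entry means two schedulers both believe they lead.
        state = "single leader" if len(leaders) == 1 else "two leaders, expect flapping"
        print(f"after {step} upgrade(s): {leaders} -> {state}")
```

In this model the cluster has two self-declared leaders after the first and
second upgrades and only returns to a single leader once all three nodes run
the new scheme, which matches the analysis quoted above.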
