aurora-dev mailing list archives

From Zameer Manji <zma...@apache.org>
Subject Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)
Date Mon, 29 Aug 2016 20:57:47 GMT
I managed to deploy this code in a test cluster and observed no issues.

I still advocate for dropping the old code when we change the default but I
understand concerns that it is risky.

On Mon, Aug 29, 2016 at 1:39 PM, John Sirois <jsirois@apache.org> wrote:

> Thanks for the feedback folks! I'll post a flag default switch shortly.
>
> On Wed, Aug 24, 2016 at 12:20 PM, Joshua Cohen <jcohen@apache.org> wrote:
>
> > I have this enabled in a test cluster and have not noticed any issues
> > with it yet. I'd like to roll it out to production before we drop the
> > old code though.
> >
>
> Agreed.  This deserves caution, and fwict the jvm leader code is ~never in
> the refactor path; so even though I too am eager to delete the code, it is
> not an active refactoring burden.
>
>
> > On Wed, Aug 24, 2016 at 1:10 PM, Zameer Manji <zmanji@apache.org> wrote:
> >
> >> Could we change the default and drop the old code at the same time? I
> >> don't see any benefit of letting that hang around.
> >>
> >> I have not tested this code yet, but I hope to do it soon.
> >>
> >> On Wed, Aug 24, 2016 at 5:19 AM, Erb, Stephan
> >> <Stephan.Erb@blue-yonder.com> wrote:
> >>
> >> > The curator backend has been working well for us so far. I believe it
> >> > is safe to make it the default for the next release, and to drop the
> >> > old code in the release after that.
> >> >
> >> >
> >> >
> >> > *From: *John Sirois <jsirois@apache.org>
> >> > *Reply-To: *"user@aurora.apache.org" <user@aurora.apache.org>,
> >> > "jsirois@apache.org" <jsirois@apache.org>
> >> > *Date: *Thursday 7 July 2016 at 01:13
> >> > *To: *Martin Hrabovčin <martin.hrabovcin@gmail.com>
> >> > *Cc: *"dev@aurora.apache.org" <dev@aurora.apache.org>, Jake Farrell
> >> > <jfarrell@apache.org>, "user@aurora.apache.org"
> >> > <user@aurora.apache.org>
> >> > *Subject: *Re: [FEEDBACK] Transitioning Aurora leader election to
> >> > Apache Curator (`-zk_use_curator`)
> >> >
> >> >
> >> >
> >> > Now that 0.15.0 has been released, I thought I'd check in on any
> >> > progress folks have made with testing/deploying the 0.14.0+ with the
> >> > Aurora Scheduler `-zk_use_curator` flag in-place.
> >> >
> >> > There has been 1 fix that will go out in the 0.16.0 release to reduce
> >> > logger noise on shutdown [1][2] but I have heard no negative (or
> >> > positive) feedback otherwise.
> >> >
> >> >
> >> >
> >> > [1] https://issues.apache.org/jira/browse/AURORA-1729
> >> >
> >> > [2] https://reviews.apache.org/r/49578/
> >> >
> >> >
> >> >
> >> > On Thu, Jun 16, 2016 at 1:18 PM, John Sirois <jsirois@apache.org> wrote:
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin
> >> > <martin.hrabovcin@gmail.com> wrote:
> >> >
> >> > How should this flag be rolled out to an existing running cluster? Can
> >> > it be done using a rolling update, instance by instance, or do we need
> >> > to stop the whole cluster and then bring all nodes up with the new
> >> > flag?
> >> >
> >> >
> >> >
> >> > I recommend a whole cluster down, upgrade +  new flag, up.
> >> >
> >> >
> >> >
> >> > A rolling update should work, but will likely be rocky.  My analysis:
> >> >
> >> >
> >> >
> >> > The Aurora leader election consists of 2 components, the actual leader
> >> > election and the resulting advertisement by the leader of itself as
> >> > the Aurora service endpoint.  These 2 components each use zookeeper
> >> > and of the 2 I only ensured that the advertisement was compatible with
> >> > old releases (old clients). The leader election portion is completely
> >> > internal to the Aurora scheduler instances vying for leadership and,
> >> > under Curator, uses a different (enhanced) zookeeper node scheme.  As a
> >> > result, this is what could happen in a slow roll:
> >> >
> >> >
> >> >
> >> > before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
> >> >
> >> > upgrade 0: new-lead, 1: old-lead, 2: old-follow
> >> >
> >> >
> >> >
> >> > Here, node 0 will see itself as leader and nodes 1 and 2 will see
> >> > node 1 as leader. The result will be both node 0 and node 1 attempting
> >> > to read the mesos distributed log.  Now the log uses its own leader
> >> > election and the reader must be the leader as things stand, so the
> >> > Aurora-level leadership "tie" will be broken by one of the 2
> >> > Aurora-level leaders failing to become the mesos distributed log
> >> > leader, and that node will restart its lifecycle - ie flap.  This will
> >> > continue to be the case with the second node upgrade and will not
> >> > stabilize until the 3rd node is upgraded.
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > 2016-06-16 5:03 GMT+02:00 Jake Farrell <jfarrell@apache.org>:
> >> >
> >> > +1, will enable on our test clusters to help verify
> >> >
> >> > -Jake
> >> >
> >> >
> >> > On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <jsirois@apache.org> wrote:
> >> >
> >> > > I'd like to move forward with
> >> > > https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing
> >> > > legacy (Twitter) commons zookeeper libraries used for Aurora leader
> >> > > election in favor of Apache Curator libraries. The change submitted in
> >> > > https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and
> >> > > Apache Curator based service discovery can be enabled with the Aurora
> >> > > scheduler flag `-zk_use_curator`.  I'd like feedback from users who
> >> > > enable this option.  If you have a test cluster where you can enable
> >> > > `-zk_use_curator` and exercise leader failure and failover, I'd be
> >> > > grateful. If you have moved to using this option in production with
> >> > > demonstrable improvements or even maintenance of status quo, I'd also
> >> > > be grateful for this news. If you've found regressions or new bugs,
> >> > > I'd love to know about those as well.
> >> > >
> >> > > Thanks in advance to all those who find time to test this out on real
> >> > > systems!
> >> > >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >
> >
>
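
The slow-roll analysis quoted above can be illustrated with a small toy model. The sketch below is not from the thread and does not touch ZooKeeper, Curator, or Aurora. It only assumes what the quoted message states: schedulers using the old election scheme and schedulers using the Curator scheme cannot see each other, so each group elects its own leader. The labels `old` and `curator`, the function names, and the "lowest node id wins" rule are all hypothetical stand-ins chosen for illustration. The model also leaves out the Mesos distributed log tie-break that causes one of the two leaders to flap.

```python
def leaders_by_scheme(schemes):
    """Map each election scheme to the node id that leads it.

    `schemes` is a list where the index is the node id and the value is the
    (hypothetical) scheme label that node uses. Nodes on different schemes
    never compete, so every scheme in use ends up with its own leader.
    "Lowest node id wins" is an assumption standing in for the real election.
    """
    leaders = {}
    for node_id, scheme in enumerate(schemes):
        leaders.setdefault(scheme, node_id)
    return leaders


def rolling_upgrade(num_nodes):
    """Upgrade one node at a time and record the leaders after each step."""
    schemes = ["old"] * num_nodes
    history = [leaders_by_scheme(schemes)]
    for node_id in range(num_nodes):
        schemes[node_id] = "curator"
        history.append(leaders_by_scheme(schemes))
    return history


def full_restart(num_nodes):
    """Whole cluster down, upgrade plus new flag, up: no mixed state exists."""
    before = leaders_by_scheme(["old"] * num_nodes)
    after = leaders_by_scheme(["curator"] * num_nodes)
    return [before, after]


if __name__ == "__main__":
    for step, leaders in enumerate(rolling_upgrade(3)):
        print(f"rolling step {step}: {len(leaders)} leader(s) {leaders}")
    for step, leaders in enumerate(full_restart(3)):
        print(f"full restart step {step}: {len(leaders)} leader(s) {leaders}")
```

In this model a three-node rolling upgrade has two simultaneous Aurora-level leaders after the first and second node upgrades and returns to a single leader only once the third node is upgraded, which matches the quoted description. The full-restart path never has a mixed old/Curator state, at the cost of scheduler downtime while the cluster is down.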
