aurora-dev mailing list archives

From John Sirois <jsir...@apache.org>
Subject Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)
Date Thu, 16 Jun 2016 19:18:16 GMT
On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin <
martin.hrabovcin@gmail.com> wrote:

> How should this flag be rolled out to an existing running cluster? Can it be
> done as a rolling update, instance by instance, or do we need to stop the
> whole cluster and then bring all nodes up with the new flag?
>

I recommend taking the whole cluster down, upgrading with the new flag set,
and then bringing it back up.

A rolling update should work, but will likely be rocky.  My analysis:

Aurora leader election consists of 2 components: the leader election itself
and the resulting advertisement by the leader of itself as the Aurora service
endpoint.  Both components use ZooKeeper, and of the 2 I only ensured that
the advertisement was compatible with old releases (old clients).  The leader
election portion is completely internal to the Aurora scheduler instances
vying for leadership and, under Curator, uses a different (enhanced)
ZooKeeper node scheme.  As a result, this is what could happen in a slow
roll:

before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
upgrade 0: new-lead, 1: old-lead, 2: old-follow

Here, node 0 will see itself as leader and nodes 1 and 2 will see node 1 as
leader.  The result will be both node 0 and node 1 attempting to read the
Mesos distributed log.  The log uses its own leader election, and as things
stand the reader must be the log leader, so the Aurora-level leadership
"tie" will be broken when one of the 2 Aurora-level leaders fails to become
the Mesos distributed log leader; that node will then restart its lifecycle,
i.e. flap.  This will continue to be the case after the second node is
upgraded and will not stabilize until the 3rd node is upgraded.
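
To make the slow roll concrete, here is a minimal Python sketch of the
scenario described above. It is a toy model, not Aurora code: the
`leaders_by_scheme` and `rolling_upgrade` names are hypothetical, and it
assumes that the old and new election schemes cannot see each other and that,
within a scheme, the lowest-numbered participating node wins (which matches
the 3-node example above).

```python
def leaders_by_scheme(node_schemes):
    """Map each election scheme to the node that believes it leads it.

    node_schemes[i] is "old" or "new" for node i. Assumption: within one
    scheme the lowest-numbered participant wins the election.
    """
    leaders = {}
    for node, scheme in enumerate(node_schemes):
        # Only the first node seen for a scheme is recorded as its leader.
        leaders.setdefault(scheme, node)
    return leaders


def rolling_upgrade(node_count):
    """Upgrade nodes one at a time and record the leaders after each step."""
    schemes = ["old"] * node_count
    history = [leaders_by_scheme(schemes)]
    for node in range(node_count):
        schemes[node] = "new"
        history.append(leaders_by_scheme(schemes))
    return history


for step, leaders in enumerate(rolling_upgrade(3)):
    # More than one Aurora-level leader means two schedulers contend for the
    # Mesos distributed log, so one of them will flap.
    status = "stable" if len(leaders) == 1 else "flapping"
    print(step, leaders, status)
```

Under those assumptions the model reports two Aurora-level leaders after the
first and second node upgrades, and a single leader only once all 3 nodes
have been upgraded.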


>
> 2016-06-16 5:03 GMT+02:00 Jake Farrell <jfarrell@apache.org>:
>
>> +1, will enable on our test clusters to help verify
>>
>> -Jake
>>
>> On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <jsirois@apache.org> wrote:
>>
>> > I'd like to move forward with
>> > https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing
>> > legacy (Twitter) commons zookeeper libraries used for Aurora leader
>> > election in favor of Apache Curator libraries. The change submitted in
>> > https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and
>> > Apache Curator based service discovery can be enabled with the Aurora
>> > scheduler flag `-zk_use_curator`.  I'd like feedback from users who
>> > enable this option.  If you have a test cluster where you can enable
>> > `-zk_use_curator` and exercise leader failure and failover, I'd be
>> > grateful. If you have moved to using this option in production with
>> > demonstrable improvements or even maintenance of status quo, I'd also be
>> > grateful for this news. If you've found regressions or new bugs, I'd
>> > love to know about those as well.
>> >
>> > Thanks in advance to all those who find time to test this out on real
>> > systems!
>> >
>>
>
>
