aurora-dev mailing list archives

From John Sirois <jsir...@apache.org>
Subject Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)
Date Wed, 06 Jul 2016 23:13:36 GMT
Now that 0.15.0 has been released, I thought I'd check in on any progress
folks have made testing or deploying 0.14.0+ with the Aurora scheduler
`-zk_use_curator` flag in place.
One fix will go out in the 0.16.0 release to reduce logger noise on
shutdown [1][2], but otherwise I have heard no feedback, negative or
positive.

[1] https://issues.apache.org/jira/browse/AURORA-1729
[2] https://reviews.apache.org/r/49578/

On Thu, Jun 16, 2016 at 1:18 PM, John Sirois <jsirois@apache.org> wrote:

>
>
> On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin <
> martin.hrabovcin@gmail.com> wrote:
>
>> How should this flag be rolled out to an existing running cluster? Can it
>> be done with a rolling update, instance by instance, or do we need to stop
>> the whole cluster and then bring all nodes back up with the new flag?
>>
>
> I recommend taking the whole cluster down, upgrading and adding the new
> flag, then bringing it back up.
>
> A rolling update should work, but will likely be rocky.  My analysis:
>
> The Aurora leader election consists of 2 components: the actual leader
> election and the resulting advertisement by the leader of itself as the
> Aurora service endpoint.  Each of these 2 components uses ZooKeeper, and of
> the 2 I only ensured that the advertisement was compatible with old releases
> (old clients). The leader election portion is completely internal to the
> Aurora scheduler instances vying for leadership and, under Curator, uses a
> different (enhanced) ZooKeeper node scheme.  As a result, this is what
> could happen in a slow roll:
>
> before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
> upgrade 0: new-lead, 1: old-lead, 2: old-follow
>
> Here, node 0 will see itself as leader and nodes 1 and 2 will see node 1
> as leader. The result will be both node 0 and node 1 attempting to read the
> Mesos distributed log.  Now the log uses its own leader election and the
> reader must be the leader as things stand, so the Aurora-level leadership
> "tie" will be broken by one of the 2 Aurora-level leaders failing to become
> the Mesos distributed log leader, and that node will restart its lifecycle,
> i.e. flap.  This will continue to be the case after the second node is
> upgraded and will not stabilize until the 3rd node is upgraded.
>
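A minimal sketch of the slow-roll scenario above, assuming each election scheme simply picks the lowest-numbered node that uses it. The real elections (legacy commons ZooKeeper vs. Curator) and the Mesos distributed log are not modeled, and `leaders` and `slow_roll` are hypothetical names used only for illustration:

```python
from typing import Dict, List


def leaders(schemes: List[str]) -> Dict[str, int]:
    """Return the elected leader (node id) for each election scheme in use.

    Nodes on the "old" and "new" schemes use different ZooKeeper node
    layouts, so they only compete with peers on the same scheme. Picking
    the lowest node id is an arbitrary stand-in for a real election.
    """
    elected: Dict[str, int] = {}
    for node_id, scheme in enumerate(schemes):
        # The first node seen for a scheme wins that scheme's election.
        elected.setdefault(scheme, node_id)
    return elected


def slow_roll(node_count: int) -> List[Dict[str, int]]:
    """Upgrade nodes one at a time and record the leaders after each step."""
    schemes = ["old"] * node_count
    history = [leaders(schemes)]
    for node_id in range(node_count):
        schemes[node_id] = "new"
        history.append(leaders(schemes))
    return history


if __name__ == "__main__":
    for step, elected in enumerate(slow_roll(3)):
        # More than one Aurora-level leader means two nodes contend for the log.
        state = "stable" if len(elected) == 1 else "split"
        print(step, elected, state)
```

With 3 nodes this reports a single leader before the roll, two Aurora-level leaders after the first and second upgrades, and a single leader again only once the 3rd node is upgraded.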
>
>>
>> 2016-06-16 5:03 GMT+02:00 Jake Farrell <jfarrell@apache.org>:
>>
>>> +1, will enable on our test clusters to help verify
>>>
>>> -Jake
>>>
>>> On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <jsirois@apache.org> wrote:
>>>
>>> > I'd like to move forward with
>>> > https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing
>>> > legacy (Twitter) commons zookeeper libraries used for Aurora leader
>>> > election in favor of Apache Curator libraries. The change submitted in
>>> > https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and
>>> > Apache Curator based service discovery can be enabled with the Aurora
>>> > scheduler flag `-zk_use_curator`.  I'd like feedback from users who
>>> > enable this option.  If you have a test cluster where you can enable
>>> > `-zk_use_curator` and exercise leader failure and failover, I'd be
>>> > grateful. If you have moved to using this option in production with
>>> > demonstrable improvements or even maintenance of status quo, I'd also
>>> > be grateful for this news. If you've found regressions or new bugs,
>>> > I'd love to know about those as well.
>>> >
>>> > Thanks in advance to all those who find time to test this out on real
>>> > systems!
>>> >
>>>
>>
>>
>
