aurora-dev mailing list archives

From Jordan Ly <jordan....@gmail.com>
Subject Re: Reducing Failover Time by Eagerly Reading/Replaying Log in Followers
Date Wed, 26 Jul 2017 23:22:09 GMT
Thanks for the comments everyone!

Bill definitely brings up some good points. I've added additional data
to the document in order to better substantiate the claim.

My original graph used an incorrect query that did not specify the
correct snapshot_apply time. My new graph gives more insight into
where the time in 'scheduler_log_recover_nanos_total' was spent
(applying the snapshot, actually reading from leveldb, and some time
not captured by metrics). Additionally, I've added actual logs showing
what happens from Mesos disconnecting from the framework up to the new
leader reconnecting to Mesos. The data points we have from other
failovers are consistent with this one case. Thus, per the proposal,
keeping a follower's log and volatile store up to date would let us:
1) eliminate the time it takes to apply the snapshot during the actual
failover, and 2) reduce the time spent replaying individual log
entries (we only need to replay from the last time a catch-up was
triggered).
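
To make this concrete, here is a rough sketch of the follower-side
catch-up loop we have in mind. The names below (LogReader,
VolatileStore, FollowerCatchup) are illustrative placeholders, not
actual Aurora or Mesos interfaces:

    import java.util.List;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Placeholder interfaces -- not real Aurora/Mesos APIs.
    interface LogReader {
      // Returns the entries at positions (afterPosition, end of log].
      List<byte[]> readFrom(long afterPosition);
    }

    interface VolatileStore {
      void apply(byte[] logEntry);
    }

    // Periodically reads new replicated-log entries on a follower and
    // applies them to a local volatile store, so that a newly elected
    // leader only replays entries written since the last catch-up.
    class FollowerCatchup {
      private final LogReader log;
      private final VolatileStore store;
      private long lastAppliedPosition = -1;

      FollowerCatchup(LogReader log, VolatileStore store) {
        this.log = log;
        this.store = store;
      }

      synchronized void catchUp() {
        for (byte[] entry : log.readFrom(lastAppliedPosition)) {
          store.apply(entry);
          lastAppliedPosition++;
        }
      }

      void start(ScheduledExecutorService executor, long intervalSecs) {
        executor.scheduleAtFixedRate(
            this::catchUp, 0, intervalSecs, TimeUnit.SECONDS);
      }
    }

On failover, the new leader would skip the snapshot apply entirely and
replay only the entries written after lastAppliedPosition.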

Echoing what David said, the implementation details would follow after
we ensure this is a reasonable plan and a good use of effort.


On Wed, Jul 26, 2017 at 3:25 PM, David McLaughlin
<dmclaughlin@apache.org> wrote:
> One thing we should make clear: we already have a working prototype of
> the 'catch-up' logic in the replicated log. The next step was to take
> this functionality and make use of it in Aurora as a proof of concept
> before upstreaming it. The main "threads" we're trying to explore are:
>
> 1) Reducing unplanned failovers (and API timeouts) due to
> stop-the-world GC pauses.
> 2) Reducing write unavailability due to write lock contention (e.g. 40s
> snapshot times leading to API timeouts every hour).
> 3) Reducing the cost of a failover by speeding up leader recovery time.
>
> The proposal here is obviously targeted at (3), whereas my patches for
> snapshot deduplication and the snapshot creation proposal were aimed
> more at (2). The big idea we had for (1) was moving snapshots (and
> backups) into followers, which would obviously require that Jordan's
> proposal here be shipped first.
>
> It wasn't clear to me how difficult this would be to add to the Scheduler,
> so I wanted to make sure we shared our intentions before investing too much
> effort, in case there was either some fundamental flaw in the approach or
> some easier win.
>
>
> On Wed, Jul 26, 2017 at 12:03 PM, Bill Farner <wfarner@apache.org> wrote:
>
>> Some (hopefully) constructive criticism:
>>
>> - the doc is very high-level on the problem statement and the
>> proposal, making it difficult to agree with prioritizing it over
>> cheaper snapshots or the oft-discussed support of an external DBMS.
>>
>> - the supporting data is a single data point of the
>> scheduler_log_recover_nanos_total metric.  More data points and more
>> detail on this data (how many entries/bytes did this represent?) would
>> help normalize the metric, and possibly indicate whether recover time
>> is linear or non-linear.  Finer-grained information would also help
>> (where was time spent within the replay - GC?  reading log entries?
>> inflating snapshots?).  See the phase-timing sketch at the end of this
>> mail for the kind of instrumentation I mean.
>>
>> - the doc calls out parts (1) mesos log support and (2) scheduler support.
>> Is the planned approach to gain value from (1) before (2), or are both
>> needed?
>>
>> - for (2) scheduler support, can you add detail on the implementation?
>> Much of the scheduler code assumes it is the leader
>> (CallOrderEnforcingStorage is currently a gatekeeper to avoid mistakes
>> of this type), so I would caution against replaying directly into the
>> main Storage.
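>>
>> To illustrate: replay could go into a separate store that is only
>> promoted once leadership is won.  A rough sketch (these names are
>> made up for illustration, not the actual CallOrderEnforcingStorage
>> API):
>>
>>   // Placeholder -- stands in for the scheduler's storage interface.
>>   interface Storage {}
>>
>>   class LeadershipGatedStorage {
>>     private final Storage main;          // leader-only storage
>>     private final Storage replayTarget;  // follower-side copy
>>     private volatile boolean isLeader = false;
>>
>>     LeadershipGatedStorage(Storage main, Storage replayTarget) {
>>       this.main = main;
>>       this.replayTarget = replayTarget;
>>     }
>>
>>     void onLeadershipAcquired() { isLeader = true; }
>>
>>     // Follower-side replay must never touch the main storage.
>>     Storage storageForReplay() {
>>       return isLeader ? main : replayTarget;
>>     }
>>   }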
>>
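>> Regarding the finer-grained timing above, even something simple like
>> this would help (an illustrative sketch, not an existing Aurora
>> class):
>>
>>   import java.util.LinkedHashMap;
>>   import java.util.Map;
>>
>>   // Accumulates wall-clock time per named recovery phase.
>>   class ReplayTimer {
>>     private final Map<String, Long> phaseNanos = new LinkedHashMap<>();
>>
>>     void time(String phase, Runnable work) {
>>       long start = System.nanoTime();
>>       work.run();
>>       phaseNanos.merge(phase, System.nanoTime() - start, Long::sum);
>>     }
>>
>>     Map<String, Long> breakdown() {
>>       return phaseNanos;
>>     }
>>   }
>>
>>   // Usage during recovery (phase names are examples):
>>   //   timer.time("inflate_snapshot", () -> inflateSnapshot());
>>   //   timer.time("read_log_entries", () -> readLogEntries());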
>>
>> On Wed, Jul 26, 2017 at 1:56 PM, Santhosh Kumar Shanmugham <
>> sshanmugham@twitter.com.invalid> wrote:
>>
>> > +1
>> >
>> > This sets the stage for more potential benefits by offloading work
>> > that consumes stable data (and is not affected by minor
>> > inconsistencies) from the leading scheduler.
>> >
>> > On Wed, Jul 26, 2017 at 10:31 AM, David McLaughlin <
>> > dmclaughlin@apache.org> wrote:
>> >
>> > > I'm +1 to this approach over my proposal. With the enforced daily
>> > > failover, it's a much bigger win to make failovers "cheap" than
>> > > making snapshots cheap, and this is going to be backwards
>> > > compatible too.
>> > >
>> > > On Wed, Jul 26, 2017 at 9:51 AM, Jordan Ly <jordan.ly8@gmail.com>
>> > > wrote:
>> > >
>> > > > Hello everyone!
>> > > >
>> > > > I've created a document with an initial proposal to reduce leader
>> > > > failover time by eagerly reading and replaying the replicated log
>> > > > in followers:
>> > > >
>> > > > https://docs.google.com/document/d/10SYOq0ehLMFKQ9rX2TGC_xpM--GBnstzMFP-tXGQaVI/edit?usp=sharing
>> > > >
>> > > > We wanted to open up this topic for discussion with the community
>> > > > and see if anyone had any alternate opinions or recommendations
>> > > > before starting the work.
>> > > >
>> > > > If this solution seems reasonable, we will write and release a design
>> > > > document for a more formal discussion and review.
>> > > >
>> > > > Please feel free to comment on the doc, or let me know if you
>> > > > have any concerns.
>> > > >
>> > > > -Jordan
>> > > >
>> > >
>> >
>>
