aurora-dev mailing list archives

From Bill Farner <wfar...@apache.org>
Subject Re: Reducing Failover Time by Eagerly Reading/Replaying Log in Followers
Date Wed, 26 Jul 2017 19:03:13 GMT
Some (hopefully) constructive criticism:

- the doc is very high-level on the problem statement and the proposal,
making it difficult to agree with prioritization over cheaper snapshots or
the oft-discussed support of an external DBMS.

- the supporting data is a single data point of the
scheduler_log_recover_nanos_total metric.  More data points and more detail
on this data (how many entries/bytes did this represent?) would help
normalize the metric, and possibly indicate whether recovery time grows
linearly or non-linearly with log size.  Finer-grained information would
also help (where was time
spent within the replay - GC?  reading log entries?  inflating snapshots?).

- the doc calls out parts (1) mesos log support and (2) scheduler support.
Is the planned approach to gain value from (1) before (2), or are both
needed?

- for (2) scheduler support, can you add detail on the implementation?
Much of the scheduler code assumes it is the leader
(CallOrderEnforcingStorage is currently a gatekeeper to avoid mistakes of
this type), so I would caution against replaying directly into the main
Storage.
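
To make the request for normalized data concrete, here is a minimal sketch of the kind of analysis that several data points would allow. It assumes each sample records the number of log entries and bytes replayed alongside the scheduler_log_recover_nanos_total value; the RecoverySample type and all function names are hypothetical and not part of Aurora. It computes per-entry and per-byte cost, then fits a least-squares line so that a poor fit (R² well below 1) hints at non-linear recovery time.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class RecoverySample:
    """One observed recovery. Hypothetical structure, not an Aurora type."""
    entries: int        # log entries replayed during recovery
    log_bytes: int      # bytes read from the replicated log
    recover_nanos: int  # scheduler_log_recover_nanos_total for this recovery


def nanos_per_entry(sample: RecoverySample) -> float:
    """Normalize recovery time by the number of entries replayed."""
    return sample.recover_nanos / sample.entries


def nanos_per_byte(sample: RecoverySample) -> float:
    """Normalize recovery time by the number of bytes read."""
    return sample.recover_nanos / sample.log_bytes


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs: Sequence[float], ys: Sequence[float],
              slope: float, intercept: float) -> float:
    """Coefficient of determination; values near 1.0 suggest linear scaling."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot
```

With real samples, xs would be the entry counts (or byte counts) and ys the recovery times. At least three samples with differing log sizes are needed before the fit says anything useful.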


On Wed, Jul 26, 2017 at 1:56 PM, Santhosh Kumar Shanmugham <sshanmugham@twitter.com.invalid> wrote:

> +1
>
> This sets up the stage for more potential benefits by offloading work from
> the leading scheduler that consumes stable data (that is not affected by
> minor inconsistencies).
>
> On Wed, Jul 26, 2017 at 10:31 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
>
> > I'm +1 to this approach over my proposal. With the enforced daily failover,
> > it's a much bigger win to make failovers "cheap" than making snapshots
> > cheap, and this is going to be backwards compatible too.
> >
> > On Wed, Jul 26, 2017 at 9:51 AM, Jordan Ly <jordan.ly8@gmail.com> wrote:
> >
> > > Hello everyone!
> > >
> > > I've created a document with an initial proposal to reduce leader
> > > failover time by eagerly reading and replaying the replicated log in
> > > followers:
> > >
> > > https://docs.google.com/document/d/10SYOq0ehLMFKQ9rX2TGC_xpM--GBnstzMFP-tXGQaVI/edit?usp=sharing
> > >
> > > We wanted to open up this topic for discussion with the community and
> > > see if anyone had any alternate opinions or recommendations before
> > > starting the work.
> > >
> > > If this solution seems reasonable, we will write and release a design
> > > document for a more formal discussion and review.
> > >
> > > Please feel free to comment on the doc, or let me know if you have any
> > > concerns.
> > >
> > > -Jordan
> > >
> >
>
