aurora-dev mailing list archives

From Bill Farner <wfar...@apache.org>
Subject Re: Proposal: External Update Coordination
Date Mon, 13 Oct 2014 22:09:56 GMT
Re: user experience, NACK-via-timeout fails here as well.

"PAUSED - Heartbeat not received in 60s" is objectively worse than "PAUSED
- Heartbeat failed: high 502 rate".

This is part of the impedance mismatch I'm calling out.
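
To make the contrast concrete, here is a minimal sketch of how the two status lines above could be derived. It assumes a heartbeat that can carry an explicit failure reason; the `Heartbeat` type, its `ok` and `reason` fields, and `pause_message` are hypothetical names for illustration and are not part of the proposed heartbeatJobUpdate RPC.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Heartbeat:
    """Hypothetical heartbeat payload (assumption, not the proposed RPC)."""
    ok: bool
    reason: Optional[str] = None  # only meaningful when ok is False


def pause_message(last: Optional[Heartbeat], seconds_since_last: float,
                  timeout_s: float = 60.0) -> Optional[str]:
    """Return the status shown to the user, or None if the update may proceed.

    seconds_since_last is measured from the update start when no heartbeat
    has arrived yet.
    """
    if last is not None and not last.ok:
        # Explicit NACK: the monitoring service can say why it objects.
        return f"PAUSED - Heartbeat failed: {last.reason}"
    if seconds_since_last > timeout_s:
        # NACK-via-timeout: the scheduler only knows that nothing arrived.
        return f"PAUSED - Heartbeat not received in {timeout_s:.0f}s"
    return None
```

With only the timeout branch available, a failing service and a broken network path both produce the same message, which is the loss of information described above.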

-=Bill

On Mon, Oct 13, 2014 at 3:03 PM, Kevin Sweeney <kevints@apache.org> wrote:

> On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfarner@apache.org> wrote:
>
> > What is the guidance for deploying while the heartbeat service is broken?
> > I think I know the answer, but it's important to spell out.
> >
> >
> >
> > > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> > > state to avoid any progress until the first heartbeat call arrives.
> >
> >
> > I'm not sold on this being ultimately beneficial.  In the worst case,
> > impact is still limited by the health check threshold.  Seems like
> > premature optimization at best, and an odd one if we proceed without a
> > 'NACK' signal via the heartbeatJobUpdate RPC.
>
> The benefit is huge IMO for quickly detecting connectivity issues between
> the scheduler and the heartbeat service. There's a lot more information
> contained in the first successful heartbeat than the second, plus we can
> show the user a message like "PAUSED - Waiting for heartbeat". This is a
> better user experience than waiting for a timeout before revealing that
> progress will never be made.
>
>
> >
> >
> > > Allow resuming of the paused-due-to-no-heartbeat update via a
> > > resumeJobUpdate call.
> >
> >
> > Are heartbeats required while rolling back?  If so, that might impact the
> > design here and in other places.
> >
> > > Allow resuming of the paused-due-to-no-heartbeat update via a fresh
> > > heartbeatJobUpdate call.
> >
> >
> > The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK.  If
> > we are going to let lack-of-ACK serve as the NACK, I don't think it's safe
> > to resume when we receive another ACK.  In other words, a service toggling
> > unhealthy might not be deemed safe to proceed.
> >
> > > Perhaps just sending OK (or a NOOP equivalent) in case of a user-paused
> > > job update would make more sense, as there is nothing a monitoring
> > > service could do in that case. This should work fine with both
> > > pause/resume-aware and pause/resume-agnostic monitoring service
> > > implementations.
> >
> >
> > This seems reasonable to me - heartbeats for a paused update should not
> > pose a risk, and can be safely ignored.
> >
> >
> >
> > -=Bill
> >
> > On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <maxim@apache.org>
> > wrote:
> >
> > > Agreed. That would be a logical generalization of the post failover
> > > behavior.
> > >
> > > I have updated the above document with the following changes:
> > > - Reply with PAUSED any time a job was paused by user;
> > > - Start in paused state by default.
> > >
> > > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevints@apache.org>
> > > wrote:
> > > > The doc mentioned that the scheduler will start an update subject to the
> > > > heartbeat countdown, and if it doesn't receive a heartbeat it will pause
> > > > the update. Why not start with the update paused-due-to-no-heartbeat to
> > > > fail fast on any connectivity issues between the service providing the
> > > > heartbeats and the scheduler?
> > > >
> > > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <maxim@apache.org>
> > > > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> We are proposing a new feature for the scheduler updater, which you
> > > >> may find helpful.
> > > >>
> > > >> I have posted a brief feature summary here:
> > > >>
> > > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > > >>
> > > >> Please, reply with your feedback/concerns/comments.
> > > >>
> > > >> Thanks,
> > > >> Maxim
> > > >>
> > >
> >
>
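
For readers following the quoted thread, the pause/resume rules under discussion can be sketched as a toy state machine. This is an illustration under stated assumptions, not the scheduler's implementation: the `CoordinatedUpdate` class, the `Pause` enum, the `tick` method, and the `resume_on_heartbeat` flag are hypothetical names. Only the ROLL_FORWARD_PAUSED state, the OK/PAUSED replies, and the 60s figure come from the messages above. The flag exists because the thread does not settle whether a fresh heartbeat should resume an update that was paused by a heartbeat timeout.

```python
from enum import Enum
from typing import Optional


class Pause(Enum):
    AWAITING_FIRST_HEARTBEAT = 1
    HEARTBEAT_TIMEOUT = 2
    USER = 3


class CoordinatedUpdate:
    """Toy model of a heartbeat-coordinated update; all names are hypothetical."""

    def __init__(self, timeout_s: float = 60.0, resume_on_heartbeat: bool = True):
        self.timeout_s = timeout_s
        self.resume_on_heartbeat = resume_on_heartbeat
        # Start paused so a broken heartbeat path is visible immediately.
        self.pause_reason: Optional[Pause] = Pause.AWAITING_FIRST_HEARTBEAT
        self.last_heartbeat: Optional[float] = None

    @property
    def state(self) -> str:
        if self.pause_reason is not None:
            return "ROLL_FORWARD_PAUSED"
        return "ROLLING_FORWARD"

    def heartbeat(self, now: float) -> str:
        """Handle a heartbeat call and return the reply sent to the caller."""
        if self.pause_reason is Pause.USER:
            # A user-paused update ignores heartbeats and says so.
            return "PAUSED"
        self.last_heartbeat = now
        if self.pause_reason is Pause.AWAITING_FIRST_HEARTBEAT:
            self.pause_reason = None
        elif (self.pause_reason is Pause.HEARTBEAT_TIMEOUT
              and self.resume_on_heartbeat):
            # Open question in the thread: is resuming on a fresh ACK safe?
            self.pause_reason = None
        return "OK"

    def tick(self, now: float) -> None:
        """Periodic check: pause a running update whose heartbeat is overdue."""
        if self.pause_reason is None and now - self.last_heartbeat > self.timeout_s:
            self.pause_reason = Pause.HEARTBEAT_TIMEOUT

    def user_pause(self) -> None:
        self.pause_reason = Pause.USER

    def user_resume(self, now: float) -> None:
        """Explicit resume clears any pause and restarts the countdown."""
        self.pause_reason = None
        self.last_heartbeat = now
```

Constructing the model with `resume_on_heartbeat=False` gives the stricter behavior argued for in the thread, where a timeout pause is cleared only by an explicit resume.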
