aurora-dev mailing list archives

From Joshua Cohen <jco...@twopensource.com>
Subject Re: Proposal: External Update Coordination
Date Mon, 13 Oct 2014 22:50:48 GMT
"The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK.  If
we are going to let lack-of-ACK serve as the NACK, I don't think it's safe
to resume when we receive another ACK.  In other words, a service toggling
unhealthy might not be deemed safe to proceed."

Lack-of-ACK is the scenario where connectivity between the monitor and the
scheduler is unavailable. Shouldn't the NACK scenario (everything is not
ok!) be handled by the monitoring service triggering an explicit pause?
I.e. section 2 should be updated to say "External service detects service
health problems and pauses the update" and section 4 becomes the current
section 2 (i.e. "Should a heartbeat not be received the scheduler pauses
the update.").

I agree that it's unsafe to resume an update on receiving a heartbeat
after it was previously paused due to a missed heartbeat. In that scenario
I'd think we'd want an explicit resumeJobUpdate. If the scenario we're
trying to handle is *never* having received a heartbeat, that's a separate
matter. In that case unpausing upon receiving the first heartbeat would
make sense, but it feels like that complicates things quite a bit (now we
need to differentiate between heartbeat #1 and heartbeat #N).
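
To make the semantics argued for above concrete, here is a minimal sketch in
Python. It is an illustration only, not Aurora's implementation: the class,
state, and method names (CoordinatedUpdate, PAUSED_NO_HEARTBEAT,
PAUSED_BY_MONITOR, tick, and so on) are hypothetical and do not come from the
proposal document. It assumes a heartbeat only refreshes the timer, a missed
heartbeat pauses the update, the monitor sends an explicit pause as its NACK,
and only an explicit resume unpauses.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # Hypothetical state names, chosen for illustration only.
    ROLLING_FORWARD = "ROLLING_FORWARD"
    PAUSED_BY_MONITOR = "PAUSED_BY_MONITOR"      # explicit pause, i.e. the NACK
    PAUSED_NO_HEARTBEAT = "PAUSED_NO_HEARTBEAT"  # lack-of-ACK


@dataclass
class CoordinatedUpdate:
    heartbeat_timeout: float      # seconds allowed between heartbeats
    last_heartbeat: float = 0.0   # timestamp of the most recent heartbeat
    state: State = State.ROLLING_FORWARD

    def heartbeat(self, now: float) -> State:
        """ACK from the monitor: refreshes the timer but never unpauses."""
        self.last_heartbeat = now
        return self.state

    def tick(self, now: float) -> State:
        """Periodic scheduler check: pause if the heartbeat is overdue."""
        overdue = now - self.last_heartbeat > self.heartbeat_timeout
        if self.state is State.ROLLING_FORWARD and overdue:
            self.state = State.PAUSED_NO_HEARTBEAT
        return self.state

    def pause(self) -> State:
        """Explicit pause from the monitor when it detects health problems."""
        self.state = State.PAUSED_BY_MONITOR
        return self.state

    def resume(self, now: float) -> State:
        """Explicit resume: the only way out of either paused state."""
        self.last_heartbeat = now
        self.state = State.ROLLING_FORWARD
        return self.state
```

With a 10 second timeout and a last heartbeat at t=1, a tick at t=12 pauses
the update; a later heartbeat leaves it paused, and only resume() moves it
back to rolling forward. This sketch deliberately leaves out the "first
heartbeat unpauses" variant, since that is the case that would need to
distinguish heartbeat #1 from heartbeat #N.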

On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfarner@apache.org> wrote:

> What is the guidance for deploying while the heartbeat service is broken?
> I think I know the answer, but it's important to spell out.
>
>
>
> > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> > state to avoid any progress until the first heartbeat call arrives.
>
>
> I'm not sold on this being ultimately beneficial.  In the worst case,
> impact is still limited by the health check threshold.  Seems like
> premature optimization at best, and an odd one if we proceed without a
> 'NACK' signal via the heartbeatJobUpdate RPC.
>
> > Allow resuming of the paused-due-to-no-heartbeat update via a
> > resumeJobUpdate call.
>
>
> Are heartbeats required while rolling back?  If so, that might impact the
> design here and in other places.
>
> > Allow resuming of the paused-due-to-no-heartbeat update via a fresh
> > heartbeatJobUpdate call.
>
>
> The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK.  If
> we are going to let lack-of-ACK serve as the NACK, I don't think it's safe
> to resume when we receive another ACK.  In other words, a service toggling
> unhealthy might not be deemed safe to proceed.
>
> > Perhaps just sending OK (or a NOOP equivalent) in case of a user-paused job
> > update would make more sense as there is nothing the monitoring service could
> > do in that case. This should work fine with pause/resume -aware/-agnostic
> > monitoring service implementations.
>
>
> This seems reasonable to me - heartbeats for a paused update should not
> pose a risk, but can be safely ignored.
>
>
>
> -=Bill
>
> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <maxim@apache.org>
> wrote:
>
> > Agreed. That would be a logical generalization of the post failover
> > behavior.
> >
> > I have updated the above document with the following changes:
> > - Reply with PAUSED any time a job was paused by user;
> > - Start in paused state by default.
> >
> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevints@apache.org>
> > wrote:
> > > The doc mentioned that the scheduler will start an update subject to the
> > > heartbeat countdown, and if it doesn't receive a heartbeat it will pause
> > > the update. Why not start with the update paused-due-to-no-heartbeat to
> > > fail-fast any connectivity issues between the service providing the
> > > heartbeats and the scheduler?
> > >
> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <maxim@apache.org>
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> We are proposing a new feature for the scheduler updater, which you
> > >> may find helpful.
> > >>
> > >> I have posted a brief feature summary here:
> > >>
> > >>
> > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > >>
> > >> Please reply with your feedback/concerns/comments.
> > >>
> > >> Thanks,
> > >> Maxim
> > >>
> >
>
