aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kevin Sweeney <kevi...@apache.org>
Subject Re: Proposal: External Update Coordination
Date Tue, 14 Oct 2014 19:38:27 GMT
I'm +1 for using lack of heartbeats as a uniform unknown-or-unhealthy
signal, and punting on a more complex NACK signal (which we'd have to
reliably persist).

I think the only disagreement in this thread is whether the default state
for a new update should be running or waiting-for-heartbeat. I think
waiting for a heartbeat is not only a more correct implementation (no risk
of acting after a failover but before the heartbeat timeout) but simpler to
implement (initialize the PulseMonitor data structure as empty rather than
with a synthetic heartbeat).

>From an API consumer perspective the sequence is:

1. API client sends a startUpdate RPC to the scheduler
2. API client receives an OK response, then arranges for something to call
heartbeat with that updateId on some interval
3. Whatever is supposed to send heartbeats sends one immediately, then
starts sending them on some smaller interval

Waiting for the first heartbeat ensures that this sequence has been
completed successfully, while not waiting for it only ensure that step 1
has happened.


On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <wfarner@apache.org> wrote:

> Wait - simpler solution than what?  We're talking about not doing either.
>
> -=Bill
>
> On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <kevints@apache.org>
> wrote:
>
> > I think waiting for the first heartbeat before taking any action is the
> > simpler solution here as it allows the implementation to be entirely
> > soft-state and still catches the bugs I described.
> >
> > The implementation is just PulseMonitorImpl<UpdateId> - heartbeat calls
> > pulse and mutation operations check isAlive. I think the code might
> > actually work as-is.
> >
> > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko <maxim@apache.org>
> > wrote:
> >
> > > Pausing update on creation seems like a logical approach when dealing
> > > with inverted dependency model. I.e. updater is happy to act as long
> > > as it's greenlighted by the external signal. It's also aligned with a
> > > failover experience where coordinated updates are rehydrated in paused
> > > state waiting for HB awakening. That said, I am OK punting it for the
> > > sake of simplicity for now.
> > >
> > > Kevin?
> > >
> > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <wfarner@apache.org>
> > wrote:
> > > > If the goal is to reduce complexity now and add features later, why
> not
> > > > nuke both for now - kick off the update right away, and let lack of
> > > > heartbeats serve as a uniform "unknown or unhealthy" signal?
> > > >
> > > > -=Bill
> > > >
> > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko <maxim@apache.org
> >
> > > wrote:
> > > >
> > > >> I am still +1 on the idea to have default paused state on creation.
> I
> > > >> think we could still differentiate between initially paused and
> timed
> > > >> out states internally by looking at pause reason. It's quite
> different
> > > >> if we want to store explicit NACK reasons from the external service
> > > >> though. That would require persistence and a bit more complicated
> > > >> logic.
> > > >>
> > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney <kevints@apache.org>
> > > wrote:
> > > >> > I like the idea of implementing this scheduler-side purely through
> > > >> volatile
> > > >> > state, but the lack of feedback (generic vs specific error
> messages
> > > when
> > > >> an
> > > >> > update is paused) leaves something to be desired. Maybe we can
> > address
> > > >> that
> > > >> > with a metadata field in the initial call to startUpdate (with
an
> > > >> optional
> > > >> > link to a page where one can get more rich information about
the
> > > state of
> > > >> > the monitor sending/not sending heartbeats).
> > > >> >
> > > >> > The main drawback is that we may have to wait a maximum of one
> > > heartbeat
> > > >> > interval to find out that an update should be paused.
> > > >> >
> > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko <
> > maxim@apache.org>
> > > >> wrote:
> > > >> >
> > > >> >> The main reason I preferred the lack-of-ACK approach over
an
> > explicit
> > > >> >> NACK one is simplicity. As Joshua pointed out there is more
state
> > to
> > > >> >> handle in that case. The lack-of-ACK model can be completely
> > > >> >> implemented in volatile memory sidestepping the persistent
> storage
> > > >> >> entirely. With the NACK we would need to reliably persist
> external
> > > >> >> service call reasons to survive scheduler failovers. Not
a huge
> > > >> >> challenge but something to keep in mind.
> > > >> >>
> > > >> >> I still think the simplicity/reliability tradeoff is acceptable
> > here
> > > >> >> if we rely on external service to abort heartbeats in case
of a
> > > health
> > > >> >> alert fired. This can be explicitly documented as an external
> > > >> >> integration requirement. However, If the consensus is to
go a
> more
> > > >> >> reliable (though more complicated) NACK route I am happy
to
> > > reconsider
> > > >> >> the current proposal.
> > > >> >>
> > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <
> > > jcohen@twopensource.com>
> > > >> >> wrote:
> > > >> >> > "The heratbeatJobUpdate RPC serves as an ACK, but we
don't
> have a
> > > >> NACK.
> > > >> >> If
> > > >> >> > we are going to let lack-of-ACK serve as the NACK, i
don't
> think
> > > it's
> > > >> >> safe
> > > >> >> > to resume when we receive another ACK.  In other words,
a
> service
> > > >> >> toggling
> > > >> >> > unhealthy might not be deemed safe to proceed."
> > > >> >> >
> > > >> >> > Lack-of-ACK is the scenario where connectivity between
the
> > monitor
> > > and
> > > >> >> the
> > > >> >> > scheduler is unavailable. Shouldn't the NACK scenario
> (everything
> > > is
> > > >> not
> > > >> >> > ok!) be handled by the monitoring service triggering
an
> explicit
> > > >> pause?
> > > >> >> > I.e. section 2 should be updated to say "External service
> detects
> > > >> service
> > > >> >> > health problems and pauses the update" and section 4
becomes
> the
> > > >> current
> > > >> >> > section 2 (i.e. "Should a heartbeat not be received
the
> scheduler
> > > >> pauses
> > > >> >> > the update.").
> > > >> >> >
> > > >> >> > I agree that it's unsafe to to resume updates after
receiving a
> > > >> heartbeat
> > > >> >> > after previously pausing due to a missed heartbeat.
In that
> > > scenario
> > > >> I'd
> > > >> >> > think we'd want an explicit resumeJobUpdate. If the
scenario
> > we're
> > > >> trying
> > > >> >> > to handle is *never* received a heartbeat, that's a
separate
> > > matter,
> > > >> in
> > > >> >> > that case unpausing upon receiving the first heartbeat
would
> make
> > > >> sense,
> > > >> >> > but it feels like that complicates things quite a bit
(now we
> > need
> > > to
> > > >> >> > differentiate between heartbeat #1 and hearbeat #N).
> > > >> >> >
> > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <
> wfarner@apache.org
> > >
> > > >> wrote:
> > > >> >> >
> > > >> >> >> What is the guidance for deploying while the heartbeat
service
> > is
> > > >> >> broken?
> > > >> >> >> I think i know the answer, but it's important to
spell out.
> > > >> >> >>
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> > Create a new coordinated job update in a paused
> > > >> (ROLL_FORWARD_PAUSED)
> > > >> >> >> > state to avoid any progress until the first
heartbeat call
> > > arrives.
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> I'm not sold on this being ultimately beneficial.
 In the
> worst
> > > case,
> > > >> >> >> impact is still limited by the health check threshold.
 Seems
> > like
> > > >> >> >> premature optimization at best, and an odd one if
we proceed
> > > without
> > > >> a
> > > >> >> >> 'NACK' signal via the heartbeatJobUpdate RPC.
> > > >> >> >>
> > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat
update via a
> > > >> >> >> > resumeJobUpdate call.
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> Are heartbeats required while rolling back?  If
so, that might
> > > impact
> > > >> >> the
> > > >> >> >> design here and in other places.
> > > >> >> >>
> > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat
update via a
> > > fresh
> > > >> >> >> > heartbeatJobUpdate call.
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> The heratbeatJobUpdate RPC serves as an ACK, but
we don't
> have a
> > > >> NACK.
> > > >> >> If
> > > >> >> >> we are going to let lack-of-ACK serve as the NACK,
i don't
> think
> > > it's
> > > >> >> safe
> > > >> >> >> to resume when we receive another ACK.  In other
words, a
> > service
> > > >> >> toggling
> > > >> >> >> unhealthy might not be deemed safe to proceed.
> > > >> >> >>
> > > >> >> >> Perhaps just sending OK (or a NOOP equivalent) in
case of a
> > > >> user-paused
> > > >> >> job
> > > >> >> >> > update would make more sense as there is nothing
monitoring
> > > service
> > > >> >> could
> > > >> >> >> > do in that case. This should work fine with
pause/resume
> > > >> >> -aware/-agnostic
> > > >> >> >> > monitoring service implementation.
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> This seems reasonable to me - heartbeats for a paused
update
> > > should
> > > >> not
> > > >> >> >> pose a risk, but can be safely ignored.
> > > >> >> >>
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> -=Bill
> > > >> >> >>
> > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko
<
> > > >> maxim@apache.org>
> > > >> >> >> wrote:
> > > >> >> >>
> > > >> >> >> > Agreed. That would be a logical generalization
of the post
> > > failover
> > > >> >> >> > behavior.
> > > >> >> >> >
> > > >> >> >> > I have updated the above document with the
following
> changes:
> > > >> >> >> > - Reply with PAUSED any time a job was paused
by user;
> > > >> >> >> > - Start in paused state by default.
> > > >> >> >> >
> > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney
<
> > > >> kevints@apache.org>
> > > >> >> >> > wrote:
> > > >> >> >> > > The doc mentioned that the scheduler will
start an update
> > > >> subject to
> > > >> >> >> the
> > > >> >> >> > > heartbeat countdown, and if it doesn't
receive a heartbeat
> > it
> > > >> will
> > > >> >> >> pause
> > > >> >> >> > > the update. Why not start with the update
> > > >> >> paused-due-to-no-heartbeat to
> > > >> >> >> > > fail-fast any connectivity issues between
the service
> > > providing
> > > >> the
> > > >> >> >> > > heartbeats and the scheduler?
> > > >> >> >> > >
> > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim
Khutornenko <
> > > >> >> maxim@apache.org>
> > > >> >> >> > > wrote:
> > > >> >> >> > >
> > > >> >> >> > >> Hi all,
> > > >> >> >> > >>
> > > >> >> >> > >> We are proposing a new feature for
the scheduler updater,
> > > which
> > > >> you
> > > >> >> >> > >> may find helpful.
> > > >> >> >> > >>
> > > >> >> >> > >> I have posed a brief feature summary
here:
> > > >> >> >> > >>
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > > >> >> >> > >>
> > > >> >> >> > >> Please, reply with your feedback/concerns/comments.
> > > >> >> >> > >>
> > > >> >> >> > >> Thanks,
> > > >> >> >> > >> Maxim
> > > >> >> >> > >>
> > > >> >> >> >
> > > >> >> >>
> > > >> >>
> > > >>
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message