aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bill Farner <wfar...@apache.org>
Subject Re: Proposal: External Update Coordination
Date Tue, 14 Oct 2014 20:03:25 GMT
I'm not arguing against the merits of the approach.  Just feeling out
whether that should be done _after_ the rest of the heartbeat support.
Seems like it can be cleanly added at the end to get something usable
earlier.

-=Bill

On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney <kevints@apache.org> wrote:

> I'm +1 for using lack of heartbeats as a uniform unknown-or-unhealthy
> signal, and punting on a more complex NACK signal (which we'd have to
> reliably persist).
>
> I think the only disagreement in this thread is whether the default state
> for a new update should be running or waiting-for-heartbeat. I think
> waiting for a heartbeat is not only a more correct implementation (no risk
> of acting after a failover but before the heartbeat timeout) but simpler to
> implement (initialize the PulseMonitor data structure as empty rather than
> with a synthetic heartbeat).
>
> From an API consumer perspective the sequence is:
>
> 1. API client sends a startUpdate RPC to the scheduler
> 2. API client receives an OK response, then arranges for something to call
> heartbeat with that updateId on some interval
> 3. Whatever is supposed to send heartbeats sends one immediately, then
> starts sending them on some smaller interval
>
> Waiting for the first heartbeat ensures that this sequence has been
> completed successfully, while not waiting for it only ensure that step 1
> has happened.
>
>
> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <wfarner@apache.org> wrote:
>
> > Wait - simpler solution than what?  We're talking about not doing either.
> >
> > -=Bill
> >
> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <kevints@apache.org>
> > wrote:
> >
> > > I think waiting for the first heartbeat before taking any action is the
> > > simpler solution here as it allows the implementation to be entirely
> > > soft-state and still catches the bugs I described.
> > >
> > > The implementation is just PulseMonitorImpl<UpdateId> - heartbeat calls
> > > pulse and mutation operations check isAlive. I think the code might
> > > actually work as-is.
> > >
> > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko <maxim@apache.org>
> > > wrote:
> > >
> > > > Pausing update on creation seems like a logical approach when dealing
> > > > with inverted dependency model. I.e. updater is happy to act as long
> > > > as it's greenlighted by the external signal. It's also aligned with a
> > > > failover experience where coordinated updates are rehydrated in
> paused
> > > > state waiting for HB awakening. That said, I am OK punting it for the
> > > > sake of simplicity for now.
> > > >
> > > > Kevin?
> > > >
> > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <wfarner@apache.org>
> > > wrote:
> > > > > If the goal is to reduce complexity now and add features later, why
> > not
> > > > > nuke both for now - kick off the update right away, and let lack
of
> > > > > heartbeats serve as a uniform "unknown or unhealthy" signal?
> > > > >
> > > > > -=Bill
> > > > >
> > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko <
> maxim@apache.org
> > >
> > > > wrote:
> > > > >
> > > > >> I am still +1 on the idea to have default paused state on
> creation.
> > I
> > > > >> think we could still differentiate between initially paused and
> > timed
> > > > >> out states internally by looking at pause reason. It's quite
> > different
> > > > >> if we want to store explicit NACK reasons from the external
> service
> > > > >> though. That would require persistence and a bit more complicated
> > > > >> logic.
> > > > >>
> > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney <
> kevints@apache.org>
> > > > wrote:
> > > > >> > I like the idea of implementing this scheduler-side purely
> through
> > > > >> volatile
> > > > >> > state, but the lack of feedback (generic vs specific error
> > messages
> > > > when
> > > > >> an
> > > > >> > update is paused) leaves something to be desired. Maybe
we can
> > > address
> > > > >> that
> > > > >> > with a metadata field in the initial call to startUpdate
(with
> an
> > > > >> optional
> > > > >> > link to a page where one can get more rich information about
the
> > > > state of
> > > > >> > the monitor sending/not sending heartbeats).
> > > > >> >
> > > > >> > The main drawback is that we may have to wait a maximum
of one
> > > > heartbeat
> > > > >> > interval to find out that an update should be paused.
> > > > >> >
> > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko <
> > > maxim@apache.org>
> > > > >> wrote:
> > > > >> >
> > > > >> >> The main reason I preferred the lack-of-ACK approach
over an
> > > explicit
> > > > >> >> NACK one is simplicity. As Joshua pointed out there
is more
> state
> > > to
> > > > >> >> handle in that case. The lack-of-ACK model can be completely
> > > > >> >> implemented in volatile memory sidestepping the persistent
> > storage
> > > > >> >> entirely. With the NACK we would need to reliably persist
> > external
> > > > >> >> service call reasons to survive scheduler failovers.
Not a huge
> > > > >> >> challenge but something to keep in mind.
> > > > >> >>
> > > > >> >> I still think the simplicity/reliability tradeoff is
acceptable
> > > here
> > > > >> >> if we rely on external service to abort heartbeats in
case of a
> > > > health
> > > > >> >> alert fired. This can be explicitly documented as an
external
> > > > >> >> integration requirement. However, If the consensus is
to go a
> > more
> > > > >> >> reliable (though more complicated) NACK route I am happy
to
> > > > reconsider
> > > > >> >> the current proposal.
> > > > >> >>
> > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <
> > > > jcohen@twopensource.com>
> > > > >> >> wrote:
> > > > >> >> > "The heratbeatJobUpdate RPC serves as an ACK, but
we don't
> > have a
> > > > >> NACK.
> > > > >> >> If
> > > > >> >> > we are going to let lack-of-ACK serve as the NACK,
i don't
> > think
> > > > it's
> > > > >> >> safe
> > > > >> >> > to resume when we receive another ACK.  In other
words, a
> > service
> > > > >> >> toggling
> > > > >> >> > unhealthy might not be deemed safe to proceed."
> > > > >> >> >
> > > > >> >> > Lack-of-ACK is the scenario where connectivity
between the
> > > monitor
> > > > and
> > > > >> >> the
> > > > >> >> > scheduler is unavailable. Shouldn't the NACK scenario
> > (everything
> > > > is
> > > > >> not
> > > > >> >> > ok!) be handled by the monitoring service triggering
an
> > explicit
> > > > >> pause?
> > > > >> >> > I.e. section 2 should be updated to say "External
service
> > detects
> > > > >> service
> > > > >> >> > health problems and pauses the update" and section
4 becomes
> > the
> > > > >> current
> > > > >> >> > section 2 (i.e. "Should a heartbeat not be received
the
> > scheduler
> > > > >> pauses
> > > > >> >> > the update.").
> > > > >> >> >
> > > > >> >> > I agree that it's unsafe to to resume updates after
> receiving a
> > > > >> heartbeat
> > > > >> >> > after previously pausing due to a missed heartbeat.
In that
> > > > scenario
> > > > >> I'd
> > > > >> >> > think we'd want an explicit resumeJobUpdate. If
the scenario
> > > we're
> > > > >> trying
> > > > >> >> > to handle is *never* received a heartbeat, that's
a separate
> > > > matter,
> > > > >> in
> > > > >> >> > that case unpausing upon receiving the first heartbeat
would
> > make
> > > > >> sense,
> > > > >> >> > but it feels like that complicates things quite
a bit (now we
> > > need
> > > > to
> > > > >> >> > differentiate between heartbeat #1 and hearbeat
#N).
> > > > >> >> >
> > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <
> > wfarner@apache.org
> > > >
> > > > >> wrote:
> > > > >> >> >
> > > > >> >> >> What is the guidance for deploying while the
heartbeat
> service
> > > is
> > > > >> >> broken?
> > > > >> >> >> I think i know the answer, but it's important
to spell out.
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >> > Create a new coordinated job update in
a paused
> > > > >> (ROLL_FORWARD_PAUSED)
> > > > >> >> >> > state to avoid any progress until the
first heartbeat call
> > > > arrives.
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >> I'm not sold on this being ultimately beneficial.
 In the
> > worst
> > > > case,
> > > > >> >> >> impact is still limited by the health check
threshold.
> Seems
> > > like
> > > > >> >> >> premature optimization at best, and an odd
one if we proceed
> > > > without
> > > > >> a
> > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate RPC.
> > > > >> >> >>
> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat
update via
> a
> > > > >> >> >> > resumeJobUpdate call.
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >> Are heartbeats required while rolling back?
 If so, that
> might
> > > > impact
> > > > >> >> the
> > > > >> >> >> design here and in other places.
> > > > >> >> >>
> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat
update via
> a
> > > > fresh
> > > > >> >> >> > heartbeatJobUpdate call.
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >> The heratbeatJobUpdate RPC serves as an ACK,
but we don't
> > have a
> > > > >> NACK.
> > > > >> >> If
> > > > >> >> >> we are going to let lack-of-ACK serve as the
NACK, i don't
> > think
> > > > it's
> > > > >> >> safe
> > > > >> >> >> to resume when we receive another ACK.  In
other words, a
> > > service
> > > > >> >> toggling
> > > > >> >> >> unhealthy might not be deemed safe to proceed.
> > > > >> >> >>
> > > > >> >> >> Perhaps just sending OK (or a NOOP equivalent)
in case of a
> > > > >> user-paused
> > > > >> >> job
> > > > >> >> >> > update would make more sense as there
is nothing
> monitoring
> > > > service
> > > > >> >> could
> > > > >> >> >> > do in that case. This should work fine
with pause/resume
> > > > >> >> -aware/-agnostic
> > > > >> >> >> > monitoring service implementation.
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >> This seems reasonable to me - heartbeats for
a paused update
> > > > should
> > > > >> not
> > > > >> >> >> pose a risk, but can be safely ignored.
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >> -=Bill
> > > > >> >> >>
> > > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko
<
> > > > >> maxim@apache.org>
> > > > >> >> >> wrote:
> > > > >> >> >>
> > > > >> >> >> > Agreed. That would be a logical generalization
of the post
> > > > failover
> > > > >> >> >> > behavior.
> > > > >> >> >> >
> > > > >> >> >> > I have updated the above document with
the following
> > changes:
> > > > >> >> >> > - Reply with PAUSED any time a job was
paused by user;
> > > > >> >> >> > - Start in paused state by default.
> > > > >> >> >> >
> > > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin
Sweeney <
> > > > >> kevints@apache.org>
> > > > >> >> >> > wrote:
> > > > >> >> >> > > The doc mentioned that the scheduler
will start an
> update
> > > > >> subject to
> > > > >> >> >> the
> > > > >> >> >> > > heartbeat countdown, and if it doesn't
receive a
> heartbeat
> > > it
> > > > >> will
> > > > >> >> >> pause
> > > > >> >> >> > > the update. Why not start with the
update
> > > > >> >> paused-due-to-no-heartbeat to
> > > > >> >> >> > > fail-fast any connectivity issues
between the service
> > > > providing
> > > > >> the
> > > > >> >> >> > > heartbeats and the scheduler?
> > > > >> >> >> > >
> > > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM,
Maxim Khutornenko <
> > > > >> >> maxim@apache.org>
> > > > >> >> >> > > wrote:
> > > > >> >> >> > >
> > > > >> >> >> > >> Hi all,
> > > > >> >> >> > >>
> > > > >> >> >> > >> We are proposing a new feature
for the scheduler
> updater,
> > > > which
> > > > >> you
> > > > >> >> >> > >> may find helpful.
> > > > >> >> >> > >>
> > > > >> >> >> > >> I have posed a brief feature
summary here:
> > > > >> >> >> > >>
> > > > >> >> >> > >>
> > > > >> >> >> >
> > > > >> >> >>
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > > > >> >> >> > >>
> > > > >> >> >> > >> Please, reply with your feedback/concerns/comments.
> > > > >> >> >> > >>
> > > > >> >> >> > >> Thanks,
> > > > >> >> >> > >> Maxim
> > > > >> >> >> > >>
> > > > >> >> >> >
> > > > >> >> >>
> > > > >> >>
> > > > >>
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message