aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kevin Sweeney <kswee...@twitter.com.INVALID>
Subject Re: Proposal: External Update Coordination
Date Wed, 15 Oct 2014 23:59:55 GMT
Chatted with Maxim and Bill, I think we figured it out

I think the confusion stems from the fact that there are two types of
pauses in this system, explicit, persisted pauses generated by the
pauseJobUpdate RPC and implicit, volatile pauses caused due to the absence
of a sufficiently fresh heartbeat (such as in the case of a network
partition).

In case a monitoring service detects a problem it should call the explicit
pauseJobUpdate RPC, which will cause a state change that requires an
explicit resumeJobUpdate RPC to resume. That feature already exists.

But, we need one more thing to make this reliable - heartbeats to protect
against network partitions between the scheduler and the monitoring
service. These can be volatile and lightweight - the scheduler just checks
for a sufficiently fresh heartbeat before it performs an update action, and
if none is present it simply refuses to perform the action. If the
partition heals a new heartbeat will arrive (if the update being monitored
should still be allowed to proceed) and the scheduler will allow the update
to proceed.


On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <wfarner@apache.org> wrote:

> I think we should assess that after building the rest of the feature.  IIUC
> the rest of the code doesn't care if the update is initially paused.
>
> -=Bill
>
> On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko <maxim@apache.org>
> wrote:
>
> > Can we get a consensus here? Looks like the only sticky point left is
> > around starting an update in paused vs. non-paused state. I can argue
> > either way as it's easy to add later if needed.
> >
> > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <wfarner@apache.org> wrote:
> > > I'm not arguing against the merits of the approach.  Just feeling out
> > > whether that should be done _after_ the rest of the heartbeat support.
> > > Seems like it can be cleanly added at the end to get something usable
> > > earlier.
> > >
> > > -=Bill
> > >
> > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney <kevints@apache.org>
> > wrote:
> > >
> > >> I'm +1 for using lack of heartbeats as a uniform unknown-or-unhealthy
> > >> signal, and punting on a more complex NACK signal (which we'd have to
> > >> reliably persist).
> > >>
> > >> I think the only disagreement in this thread is whether the default
> > state
> > >> for a new update should be running or waiting-for-heartbeat. I think
> > >> waiting for a heartbeat is not only a more correct implementation (no
> > risk
> > >> of acting after a failover but before the heartbeat timeout) but
> > simpler to
> > >> implement (initialize the PulseMonitor data structure as empty rather
> > than
> > >> with a synthetic heartbeat).
> > >>
> > >> From an API consumer perspective the sequence is:
> > >>
> > >> 1. API client sends a startUpdate RPC to the scheduler
> > >> 2. API client receives an OK response, then arranges for something to
> > call
> > >> heartbeat with that updateId on some interval
> > >> 3. Whatever is supposed to send heartbeats sends one immediately, then
> > >> starts sending them on some smaller interval
> > >>
> > >> Waiting for the first heartbeat ensures that this sequence has been
> > >> completed successfully, while not waiting for it only ensure that
> step 1
> > >> has happened.
> > >>
> > >>
> > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <wfarner@apache.org>
> > wrote:
> > >>
> > >> > Wait - simpler solution than what?  We're talking about not doing
> > either.
> > >> >
> > >> > -=Bill
> > >> >
> > >> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <kevints@apache.org
> >
> > >> > wrote:
> > >> >
> > >> > > I think waiting for the first heartbeat before taking any action
> is
> > the
> > >> > > simpler solution here as it allows the implementation to be
> entirely
> > >> > > soft-state and still catches the bugs I described.
> > >> > >
> > >> > > The implementation is just PulseMonitorImpl<UpdateId> -
heartbeat
> > calls
> > >> > > pulse and mutation operations check isAlive. I think the code
> might
> > >> > > actually work as-is.
> > >> > >
> > >> > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko <
> > maxim@apache.org>
> > >> > > wrote:
> > >> > >
> > >> > > > Pausing update on creation seems like a logical approach
when
> > dealing
> > >> > > > with inverted dependency model. I.e. updater is happy to
act as
> > long
> > >> > > > as it's greenlighted by the external signal. It's also aligned
> > with a
> > >> > > > failover experience where coordinated updates are rehydrated
in
> > >> paused
> > >> > > > state waiting for HB awakening. That said, I am OK punting
it
> for
> > the
> > >> > > > sake of simplicity for now.
> > >> > > >
> > >> > > > Kevin?
> > >> > > >
> > >> > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <
> wfarner@apache.org
> > >
> > >> > > wrote:
> > >> > > > > If the goal is to reduce complexity now and add features
> later,
> > why
> > >> > not
> > >> > > > > nuke both for now - kick off the update right away,
and let
> > lack of
> > >> > > > > heartbeats serve as a uniform "unknown or unhealthy"
signal?
> > >> > > > >
> > >> > > > > -=Bill
> > >> > > > >
> > >> > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko
<
> > >> maxim@apache.org
> > >> > >
> > >> > > > wrote:
> > >> > > > >
> > >> > > > >> I am still +1 on the idea to have default paused
state on
> > >> creation.
> > >> > I
> > >> > > > >> think we could still differentiate between initially
paused
> and
> > >> > timed
> > >> > > > >> out states internally by looking at pause reason.
It's quite
> > >> > different
> > >> > > > >> if we want to store explicit NACK reasons from
the external
> > >> service
> > >> > > > >> though. That would require persistence and a bit
more
> > complicated
> > >> > > > >> logic.
> > >> > > > >>
> > >> > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney
<
> > >> kevints@apache.org>
> > >> > > > wrote:
> > >> > > > >> > I like the idea of implementing this scheduler-side
purely
> > >> through
> > >> > > > >> volatile
> > >> > > > >> > state, but the lack of feedback (generic vs
specific error
> > >> > messages
> > >> > > > when
> > >> > > > >> an
> > >> > > > >> > update is paused) leaves something to be desired.
Maybe we
> > can
> > >> > > address
> > >> > > > >> that
> > >> > > > >> > with a metadata field in the initial call
to startUpdate
> > (with
> > >> an
> > >> > > > >> optional
> > >> > > > >> > link to a page where one can get more rich
information
> about
> > the
> > >> > > > state of
> > >> > > > >> > the monitor sending/not sending heartbeats).
> > >> > > > >> >
> > >> > > > >> > The main drawback is that we may have to wait
a maximum of
> > one
> > >> > > > heartbeat
> > >> > > > >> > interval to find out that an update should
be paused.
> > >> > > > >> >
> > >> > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko
<
> > >> > > maxim@apache.org>
> > >> > > > >> wrote:
> > >> > > > >> >
> > >> > > > >> >> The main reason I preferred the lack-of-ACK
approach over
> an
> > >> > > explicit
> > >> > > > >> >> NACK one is simplicity. As Joshua pointed
out there is
> more
> > >> state
> > >> > > to
> > >> > > > >> >> handle in that case. The lack-of-ACK model
can be
> completely
> > >> > > > >> >> implemented in volatile memory sidestepping
the persistent
> > >> > storage
> > >> > > > >> >> entirely. With the NACK we would need
to reliably persist
> > >> > external
> > >> > > > >> >> service call reasons to survive scheduler
failovers. Not a
> > huge
> > >> > > > >> >> challenge but something to keep in mind.
> > >> > > > >> >>
> > >> > > > >> >> I still think the simplicity/reliability
tradeoff is
> > acceptable
> > >> > > here
> > >> > > > >> >> if we rely on external service to abort
heartbeats in case
> > of a
> > >> > > > health
> > >> > > > >> >> alert fired. This can be explicitly documented
as an
> > external
> > >> > > > >> >> integration requirement. However, If the
consensus is to
> go
> > a
> > >> > more
> > >> > > > >> >> reliable (though more complicated) NACK
route I am happy
> to
> > >> > > > reconsider
> > >> > > > >> >> the current proposal.
> > >> > > > >> >>
> > >> > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua
Cohen <
> > >> > > > jcohen@twopensource.com>
> > >> > > > >> >> wrote:
> > >> > > > >> >> > "The heratbeatJobUpdate RPC serves
as an ACK, but we
> don't
> > >> > have a
> > >> > > > >> NACK.
> > >> > > > >> >> If
> > >> > > > >> >> > we are going to let lack-of-ACK serve
as the NACK, i
> don't
> > >> > think
> > >> > > > it's
> > >> > > > >> >> safe
> > >> > > > >> >> > to resume when we receive another
ACK.  In other words,
> a
> > >> > service
> > >> > > > >> >> toggling
> > >> > > > >> >> > unhealthy might not be deemed safe
to proceed."
> > >> > > > >> >> >
> > >> > > > >> >> > Lack-of-ACK is the scenario where
connectivity between
> the
> > >> > > monitor
> > >> > > > and
> > >> > > > >> >> the
> > >> > > > >> >> > scheduler is unavailable. Shouldn't
the NACK scenario
> > >> > (everything
> > >> > > > is
> > >> > > > >> not
> > >> > > > >> >> > ok!) be handled by the monitoring
service triggering an
> > >> > explicit
> > >> > > > >> pause?
> > >> > > > >> >> > I.e. section 2 should be updated
to say "External
> service
> > >> > detects
> > >> > > > >> service
> > >> > > > >> >> > health problems and pauses the update"
and section 4
> > becomes
> > >> > the
> > >> > > > >> current
> > >> > > > >> >> > section 2 (i.e. "Should a heartbeat
not be received the
> > >> > scheduler
> > >> > > > >> pauses
> > >> > > > >> >> > the update.").
> > >> > > > >> >> >
> > >> > > > >> >> > I agree that it's unsafe to to resume
updates after
> > >> receiving a
> > >> > > > >> heartbeat
> > >> > > > >> >> > after previously pausing due to a
missed heartbeat. In
> > that
> > >> > > > scenario
> > >> > > > >> I'd
> > >> > > > >> >> > think we'd want an explicit resumeJobUpdate.
If the
> > scenario
> > >> > > we're
> > >> > > > >> trying
> > >> > > > >> >> > to handle is *never* received a heartbeat,
that's a
> > separate
> > >> > > > matter,
> > >> > > > >> in
> > >> > > > >> >> > that case unpausing upon receiving
the first heartbeat
> > would
> > >> > make
> > >> > > > >> sense,
> > >> > > > >> >> > but it feels like that complicates
things quite a bit
> > (now we
> > >> > > need
> > >> > > > to
> > >> > > > >> >> > differentiate between heartbeat #1
and hearbeat #N).
> > >> > > > >> >> >
> > >> > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM,
Bill Farner <
> > >> > wfarner@apache.org
> > >> > > >
> > >> > > > >> wrote:
> > >> > > > >> >> >
> > >> > > > >> >> >> What is the guidance for deploying
while the heartbeat
> > >> service
> > >> > > is
> > >> > > > >> >> broken?
> > >> > > > >> >> >> I think i know the answer, but
it's important to spell
> > out.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> > Create a new coordinated
job update in a paused
> > >> > > > >> (ROLL_FORWARD_PAUSED)
> > >> > > > >> >> >> > state to avoid any progress
until the first heartbeat
> > call
> > >> > > > arrives.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> I'm not sold on this being ultimately
beneficial.  In
> the
> > >> > worst
> > >> > > > case,
> > >> > > > >> >> >> impact is still limited by the
health check threshold.
> > >> Seems
> > >> > > like
> > >> > > > >> >> >> premature optimization at best,
and an odd one if we
> > proceed
> > >> > > > without
> > >> > > > >> a
> > >> > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate
RPC.
> > >> > > > >> >> >>
> > >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat
update
> > via
> > >> a
> > >> > > > >> >> >> > resumeJobUpdate call.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> Are heartbeats required while
rolling back?  If so,
> that
> > >> might
> > >> > > > impact
> > >> > > > >> >> the
> > >> > > > >> >> >> design here and in other places.
> > >> > > > >> >> >>
> > >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat
update
> > via
> > >> a
> > >> > > > fresh
> > >> > > > >> >> >> > heartbeatJobUpdate call.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> The heratbeatJobUpdate RPC serves
as an ACK, but we
> don't
> > >> > have a
> > >> > > > >> NACK.
> > >> > > > >> >> If
> > >> > > > >> >> >> we are going to let lack-of-ACK
serve as the NACK, i
> > don't
> > >> > think
> > >> > > > it's
> > >> > > > >> >> safe
> > >> > > > >> >> >> to resume when we receive another
ACK.  In other
> words, a
> > >> > > service
> > >> > > > >> >> toggling
> > >> > > > >> >> >> unhealthy might not be deemed
safe to proceed.
> > >> > > > >> >> >>
> > >> > > > >> >> >> Perhaps just sending OK (or a
NOOP equivalent) in case
> > of a
> > >> > > > >> user-paused
> > >> > > > >> >> job
> > >> > > > >> >> >> > update would make more sense
as there is nothing
> > >> monitoring
> > >> > > > service
> > >> > > > >> >> could
> > >> > > > >> >> >> > do in that case. This should
work fine with
> > pause/resume
> > >> > > > >> >> -aware/-agnostic
> > >> > > > >> >> >> > monitoring service implementation.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> This seems reasonable to me -
heartbeats for a paused
> > update
> > >> > > > should
> > >> > > > >> not
> > >> > > > >> >> >> pose a risk, but can be safely
ignored.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> -=Bill
> > >> > > > >> >> >>
> > >> > > > >> >> >> On Mon, Oct 13, 2014 at 12:48
PM, Maxim Khutornenko <
> > >> > > > >> maxim@apache.org>
> > >> > > > >> >> >> wrote:
> > >> > > > >> >> >>
> > >> > > > >> >> >> > Agreed. That would be a
logical generalization of the
> > post
> > >> > > > failover
> > >> > > > >> >> >> > behavior.
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > I have updated the above
document with the following
> > >> > changes:
> > >> > > > >> >> >> > - Reply with PAUSED any
time a job was paused by
> user;
> > >> > > > >> >> >> > - Start in paused state
by default.
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > On Mon, Oct 13, 2014 at
11:32 AM, Kevin Sweeney <
> > >> > > > >> kevints@apache.org>
> > >> > > > >> >> >> > wrote:
> > >> > > > >> >> >> > > The doc mentioned that
the scheduler will start an
> > >> update
> > >> > > > >> subject to
> > >> > > > >> >> >> the
> > >> > > > >> >> >> > > heartbeat countdown,
and if it doesn't receive a
> > >> heartbeat
> > >> > > it
> > >> > > > >> will
> > >> > > > >> >> >> pause
> > >> > > > >> >> >> > > the update. Why not
start with the update
> > >> > > > >> >> paused-due-to-no-heartbeat to
> > >> > > > >> >> >> > > fail-fast any connectivity
issues between the
> service
> > >> > > > providing
> > >> > > > >> the
> > >> > > > >> >> >> > > heartbeats and the
scheduler?
> > >> > > > >> >> >> > >
> > >> > > > >> >> >> > > On Fri, Oct 10, 2014
at 12:47 PM, Maxim
> Khutornenko <
> > >> > > > >> >> maxim@apache.org>
> > >> > > > >> >> >> > > wrote:
> > >> > > > >> >> >> > >
> > >> > > > >> >> >> > >> Hi all,
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> > >> We are proposing
a new feature for the scheduler
> > >> updater,
> > >> > > > which
> > >> > > > >> you
> > >> > > > >> >> >> > >> may find helpful.
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> > >> I have posed a
brief feature summary here:
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> >
> > >> > > > >> >> >>
> > >> > > > >> >>
> > >> > > > >>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> > >> Please, reply with
your
> feedback/concerns/comments.
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> > >> Thanks,
> > >> > > > >> >> >> > >> Maxim
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> >
> > >> > > > >> >> >>
> > >> > > > >> >>
> > >> > > > >>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>



-- 
Kevin Sweeney
@kts

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message