aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David McLaughlin <da...@dmclaughlin.com>
Subject Re: Proposal: External Update Coordination
Date Thu, 16 Oct 2014 01:36:24 GMT
+1 for pause being explicit RPC pauses, but does it really add complexity
to just add a new state (WAITING?) when no heartbeat is sent? Not being
able to see that an update was blocked because of a lack of heartbeat seems
like a missing feature.

On Wed, Oct 15, 2014 at 5:12 PM, Maxim Khutornenko <maxim@apache.org> wrote:

> +1. Updated the doc:
>
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
>
> On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner <wfarner@apache.org> wrote:
> > +1 to the scheduler not proceeding on an update when heartbeats are
> absent,
> > and requiring the heartbeat service to explicitly call pauseJobUpdate
> when
> > it detects problems.
> >
> > -=Bill
> >
> > On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney
> <ksweeney@twitter.com.invalid
> >> wrote:
> >
> >> Chatted with Maxim and Bill, I think we figured it out
> >>
> >> I think the confusion stems from the fact that there are two types of
> >> pauses in this system, explicit, persisted pauses generated by the
> >> pauseJobUpdate RPC and implicit, volatile pauses caused due to the
> absence
> >> of a sufficiently fresh heartbeat (such as in the case of a network
> >> partition).
> >>
> >> In case a monitoring service detects a problem it should call the
> explicit
> >> pauseJobUpdate RPC, which will cause a state change that requires an
> >> explicit resumeJobUpdate RPC to resume. That feature already exists.
> >>
> >> But, we need one more thing to make this reliable - heartbeats to
> protect
> >> against network partitions between the scheduler and the monitoring
> >> service. These can be volatile and lightweight - the scheduler just
> checks
> >> for a sufficiently fresh heartbeat before it performs an update action,
> and
> >> if none is present it simply refuses to perform the action. If the
> >> partition heals a new heartbeat will arrive (if the update being
> monitored
> >> should still be allowed to proceed) and the scheduler will allow the
> update
> >> to proceed.
> >>
> >>
> >> On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <wfarner@apache.org>
> wrote:
> >>
> >> > I think we should assess that after building the rest of the feature.
> >> IIUC
> >> > the rest of the code doesn't care if the update is initially paused.
> >> >
> >> > -=Bill
> >> >
> >> > On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko <maxim@apache.org
> >
> >> > wrote:
> >> >
> >> > > Can we get a consensus here? Looks like the only sticky point left
> is
> >> > > around starting an update in paused vs. non-paused state. I can
> argue
> >> > > either way as it's easy to add later if needed.
> >> > >
> >> > > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <wfarner@apache.org>
> >> wrote:
> >> > > > I'm not arguing against the merits of the approach.  Just feeling
> out
> >> > > > whether that should be done _after_ the rest of the heartbeat
> >> support.
> >> > > > Seems like it can be cleanly added at the end to get something
> usable
> >> > > > earlier.
> >> > > >
> >> > > > -=Bill
> >> > > >
> >> > > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney <
> kevints@apache.org>
> >> > > wrote:
> >> > > >
> >> > > >> I'm +1 for using lack of heartbeats as a uniform
> >> unknown-or-unhealthy
> >> > > >> signal, and punting on a more complex NACK signal (which
we'd
> have
> >> to
> >> > > >> reliably persist).
> >> > > >>
> >> > > >> I think the only disagreement in this thread is whether the
> default
> >> > > state
> >> > > >> for a new update should be running or waiting-for-heartbeat.
I
> think
> >> > > >> waiting for a heartbeat is not only a more correct implementation
> >> (no
> >> > > risk
> >> > > >> of acting after a failover but before the heartbeat timeout)
but
> >> > > simpler to
> >> > > >> implement (initialize the PulseMonitor data structure as
empty
> >> rather
> >> > > than
> >> > > >> with a synthetic heartbeat).
> >> > > >>
> >> > > >> From an API consumer perspective the sequence is:
> >> > > >>
> >> > > >> 1. API client sends a startUpdate RPC to the scheduler
> >> > > >> 2. API client receives an OK response, then arranges for
> something
> >> to
> >> > > call
> >> > > >> heartbeat with that updateId on some interval
> >> > > >> 3. Whatever is supposed to send heartbeats sends one immediately,
> >> then
> >> > > >> starts sending them on some smaller interval
> >> > > >>
> >> > > >> Waiting for the first heartbeat ensures that this sequence
has
> been
> >> > > >> completed successfully, while not waiting for it only ensure
that
> >> > step 1
> >> > > >> has happened.
> >> > > >>
> >> > > >>
> >> > > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <
> wfarner@apache.org>
> >> > > wrote:
> >> > > >>
> >> > > >> > Wait - simpler solution than what?  We're talking about
not
> doing
> >> > > either.
> >> > > >> >
> >> > > >> > -=Bill
> >> > > >> >
> >> > > >> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <
> >> kevints@apache.org
> >> > >
> >> > > >> > wrote:
> >> > > >> >
> >> > > >> > > I think waiting for the first heartbeat before
taking any
> action
> >> > is
> >> > > the
> >> > > >> > > simpler solution here as it allows the implementation
to be
> >> > entirely
> >> > > >> > > soft-state and still catches the bugs I described.
> >> > > >> > >
> >> > > >> > > The implementation is just PulseMonitorImpl<UpdateId>
-
> >> heartbeat
> >> > > calls
> >> > > >> > > pulse and mutation operations check isAlive. I
think the code
> >> > might
> >> > > >> > > actually work as-is.
> >> > > >> > >
> >> > > >> > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko
<
> >> > > maxim@apache.org>
> >> > > >> > > wrote:
> >> > > >> > >
> >> > > >> > > > Pausing update on creation seems like a logical
approach
> when
> >> > > dealing
> >> > > >> > > > with inverted dependency model. I.e. updater
is happy to
> act
> >> as
> >> > > long
> >> > > >> > > > as it's greenlighted by the external signal.
It's also
> aligned
> >> > > with a
> >> > > >> > > > failover experience where coordinated updates
are
> rehydrated
> >> in
> >> > > >> paused
> >> > > >> > > > state waiting for HB awakening. That said,
I am OK punting
> it
> >> > for
> >> > > the
> >> > > >> > > > sake of simplicity for now.
> >> > > >> > > >
> >> > > >> > > > Kevin?
> >> > > >> > > >
> >> > > >> > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner
<
> >> > wfarner@apache.org
> >> > > >
> >> > > >> > > wrote:
> >> > > >> > > > > If the goal is to reduce complexity now
and add features
> >> > later,
> >> > > why
> >> > > >> > not
> >> > > >> > > > > nuke both for now - kick off the update
right away, and
> let
> >> > > lack of
> >> > > >> > > > > heartbeats serve as a uniform "unknown
or unhealthy"
> signal?
> >> > > >> > > > >
> >> > > >> > > > > -=Bill
> >> > > >> > > > >
> >> > > >> > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim
Khutornenko <
> >> > > >> maxim@apache.org
> >> > > >> > >
> >> > > >> > > > wrote:
> >> > > >> > > > >
> >> > > >> > > > >> I am still +1 on the idea to have
default paused state
> on
> >> > > >> creation.
> >> > > >> > I
> >> > > >> > > > >> think we could still differentiate
between initially
> paused
> >> > and
> >> > > >> > timed
> >> > > >> > > > >> out states internally by looking
at pause reason. It's
> >> quite
> >> > > >> > different
> >> > > >> > > > >> if we want to store explicit NACK
reasons from the
> external
> >> > > >> service
> >> > > >> > > > >> though. That would require persistence
and a bit more
> >> > > complicated
> >> > > >> > > > >> logic.
> >> > > >> > > > >>
> >> > > >> > > > >> On Mon, Oct 13, 2014 at 5:15 PM,
Kevin Sweeney <
> >> > > >> kevints@apache.org>
> >> > > >> > > > wrote:
> >> > > >> > > > >> > I like the idea of implementing
this scheduler-side
> >> purely
> >> > > >> through
> >> > > >> > > > >> volatile
> >> > > >> > > > >> > state, but the lack of feedback
(generic vs specific
> >> error
> >> > > >> > messages
> >> > > >> > > > when
> >> > > >> > > > >> an
> >> > > >> > > > >> > update is paused) leaves something
to be desired.
> Maybe
> >> we
> >> > > can
> >> > > >> > > address
> >> > > >> > > > >> that
> >> > > >> > > > >> > with a metadata field in the
initial call to
> startUpdate
> >> > > (with
> >> > > >> an
> >> > > >> > > > >> optional
> >> > > >> > > > >> > link to a page where one can
get more rich information
> >> > about
> >> > > the
> >> > > >> > > > state of
> >> > > >> > > > >> > the monitor sending/not sending
heartbeats).
> >> > > >> > > > >> >
> >> > > >> > > > >> > The main drawback is that we
may have to wait a
> maximum
> >> of
> >> > > one
> >> > > >> > > > heartbeat
> >> > > >> > > > >> > interval to find out that an
update should be paused.
> >> > > >> > > > >> >
> >> > > >> > > > >> > On Mon, Oct 13, 2014 at 4:55
PM, Maxim Khutornenko <
> >> > > >> > > maxim@apache.org>
> >> > > >> > > > >> wrote:
> >> > > >> > > > >> >
> >> > > >> > > > >> >> The main reason I preferred
the lack-of-ACK approach
> >> over
> >> > an
> >> > > >> > > explicit
> >> > > >> > > > >> >> NACK one is simplicity.
As Joshua pointed out there
> is
> >> > more
> >> > > >> state
> >> > > >> > > to
> >> > > >> > > > >> >> handle in that case. The
lack-of-ACK model can be
> >> > completely
> >> > > >> > > > >> >> implemented in volatile
memory sidestepping the
> >> persistent
> >> > > >> > storage
> >> > > >> > > > >> >> entirely. With the NACK
we would need to reliably
> >> persist
> >> > > >> > external
> >> > > >> > > > >> >> service call reasons to
survive scheduler failovers.
> >> Not a
> >> > > huge
> >> > > >> > > > >> >> challenge but something
to keep in mind.
> >> > > >> > > > >> >>
> >> > > >> > > > >> >> I still think the simplicity/reliability
tradeoff is
> >> > > acceptable
> >> > > >> > > here
> >> > > >> > > > >> >> if we rely on external service
to abort heartbeats in
> >> case
> >> > > of a
> >> > > >> > > > health
> >> > > >> > > > >> >> alert fired. This can be
explicitly documented as an
> >> > > external
> >> > > >> > > > >> >> integration requirement.
However, If the consensus
> is to
> >> > go
> >> > > a
> >> > > >> > more
> >> > > >> > > > >> >> reliable (though more complicated)
NACK route I am
> happy
> >> > to
> >> > > >> > > > reconsider
> >> > > >> > > > >> >> the current proposal.
> >> > > >> > > > >> >>
> >> > > >> > > > >> >> On Mon, Oct 13, 2014 at
3:50 PM, Joshua Cohen <
> >> > > >> > > > jcohen@twopensource.com>
> >> > > >> > > > >> >> wrote:
> >> > > >> > > > >> >> > "The heratbeatJobUpdate
RPC serves as an ACK, but
> we
> >> > don't
> >> > > >> > have a
> >> > > >> > > > >> NACK.
> >> > > >> > > > >> >> If
> >> > > >> > > > >> >> > we are going to let
lack-of-ACK serve as the NACK,
> i
> >> > don't
> >> > > >> > think
> >> > > >> > > > it's
> >> > > >> > > > >> >> safe
> >> > > >> > > > >> >> > to resume when we receive
another ACK.  In other
> >> words,
> >> > a
> >> > > >> > service
> >> > > >> > > > >> >> toggling
> >> > > >> > > > >> >> > unhealthy might not
be deemed safe to proceed."
> >> > > >> > > > >> >> >
> >> > > >> > > > >> >> > Lack-of-ACK is the
scenario where connectivity
> between
> >> > the
> >> > > >> > > monitor
> >> > > >> > > > and
> >> > > >> > > > >> >> the
> >> > > >> > > > >> >> > scheduler is unavailable.
Shouldn't the NACK
> scenario
> >> > > >> > (everything
> >> > > >> > > > is
> >> > > >> > > > >> not
> >> > > >> > > > >> >> > ok!) be handled by
the monitoring service
> triggering
> >> an
> >> > > >> > explicit
> >> > > >> > > > >> pause?
> >> > > >> > > > >> >> > I.e. section 2 should
be updated to say "External
> >> > service
> >> > > >> > detects
> >> > > >> > > > >> service
> >> > > >> > > > >> >> > health problems and
pauses the update" and section
> 4
> >> > > becomes
> >> > > >> > the
> >> > > >> > > > >> current
> >> > > >> > > > >> >> > section 2 (i.e. "Should
a heartbeat not be received
> >> the
> >> > > >> > scheduler
> >> > > >> > > > >> pauses
> >> > > >> > > > >> >> > the update.").
> >> > > >> > > > >> >> >
> >> > > >> > > > >> >> > I agree that it's unsafe
to to resume updates after
> >> > > >> receiving a
> >> > > >> > > > >> heartbeat
> >> > > >> > > > >> >> > after previously pausing
due to a missed
> heartbeat. In
> >> > > that
> >> > > >> > > > scenario
> >> > > >> > > > >> I'd
> >> > > >> > > > >> >> > think we'd want an
explicit resumeJobUpdate. If the
> >> > > scenario
> >> > > >> > > we're
> >> > > >> > > > >> trying
> >> > > >> > > > >> >> > to handle is *never*
received a heartbeat, that's a
> >> > > separate
> >> > > >> > > > matter,
> >> > > >> > > > >> in
> >> > > >> > > > >> >> > that case unpausing
upon receiving the first
> heartbeat
> >> > > would
> >> > > >> > make
> >> > > >> > > > >> sense,
> >> > > >> > > > >> >> > but it feels like that
complicates things quite a
> bit
> >> > > (now we
> >> > > >> > > need
> >> > > >> > > > to
> >> > > >> > > > >> >> > differentiate between
heartbeat #1 and hearbeat
> #N).
> >> > > >> > > > >> >> >
> >> > > >> > > > >> >> > On Mon, Oct 13, 2014
at 2:50 PM, Bill Farner <
> >> > > >> > wfarner@apache.org
> >> > > >> > > >
> >> > > >> > > > >> wrote:
> >> > > >> > > > >> >> >
> >> > > >> > > > >> >> >> What is the guidance
for deploying while the
> >> heartbeat
> >> > > >> service
> >> > > >> > > is
> >> > > >> > > > >> >> broken?
> >> > > >> > > > >> >> >> I think i know
the answer, but it's important to
> >> spell
> >> > > out.
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> > Create a new
coordinated job update in a paused
> >> > > >> > > > >> (ROLL_FORWARD_PAUSED)
> >> > > >> > > > >> >> >> > state to avoid
any progress until the first
> >> heartbeat
> >> > > call
> >> > > >> > > > arrives.
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> I'm not sold on
this being ultimately
> beneficial.  In
> >> > the
> >> > > >> > worst
> >> > > >> > > > case,
> >> > > >> > > > >> >> >> impact is still
limited by the health check
> >> threshold.
> >> > > >> Seems
> >> > > >> > > like
> >> > > >> > > > >> >> >> premature optimization
at best, and an odd one if
> we
> >> > > proceed
> >> > > >> > > > without
> >> > > >> > > > >> a
> >> > > >> > > > >> >> >> 'NACK' signal via
the heartbeatJobUpdate RPC.
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> Allow resuming
of the paused-due-to-no-heartbeat
> >> update
> >> > > via
> >> > > >> a
> >> > > >> > > > >> >> >> > resumeJobUpdate
call.
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> Are heartbeats
required while rolling back?  If
> so,
> >> > that
> >> > > >> might
> >> > > >> > > > impact
> >> > > >> > > > >> >> the
> >> > > >> > > > >> >> >> design here and
in other places.
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> Allow resuming
of the paused-due-to-no-heartbeat
> >> update
> >> > > via
> >> > > >> a
> >> > > >> > > > fresh
> >> > > >> > > > >> >> >> > heartbeatJobUpdate
call.
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> The heratbeatJobUpdate
RPC serves as an ACK, but
> we
> >> > don't
> >> > > >> > have a
> >> > > >> > > > >> NACK.
> >> > > >> > > > >> >> If
> >> > > >> > > > >> >> >> we are going to
let lack-of-ACK serve as the
> NACK, i
> >> > > don't
> >> > > >> > think
> >> > > >> > > > it's
> >> > > >> > > > >> >> safe
> >> > > >> > > > >> >> >> to resume when
we receive another ACK.  In other
> >> > words, a
> >> > > >> > > service
> >> > > >> > > > >> >> toggling
> >> > > >> > > > >> >> >> unhealthy might
not be deemed safe to proceed.
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> Perhaps just sending
OK (or a NOOP equivalent) in
> >> case
> >> > > of a
> >> > > >> > > > >> user-paused
> >> > > >> > > > >> >> job
> >> > > >> > > > >> >> >> > update would
make more sense as there is nothing
> >> > > >> monitoring
> >> > > >> > > > service
> >> > > >> > > > >> >> could
> >> > > >> > > > >> >> >> > do in that
case. This should work fine with
> >> > > pause/resume
> >> > > >> > > > >> >> -aware/-agnostic
> >> > > >> > > > >> >> >> > monitoring
service implementation.
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> This seems reasonable
to me - heartbeats for a
> paused
> >> > > update
> >> > > >> > > > should
> >> > > >> > > > >> not
> >> > > >> > > > >> >> >> pose a risk, but
can be safely ignored.
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> -=Bill
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> On Mon, Oct 13,
2014 at 12:48 PM, Maxim
> Khutornenko <
> >> > > >> > > > >> maxim@apache.org>
> >> > > >> > > > >> >> >> wrote:
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >> >> > Agreed. That
would be a logical generalization
> of
> >> the
> >> > > post
> >> > > >> > > > failover
> >> > > >> > > > >> >> >> > behavior.
> >> > > >> > > > >> >> >> >
> >> > > >> > > > >> >> >> > I have updated
the above document with the
> >> following
> >> > > >> > changes:
> >> > > >> > > > >> >> >> > - Reply with
PAUSED any time a job was paused by
> >> > user;
> >> > > >> > > > >> >> >> > - Start in
paused state by default.
> >> > > >> > > > >> >> >> >
> >> > > >> > > > >> >> >> > On Mon, Oct
13, 2014 at 11:32 AM, Kevin Sweeney
> <
> >> > > >> > > > >> kevints@apache.org>
> >> > > >> > > > >> >> >> > wrote:
> >> > > >> > > > >> >> >> > > The doc
mentioned that the scheduler will
> start
> >> an
> >> > > >> update
> >> > > >> > > > >> subject to
> >> > > >> > > > >> >> >> the
> >> > > >> > > > >> >> >> > > heartbeat
countdown, and if it doesn't
> receive a
> >> > > >> heartbeat
> >> > > >> > > it
> >> > > >> > > > >> will
> >> > > >> > > > >> >> >> pause
> >> > > >> > > > >> >> >> > > the update.
Why not start with the update
> >> > > >> > > > >> >> paused-due-to-no-heartbeat
to
> >> > > >> > > > >> >> >> > > fail-fast
any connectivity issues between the
> >> > service
> >> > > >> > > > providing
> >> > > >> > > > >> the
> >> > > >> > > > >> >> >> > > heartbeats
and the scheduler?
> >> > > >> > > > >> >> >> > >
> >> > > >> > > > >> >> >> > > On Fri,
Oct 10, 2014 at 12:47 PM, Maxim
> >> > Khutornenko <
> >> > > >> > > > >> >> maxim@apache.org>
> >> > > >> > > > >> >> >> > > wrote:
> >> > > >> > > > >> >> >> > >
> >> > > >> > > > >> >> >> > >> Hi
all,
> >> > > >> > > > >> >> >> > >>
> >> > > >> > > > >> >> >> > >> We
are proposing a new feature for the
> scheduler
> >> > > >> updater,
> >> > > >> > > > which
> >> > > >> > > > >> you
> >> > > >> > > > >> >> >> > >> may
find helpful.
> >> > > >> > > > >> >> >> > >>
> >> > > >> > > > >> >> >> > >> I
have posed a brief feature summary here:
> >> > > >> > > > >> >> >> > >>
> >> > > >> > > > >> >> >> > >>
> >> > > >> > > > >> >> >> >
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >>
> >> > > >> > > > >>
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > >
> >> >
> >>
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> >> > > >> > > > >> >> >> > >>
> >> > > >> > > > >> >> >> > >> Please,
reply with your
> >> > feedback/concerns/comments.
> >> > > >> > > > >> >> >> > >>
> >> > > >> > > > >> >> >> > >> Thanks,
> >> > > >> > > > >> >> >> > >> Maxim
> >> > > >> > > > >> >> >> > >>
> >> > > >> > > > >> >> >> >
> >> > > >> > > > >> >> >>
> >> > > >> > > > >> >>
> >> > > >> > > > >>
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Kevin Sweeney
> >> @kts
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message