aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kevin Sweeney <kswee...@twitter.com.INVALID>
Subject Re: Proposal: External Update Coordination
Date Thu, 16 Oct 2014 17:52:07 GMT
I inferred that authentication was required due to the presence of a
SessionKey in the RPC. Of course any authentication mechanism here could
have serious scaling issues (barring something like HTTP basic auth in
memory)

On Thu, Oct 16, 2014 at 10:48 AM, Joshua Cohen <jcohen@twopensource.com>
wrote:

> What are our thoughts about authentication with regards to heartbeats? It
> seems like they should be authenticated since there does exist the
> potential for a malicious actor to send its own heartbeats even if the real
> monitoring service has detected a problem and ceased sending heartbeats.
> I'm not sure exactly how large the attack surface is (if the service is
> truly down the scheduler would detect that and roll back the update
> regardless), but I think it's worth discussing as we work on the initial
> design.
>
> On Wed, Oct 15, 2014 at 7:49 PM, Bill Farner <wfarner@apache.org> wrote:
>
> > David - the plan is to synthesize the waiting state.  Exactly how is not
> > yet certain.
> >
> > On Wednesday, October 15, 2014, Maxim Khutornenko <maxim@apache.org>
> > wrote:
> >
> > > It is certainly possible to add new state or a status message but I
> > > don't think it's a blocker for the first iteration. Provided there is
> > > enough demand a state/message could be synthesized during the 'get'
> > > call based on the volatile state.
> > >
> > > On Wed, Oct 15, 2014 at 6:36 PM, David McLaughlin <
> david@dmclaughlin.com
> > > <javascript:;>> wrote:
> > > > +1 for pause being explicit RPC pauses, but does it really add
> > complexity
> > > > to just add a new state (WAITING?) when no heartbeat is sent? Not
> being
> > > > able to see that an update was blocked because of a lack of heartbeat
> > > seems
> > > > like a missing feature.
> > > >
> > > > On Wed, Oct 15, 2014 at 5:12 PM, Maxim Khutornenko <maxim@apache.org
> > > <javascript:;>> wrote:
> > > >
> > > >> +1. Updated the doc:
> > > >>
> > > >>
> > >
> >
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > > >>
> > > >> On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner <wfarner@apache.org
> > > <javascript:;>> wrote:
> > > >> > +1 to the scheduler not proceeding on an update when heartbeats
> are
> > > >> absent,
> > > >> > and requiring the heartbeat service to explicitly call
> > pauseJobUpdate
> > > >> when
> > > >> > it detects problems.
> > > >> >
> > > >> > -=Bill
> > > >> >
> > > >> > On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney
> > > >> <ksweeney@twitter.com.invalid
> > > >> >> wrote:
> > > >> >
> > > >> >> Chatted with Maxim and Bill, I think we figured it out
> > > >> >>
> > > >> >> I think the confusion stems from the fact that there are
two
> types
> > of
> > > >> >> pauses in this system, explicit, persisted pauses generated
by
> the
> > > >> >> pauseJobUpdate RPC and implicit, volatile pauses caused due
to
> the
> > > >> absence
> > > >> >> of a sufficiently fresh heartbeat (such as in the case of
a
> network
> > > >> >> partition).
> > > >> >>
> > > >> >> In case a monitoring service detects a problem it should
call the
> > > >> explicit
> > > >> >> pauseJobUpdate RPC, which will cause a state change that
requires
> > an
> > > >> >> explicit resumeJobUpdate RPC to resume. That feature already
> > exists.
> > > >> >>
> > > >> >> But, we need one more thing to make this reliable - heartbeats
to
> > > >> protect
> > > >> >> against network partitions between the scheduler and the
> monitoring
> > > >> >> service. These can be volatile and lightweight - the scheduler
> just
> > > >> checks
> > > >> >> for a sufficiently fresh heartbeat before it performs an
update
> > > action,
> > > >> and
> > > >> >> if none is present it simply refuses to perform the action.
If
> the
> > > >> >> partition heals a new heartbeat will arrive (if the update
being
> > > >> monitored
> > > >> >> should still be allowed to proceed) and the scheduler will
allow
> > the
> > > >> update
> > > >> >> to proceed.
> > > >> >>
> > > >> >>
> > > >> >> On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <
> wfarner@apache.org
> > > <javascript:;>>
> > > >> wrote:
> > > >> >>
> > > >> >> > I think we should assess that after building the rest
of the
> > > feature.
> > > >> >> IIUC
> > > >> >> > the rest of the code doesn't care if the update is initially
> > > paused.
> > > >> >> >
> > > >> >> > -=Bill
> > > >> >> >
> > > >> >> > On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko
<
> > > maxim@apache.org <javascript:;>
> > > >> >
> > > >> >> > wrote:
> > > >> >> >
> > > >> >> > > Can we get a consensus here? Looks like the only
sticky point
> > > left
> > > >> is
> > > >> >> > > around starting an update in paused vs. non-paused
state. I
> can
> > > >> argue
> > > >> >> > > either way as it's easy to add later if needed.
> > > >> >> > >
> > > >> >> > > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <
> > wfarner@apache.org
> > > <javascript:;>>
> > > >> >> wrote:
> > > >> >> > > > I'm not arguing against the merits of the
approach.  Just
> > > feeling
> > > >> out
> > > >> >> > > > whether that should be done _after_ the rest
of the
> heartbeat
> > > >> >> support.
> > > >> >> > > > Seems like it can be cleanly added at the
end to get
> > something
> > > >> usable
> > > >> >> > > > earlier.
> > > >> >> > > >
> > > >> >> > > > -=Bill
> > > >> >> > > >
> > > >> >> > > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney
<
> > > >> kevints@apache.org <javascript:;>>
> > > >> >> > > wrote:
> > > >> >> > > >
> > > >> >> > > >> I'm +1 for using lack of heartbeats as
a uniform
> > > >> >> unknown-or-unhealthy
> > > >> >> > > >> signal, and punting on a more complex
NACK signal (which
> > we'd
> > > >> have
> > > >> >> to
> > > >> >> > > >> reliably persist).
> > > >> >> > > >>
> > > >> >> > > >> I think the only disagreement in this
thread is whether
> the
> > > >> default
> > > >> >> > > state
> > > >> >> > > >> for a new update should be running or
> > waiting-for-heartbeat. I
> > > >> think
> > > >> >> > > >> waiting for a heartbeat is not only a
more correct
> > > implementation
> > > >> >> (no
> > > >> >> > > risk
> > > >> >> > > >> of acting after a failover but before
the heartbeat
> timeout)
> > > but
> > > >> >> > > simpler to
> > > >> >> > > >> implement (initialize the PulseMonitor
data structure as
> > empty
> > > >> >> rather
> > > >> >> > > than
> > > >> >> > > >> with a synthetic heartbeat).
> > > >> >> > > >>
> > > >> >> > > >> From an API consumer perspective the sequence
is:
> > > >> >> > > >>
> > > >> >> > > >> 1. API client sends a startUpdate RPC
to the scheduler
> > > >> >> > > >> 2. API client receives an OK response,
then arranges for
> > > >> something
> > > >> >> to
> > > >> >> > > call
> > > >> >> > > >> heartbeat with that updateId on some interval
> > > >> >> > > >> 3. Whatever is supposed to send heartbeats
sends one
> > > immediately,
> > > >> >> then
> > > >> >> > > >> starts sending them on some smaller interval
> > > >> >> > > >>
> > > >> >> > > >> Waiting for the first heartbeat ensures
that this sequence
> > has
> > > >> been
> > > >> >> > > >> completed successfully, while not waiting
for it only
> ensure
> > > that
> > > >> >> > step 1
> > > >> >> > > >> has happened.
> > > >> >> > > >>
> > > >> >> > > >>
> > > >> >> > > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill
Farner <
> > > >> wfarner@apache.org <javascript:;>>
> > > >> >> > > wrote:
> > > >> >> > > >>
> > > >> >> > > >> > Wait - simpler solution than what?
 We're talking about
> > not
> > > >> doing
> > > >> >> > > either.
> > > >> >> > > >> >
> > > >> >> > > >> > -=Bill
> > > >> >> > > >> >
> > > >> >> > > >> > On Tue, Oct 14, 2014 at 12:16 PM,
Kevin Sweeney <
> > > >> >> kevints@apache.org <javascript:;>
> > > >> >> > >
> > > >> >> > > >> > wrote:
> > > >> >> > > >> >
> > > >> >> > > >> > > I think waiting for the first
heartbeat before taking
> > any
> > > >> action
> > > >> >> > is
> > > >> >> > > the
> > > >> >> > > >> > > simpler solution here as it
allows the implementation
> to
> > > be
> > > >> >> > entirely
> > > >> >> > > >> > > soft-state and still catches
the bugs I described.
> > > >> >> > > >> > >
> > > >> >> > > >> > > The implementation is just PulseMonitorImpl<UpdateId>
> -
> > > >> >> heartbeat
> > > >> >> > > calls
> > > >> >> > > >> > > pulse and mutation operations
check isAlive. I think
> the
> > > code
> > > >> >> > might
> > > >> >> > > >> > > actually work as-is.
> > > >> >> > > >> > >
> > > >> >> > > >> > > On Tue, Oct 14, 2014 at 12:11
PM, Maxim Khutornenko <
> > > >> >> > > maxim@apache.org <javascript:;>>
> > > >> >> > > >> > > wrote:
> > > >> >> > > >> > >
> > > >> >> > > >> > > > Pausing update on creation
seems like a logical
> > approach
> > > >> when
> > > >> >> > > dealing
> > > >> >> > > >> > > > with inverted dependency
model. I.e. updater is
> happy
> > to
> > > >> act
> > > >> >> as
> > > >> >> > > long
> > > >> >> > > >> > > > as it's greenlighted by
the external signal. It's
> also
> > > >> aligned
> > > >> >> > > with a
> > > >> >> > > >> > > > failover experience where
coordinated updates are
> > > >> rehydrated
> > > >> >> in
> > > >> >> > > >> paused
> > > >> >> > > >> > > > state waiting for HB awakening.
That said, I am OK
> > > punting
> > > >> it
> > > >> >> > for
> > > >> >> > > the
> > > >> >> > > >> > > > sake of simplicity for
now.
> > > >> >> > > >> > > >
> > > >> >> > > >> > > > Kevin?
> > > >> >> > > >> > > >
> > > >> >> > > >> > > > On Tue, Oct 14, 2014 at
12:05 PM, Bill Farner <
> > > >> >> > wfarner@apache.org <javascript:;>
> > > >> >> > > >
> > > >> >> > > >> > > wrote:
> > > >> >> > > >> > > > > If the goal is to
reduce complexity now and add
> > > features
> > > >> >> > later,
> > > >> >> > > why
> > > >> >> > > >> > not
> > > >> >> > > >> > > > > nuke both for now
- kick off the update right
> away,
> > > and
> > > >> let
> > > >> >> > > lack of
> > > >> >> > > >> > > > > heartbeats serve as
a uniform "unknown or
> unhealthy"
> > > >> signal?
> > > >> >> > > >> > > > >
> > > >> >> > > >> > > > > -=Bill
> > > >> >> > > >> > > > >
> > > >> >> > > >> > > > > On Mon, Oct 13, 2014
at 5:25 PM, Maxim
> Khutornenko <
> > > >> >> > > >> maxim@apache.org <javascript:;>
> > > >> >> > > >> > >
> > > >> >> > > >> > > > wrote:
> > > >> >> > > >> > > > >
> > > >> >> > > >> > > > >> I am still +1
on the idea to have default paused
> > > state
> > > >> on
> > > >> >> > > >> creation.
> > > >> >> > > >> > I
> > > >> >> > > >> > > > >> think we could
still differentiate between
> > initially
> > > >> paused
> > > >> >> > and
> > > >> >> > > >> > timed
> > > >> >> > > >> > > > >> out states internally
by looking at pause reason.
> > > It's
> > > >> >> quite
> > > >> >> > > >> > different
> > > >> >> > > >> > > > >> if we want to
store explicit NACK reasons from
> the
> > > >> external
> > > >> >> > > >> service
> > > >> >> > > >> > > > >> though. That would
require persistence and a bit
> > more
> > > >> >> > > complicated
> > > >> >> > > >> > > > >> logic.
> > > >> >> > > >> > > > >>
> > > >> >> > > >> > > > >> On Mon, Oct 13,
2014 at 5:15 PM, Kevin Sweeney <
> > > >> >> > > >> kevints@apache.org <javascript:;>>
> > > >> >> > > >> > > > wrote:
> > > >> >> > > >> > > > >> > I like the
idea of implementing this
> > scheduler-side
> > > >> >> purely
> > > >> >> > > >> through
> > > >> >> > > >> > > > >> volatile
> > > >> >> > > >> > > > >> > state, but
the lack of feedback (generic vs
> > > specific
> > > >> >> error
> > > >> >> > > >> > messages
> > > >> >> > > >> > > > when
> > > >> >> > > >> > > > >> an
> > > >> >> > > >> > > > >> > update is
paused) leaves something to be
> desired.
> > > >> Maybe
> > > >> >> we
> > > >> >> > > can
> > > >> >> > > >> > > address
> > > >> >> > > >> > > > >> that
> > > >> >> > > >> > > > >> > with a metadata
field in the initial call to
> > > >> startUpdate
> > > >> >> > > (with
> > > >> >> > > >> an
> > > >> >> > > >> > > > >> optional
> > > >> >> > > >> > > > >> > link to a
page where one can get more rich
> > > information
> > > >> >> > about
> > > >> >> > > the
> > > >> >> > > >> > > > state of
> > > >> >> > > >> > > > >> > the monitor
sending/not sending heartbeats).
> > > >> >> > > >> > > > >> >
> > > >> >> > > >> > > > >> > The main
drawback is that we may have to wait a
> > > >> maximum
> > > >> >> of
> > > >> >> > > one
> > > >> >> > > >> > > > heartbeat
> > > >> >> > > >> > > > >> > interval
to find out that an update should be
> > > paused.
> > > >> >> > > >> > > > >> >
> > > >> >> > > >> > > > >> > On Mon, Oct
13, 2014 at 4:55 PM, Maxim
> > Khutornenko
> > > <
> > > >> >> > > >> > > maxim@apache.org <javascript:;>>
> > > >> >> > > >> > > > >> wrote:
> > > >> >> > > >> > > > >> >
> > > >> >> > > >> > > > >> >> The main
reason I preferred the lack-of-ACK
> > > approach
> > > >> >> over
> > > >> >> > an
> > > >> >> > > >> > > explicit
> > > >> >> > > >> > > > >> >> NACK
one is simplicity. As Joshua pointed out
> > > there
> > > >> is
> > > >> >> > more
> > > >> >> > > >> state
> > > >> >> > > >> > > to
> > > >> >> > > >> > > > >> >> handle
in that case. The lack-of-ACK model can
> > be
> > > >> >> > completely
> > > >> >> > > >> > > > >> >> implemented
in volatile memory sidestepping
> the
> > > >> >> persistent
> > > >> >> > > >> > storage
> > > >> >> > > >> > > > >> >> entirely.
With the NACK we would need to
> > reliably
> > > >> >> persist
> > > >> >> > > >> > external
> > > >> >> > > >> > > > >> >> service
call reasons to survive scheduler
> > > failovers.
> > > >> >> Not a
> > > >> >> > > huge
> > > >> >> > > >> > > > >> >> challenge
but something to keep in mind.
> > > >> >> > > >> > > > >> >>
> > > >> >> > > >> > > > >> >> I still
think the simplicity/reliability
> > tradeoff
> > > is
> > > >> >> > > acceptable
> > > >> >> > > >> > > here
> > > >> >> > > >> > > > >> >> if we
rely on external service to abort
> > > heartbeats in
> > > >> >> case
> > > >> >> > > of a
> > > >> >> > > >> > > > health
> > > >> >> > > >> > > > >> >> alert
fired. This can be explicitly documented
> > as
> > > an
> > > >> >> > > external
> > > >> >> > > >> > > > >> >> integration
requirement. However, If the
> > consensus
> > > >> is to
> > > >> >> > go
> > > >> >> > > a
> > > >> >> > > >> > more
> > > >> >> > > >> > > > >> >> reliable
(though more complicated) NACK route
> I
> > am
> > > >> happy
> > > >> >> > to
> > > >> >> > > >> > > > reconsider
> > > >> >> > > >> > > > >> >> the current
proposal.
> > > >> >> > > >> > > > >> >>
> > > >> >> > > >> > > > >> >> On Mon,
Oct 13, 2014 at 3:50 PM, Joshua Cohen
> <
> > > >> >> > > >> > > > jcohen@twopensource.com
<javascript:;>>
> > > >> >> > > >> > > > >> >> wrote:
> > > >> >> > > >> > > > >> >> >
"The heratbeatJobUpdate RPC serves as an
> ACK,
> > > but
> > > >> we
> > > >> >> > don't
> > > >> >> > > >> > have a
> > > >> >> > > >> > > > >> NACK.
> > > >> >> > > >> > > > >> >> If
> > > >> >> > > >> > > > >> >> >
we are going to let lack-of-ACK serve as the
> > > NACK,
> > > >> i
> > > >> >> > don't
> > > >> >> > > >> > think
> > > >> >> > > >> > > > it's
> > > >> >> > > >> > > > >> >> safe
> > > >> >> > > >> > > > >> >> >
to resume when we receive another ACK.  In
> > other
> > > >> >> words,
> > > >> >> > a
> > > >> >> > > >> > service
> > > >> >> > > >> > > > >> >> toggling
> > > >> >> > > >> > > > >> >> >
unhealthy might not be deemed safe to
> > proceed."
> > > >> >> > > >> > > > >> >> >
> > > >> >> > > >> > > > >> >> >
Lack-of-ACK is the scenario where
> connectivity
> > > >> between
> > > >> >> > the
> > > >> >> > > >> > > monitor
> > > >> >> > > >> > > > and
> > > >> >> > > >> > > > >> >> the
> > > >> >> > > >> > > > >> >> >
scheduler is unavailable. Shouldn't the NACK
> > > >> scenario
> > > >> >> > > >> > (everything
> > > >> >> > > >> > > > is
> > > >> >> > > >> > > > >> not
> > > >> >> > > >> > > > >> >> >
ok!) be handled by the monitoring service
> > > >> triggering
> > > >> >> an
> > > >> >> > > >> > explicit
> > > >> >> > > >> > > > >> pause?
> > > >> >> > > >> > > > >> >> >
I.e. section 2 should be updated to say
> > > "External
> > > >> >> > service
> > > >> >> > > >> > detects
> > > >> >> > > >> > > > >> service
> > > >> >> > > >> > > > >> >> >
health problems and pauses the update" and
> > > section
> > > >> 4
> > > >> >> > > becomes
> > > >> >> > > >> > the
> > > >> >> > > >> > > > >> current
> > > >> >> > > >> > > > >> >> >
section 2 (i.e. "Should a heartbeat not be
> > > received
> > > >> >> the
> > > >> >> > > >> > scheduler
> > > >> >> > > >> > > > >> pauses
> > > >> >> > > >> > > > >> >> >
the update.").
> > > >> >> > > >> > > > >> >> >
> > > >> >> > > >> > > > >> >> >
I agree that it's unsafe to to resume
> updates
> > > after
> > > >> >> > > >> receiving a
> > > >> >> > > >> > > > >> heartbeat
> > > >> >> > > >> > > > >> >> >
after previously pausing due to a missed
> > > >> heartbeat. In
> > > >> >> > > that
> > > >> >> > > >> > > > scenario
> > > >> >> > > >> > > > >> I'd
> > > >> >> > > >> > > > >> >> >
think we'd want an explicit resumeJobUpdate.
> > If
> > > the
> > > >> >> > > scenario
> > > >> >> > > >> > > we're
> > > >> >> > > >> > > > >> trying
> > > >> >> > > >> > > > >> >> >
to handle is *never* received a heartbeat,
> > > that's a
> > > >> >> > > separate
> > > >> >> > > >> > > > matter,
> > > >> >> > > >> > > > >> in
> > > >> >> > > >> > > > >> >> >
that case unpausing upon receiving the first
> > > >> heartbeat
> > > >> >> > > would
> > > >> >> > > >> > make
> > > >> >> > > >> > > > >> sense,
> > > >> >> > > >> > > > >> >> >
but it feels like that complicates things
> > quite
> > > a
> > > >> bit
> > > >> >> > > (now we
> > > >> >> > > >> > > need
> > > >> >> > > >> > > > to
> > > >> >> > > >> > > > >> >> >
differentiate between heartbeat #1 and
> > hearbeat
> > > >> #N).
> > > >> >> > > >> > > > >> >> >
> > > >> >> > > >> > > > >> >> >
On Mon, Oct 13, 2014 at 2:50 PM, Bill
> Farner <
> > > >> >> > > >> > wfarner@apache.org <javascript:;>
> > > >> >> > > >> > > >
> > > >> >> > > >> > > > >> wrote:
> > > >> >> > > >> > > > >> >> >
> > > >> >> > > >> > > > >> >> >>
What is the guidance for deploying while
> the
> > > >> >> heartbeat
> > > >> >> > > >> service
> > > >> >> > > >> > > is
> > > >> >> > > >> > > > >> >> broken?
> > > >> >> > > >> > > > >> >> >>
I think i know the answer, but it's
> important
> > > to
> > > >> >> spell
> > > >> >> > > out.
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
> Create a new coordinated job update in a
> > > paused
> > > >> >> > > >> > > > >> (ROLL_FORWARD_PAUSED)
> > > >> >> > > >> > > > >> >> >>
> state to avoid any progress until the
> first
> > > >> >> heartbeat
> > > >> >> > > call
> > > >> >> > > >> > > > arrives.
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
I'm not sold on this being ultimately
> > > >> beneficial.  In
> > > >> >> > the
> > > >> >> > > >> > worst
> > > >> >> > > >> > > > case,
> > > >> >> > > >> > > > >> >> >>
impact is still limited by the health check
> > > >> >> threshold.
> > > >> >> > > >> Seems
> > > >> >> > > >> > > like
> > > >> >> > > >> > > > >> >> >>
premature optimization at best, and an odd
> > one
> > > if
> > > >> we
> > > >> >> > > proceed
> > > >> >> > > >> > > > without
> > > >> >> > > >> > > > >> a
> > > >> >> > > >> > > > >> >> >>
'NACK' signal via the heartbeatJobUpdate
> RPC.
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
Allow resuming of the
> > > paused-due-to-no-heartbeat
> > > >> >> update
> > > >> >> > > via
> > > >> >> > > >> a
> > > >> >> > > >> > > > >> >> >>
> resumeJobUpdate call.
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
Are heartbeats required while rolling back?
> > If
> > > >> so,
> > > >> >> > that
> > > >> >> > > >> might
> > > >> >> > > >> > > > impact
> > > >> >> > > >> > > > >> >> the
> > > >> >> > > >> > > > >> >> >>
design here and in other places.
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
Allow resuming of the
> > > paused-due-to-no-heartbeat
> > > >> >> update
> > > >> >> > > via
> > > >> >> > > >> a
> > > >> >> > > >> > > > fresh
> > > >> >> > > >> > > > >> >> >>
> heartbeatJobUpdate call.
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
The heratbeatJobUpdate RPC serves as an
> ACK,
> > > but
> > > >> we
> > > >> >> > don't
> > > >> >> > > >> > have a
> > > >> >> > > >> > > > >> NACK.
> > > >> >> > > >> > > > >> >> If
> > > >> >> > > >> > > > >> >> >>
we are going to let lack-of-ACK serve as
> the
> > > >> NACK, i
> > > >> >> > > don't
> > > >> >> > > >> > think
> > > >> >> > > >> > > > it's
> > > >> >> > > >> > > > >> >> safe
> > > >> >> > > >> > > > >> >> >>
to resume when we receive another ACK.  In
> > > other
> > > >> >> > words, a
> > > >> >> > > >> > > service
> > > >> >> > > >> > > > >> >> toggling
> > > >> >> > > >> > > > >> >> >>
unhealthy might not be deemed safe to
> > proceed.
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
Perhaps just sending OK (or a NOOP
> > equivalent)
> > > in
> > > >> >> case
> > > >> >> > > of a
> > > >> >> > > >> > > > >> user-paused
> > > >> >> > > >> > > > >> >> job
> > > >> >> > > >> > > > >> >> >>
> update would make more sense as there is
> > > nothing
> > > >> >> > > >> monitoring
> > > >> >> > > >> > > > service
> > > >> >> > > >> > > > >> >> could
> > > >> >> > > >> > > > >> >> >>
> do in that case. This should work fine
> with
> > > >> >> > > pause/resume
> > > >> >> > > >> > > > >> >> -aware/-agnostic
> > > >> >> > > >> > > > >> >> >>
> monitoring service implementation.
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
This seems reasonable to me - heartbeats
> for
> > a
> > > >> paused
> > > >> >> > > update
> > > >> >> > > >> > > > should
> > > >> >> > > >> > > > >> not
> > > >> >> > > >> > > > >> >> >>
pose a risk, but can be safely ignored.
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
-=Bill
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
On Mon, Oct 13, 2014 at 12:48 PM, Maxim
> > > >> Khutornenko <
> > > >> >> > > >> > > > >> maxim@apache.org
<javascript:;>>
> > > >> >> > > >> > > > >> >> >>
wrote:
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >> >>
> Agreed. That would be a logical
> > > generalization
> > > >> of
> > > >> >> the
> > > >> >> > > post
> > > >> >> > > >> > > > failover
> > > >> >> > > >> > > > >> >> >>
> behavior.
> > > >> >> > > >> > > > >> >> >>
>
> > > >> >> > > >> > > > >> >> >>
> I have updated the above document with
> the
> > > >> >> following
> > > >> >> > > >> > changes:
> > > >> >> > > >> > > > >> >> >>
> - Reply with PAUSED any time a job was
> > > paused by
> > > >> >> > user;
> > > >> >> > > >> > > > >> >> >>
> - Start in paused state by default.
> > > >> >> > > >> > > > >> >> >>
>
> > > >> >> > > >> > > > >> >> >>
> On Mon, Oct 13, 2014 at 11:32 AM, Kevin
> > > Sweeney
> > > >> <
> > > >> >> > > >> > > > >> kevints@apache.org
<javascript:;>>
> > > >> >> > > >> > > > >> >> >>
> wrote:
> > > >> >> > > >> > > > >> >> >>
> > The doc mentioned that the scheduler
> will
> > > >> start
> > > >> >> an
> > > >> >> > > >> update
> > > >> >> > > >> > > > >> subject to
> > > >> >> > > >> > > > >> >> >>
the
> > > >> >> > > >> > > > >> >> >>
> > heartbeat countdown, and if it doesn't
> > > >> receive a
> > > >> >> > > >> heartbeat
> > > >> >> > > >> > > it
> > > >> >> > > >> > > > >> will
> > > >> >> > > >> > > > >> >> >>
pause
> > > >> >> > > >> > > > >> >> >>
> > the update. Why not start with the
> update
> > > >> >> > > >> > > > >> >> paused-due-to-no-heartbeat
to
> > > >> >> > > >> > > > >> >> >>
> > fail-fast any connectivity issues
> between
> > > the
> > > >> >> > service
> > > >> >> > > >> > > > providing
> > > >> >> > > >> > > > >> the
> > > >> >> > > >> > > > >> >> >>
> > heartbeats and the scheduler?
> > > >> >> > > >> > > > >> >> >>
> >
> > > >> >> > > >> > > > >> >> >>
> > On Fri, Oct 10, 2014 at 12:47 PM, Maxim
> > > >> >> > Khutornenko <
> > > >> >> > > >> > > > >> >> maxim@apache.org
<javascript:;>>
> > > >> >> > > >> > > > >> >> >>
> > wrote:
> > > >> >> > > >> > > > >> >> >>
> >
> > > >> >> > > >> > > > >> >> >>
> >> Hi all,
> > > >> >> > > >> > > > >> >> >>
> >>
> > > >> >> > > >> > > > >> >> >>
> >> We are proposing a new feature for the
> > > >> scheduler
> > > >> >> > > >> updater,
> > > >> >> > > >> > > > which
> > > >> >> > > >> > > > >> you
> > > >> >> > > >> > > > >> >> >>
> >> may find helpful.
> > > >> >> > > >> > > > >> >> >>
> >>
> > > >> >> > > >> > > > >> >> >>
> >> I have posed a brief feature summary
> > here:
> > > >> >> > > >> > > > >> >> >>
> >>
> > > >> >> > > >> > > > >> >> >>
> >>
> > > >> >> > > >> > > > >> >> >>
>
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >>
> > > >> >> > > >> > > > >>
> > > >> >> > > >> > > >
> > > >> >> > > >> > >
> > > >> >> > > >> >
> > > >> >> > > >>
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> >
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > > >> >> > > >> > > > >> >> >>
> >>
> > > >> >> > > >> > > > >> >> >>
> >> Please, reply with your
> > > >> >> > feedback/concerns/comments.
> > > >> >> > > >> > > > >> >> >>
> >>
> > > >> >> > > >> > > > >> >> >>
> >> Thanks,
> > > >> >> > > >> > > > >> >> >>
> >> Maxim
> > > >> >> > > >> > > > >> >> >>
> >>
> > > >> >> > > >> > > > >> >> >>
>
> > > >> >> > > >> > > > >> >> >>
> > > >> >> > > >> > > > >> >>
> > > >> >> > > >> > > > >>
> > > >> >> > > >> > > >
> > > >> >> > > >> > >
> > > >> >> > > >> >
> > > >> >> > > >>
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> --
> > > >> >> Kevin Sweeney
> > > >> >> @kts
> > > >> >>
> > > >>
> > >
> >
> >
> > --
> > -=Bill
> >
>



-- 
Kevin Sweeney
@kts

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message