aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joshua Cohen <jco...@twopensource.com>
Subject Re: Proposal: External Update Coordination
Date Thu, 16 Oct 2014 17:48:18 GMT
What are our thoughts about authentication with regards to heartbeats? It
seems like they should be authenticated since there does exist the
potential for a malicious actor to send its own heartbeats even if the real
monitoring service has detected a problem and ceased sending heartbeats.
I'm not sure exactly how large the attack surface is (if the service is
truly down the scheduler would detect that and roll back the update
regardless), but I think it's worth discussing as we work on the initial
design.

On Wed, Oct 15, 2014 at 7:49 PM, Bill Farner <wfarner@apache.org> wrote:

> David - the plan is to synthesize the waiting state.  Exactly how is not
> yet certain.
>
> On Wednesday, October 15, 2014, Maxim Khutornenko <maxim@apache.org>
> wrote:
>
> > It is certainly possible to add new state or a status message but I
> > don't think it's a blocker for the first iteration. Provided there is
> > enough demand a state/message could be synthesized during the 'get'
> > call based on the volatile state.
> >
> > On Wed, Oct 15, 2014 at 6:36 PM, David McLaughlin <david@dmclaughlin.com
> > <javascript:;>> wrote:
> > > +1 for pause being explicit RPC pauses, but does it really add
> complexity
> > > to just add a new state (WAITING?) when no heartbeat is sent? Not being
> > > able to see that an update was blocked because of a lack of heartbeat
> > seems
> > > like a missing feature.
> > >
> > > On Wed, Oct 15, 2014 at 5:12 PM, Maxim Khutornenko <maxim@apache.org
> > <javascript:;>> wrote:
> > >
> > >> +1. Updated the doc:
> > >>
> > >>
> >
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > >>
> > >> On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner <wfarner@apache.org
> > <javascript:;>> wrote:
> > >> > +1 to the scheduler not proceeding on an update when heartbeats are
> > >> absent,
> > >> > and requiring the heartbeat service to explicitly call
> pauseJobUpdate
> > >> when
> > >> > it detects problems.
> > >> >
> > >> > -=Bill
> > >> >
> > >> > On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney
> > >> <ksweeney@twitter.com.invalid
> > >> >> wrote:
> > >> >
> > >> >> Chatted with Maxim and Bill, I think we figured it out
> > >> >>
> > >> >> I think the confusion stems from the fact that there are two types
> of
> > >> >> pauses in this system, explicit, persisted pauses generated by
the
> > >> >> pauseJobUpdate RPC and implicit, volatile pauses caused due to
the
> > >> absence
> > >> >> of a sufficiently fresh heartbeat (such as in the case of a network
> > >> >> partition).
> > >> >>
> > >> >> In case a monitoring service detects a problem it should call
the
> > >> explicit
> > >> >> pauseJobUpdate RPC, which will cause a state change that requires
> an
> > >> >> explicit resumeJobUpdate RPC to resume. That feature already
> exists.
> > >> >>
> > >> >> But, we need one more thing to make this reliable - heartbeats
to
> > >> protect
> > >> >> against network partitions between the scheduler and the monitoring
> > >> >> service. These can be volatile and lightweight - the scheduler
just
> > >> checks
> > >> >> for a sufficiently fresh heartbeat before it performs an update
> > action,
> > >> and
> > >> >> if none is present it simply refuses to perform the action. If
the
> > >> >> partition heals a new heartbeat will arrive (if the update being
> > >> monitored
> > >> >> should still be allowed to proceed) and the scheduler will allow
> the
> > >> update
> > >> >> to proceed.
> > >> >>
> > >> >>
> > >> >> On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <wfarner@apache.org
> > <javascript:;>>
> > >> wrote:
> > >> >>
> > >> >> > I think we should assess that after building the rest of
the
> > feature.
> > >> >> IIUC
> > >> >> > the rest of the code doesn't care if the update is initially
> > paused.
> > >> >> >
> > >> >> > -=Bill
> > >> >> >
> > >> >> > On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko <
> > maxim@apache.org <javascript:;>
> > >> >
> > >> >> > wrote:
> > >> >> >
> > >> >> > > Can we get a consensus here? Looks like the only sticky
point
> > left
> > >> is
> > >> >> > > around starting an update in paused vs. non-paused state.
I can
> > >> argue
> > >> >> > > either way as it's easy to add later if needed.
> > >> >> > >
> > >> >> > > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <
> wfarner@apache.org
> > <javascript:;>>
> > >> >> wrote:
> > >> >> > > > I'm not arguing against the merits of the approach.
 Just
> > feeling
> > >> out
> > >> >> > > > whether that should be done _after_ the rest of
the heartbeat
> > >> >> support.
> > >> >> > > > Seems like it can be cleanly added at the end to
get
> something
> > >> usable
> > >> >> > > > earlier.
> > >> >> > > >
> > >> >> > > > -=Bill
> > >> >> > > >
> > >> >> > > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney
<
> > >> kevints@apache.org <javascript:;>>
> > >> >> > > wrote:
> > >> >> > > >
> > >> >> > > >> I'm +1 for using lack of heartbeats as a uniform
> > >> >> unknown-or-unhealthy
> > >> >> > > >> signal, and punting on a more complex NACK
signal (which
> we'd
> > >> have
> > >> >> to
> > >> >> > > >> reliably persist).
> > >> >> > > >>
> > >> >> > > >> I think the only disagreement in this thread
is whether the
> > >> default
> > >> >> > > state
> > >> >> > > >> for a new update should be running or
> waiting-for-heartbeat. I
> > >> think
> > >> >> > > >> waiting for a heartbeat is not only a more
correct
> > implementation
> > >> >> (no
> > >> >> > > risk
> > >> >> > > >> of acting after a failover but before the heartbeat
timeout)
> > but
> > >> >> > > simpler to
> > >> >> > > >> implement (initialize the PulseMonitor data
structure as
> empty
> > >> >> rather
> > >> >> > > than
> > >> >> > > >> with a synthetic heartbeat).
> > >> >> > > >>
> > >> >> > > >> From an API consumer perspective the sequence
is:
> > >> >> > > >>
> > >> >> > > >> 1. API client sends a startUpdate RPC to the
scheduler
> > >> >> > > >> 2. API client receives an OK response, then
arranges for
> > >> something
> > >> >> to
> > >> >> > > call
> > >> >> > > >> heartbeat with that updateId on some interval
> > >> >> > > >> 3. Whatever is supposed to send heartbeats
sends one
> > immediately,
> > >> >> then
> > >> >> > > >> starts sending them on some smaller interval
> > >> >> > > >>
> > >> >> > > >> Waiting for the first heartbeat ensures that
this sequence
> has
> > >> been
> > >> >> > > >> completed successfully, while not waiting for
it only ensure
> > that
> > >> >> > step 1
> > >> >> > > >> has happened.
> > >> >> > > >>
> > >> >> > > >>
> > >> >> > > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner
<
> > >> wfarner@apache.org <javascript:;>>
> > >> >> > > wrote:
> > >> >> > > >>
> > >> >> > > >> > Wait - simpler solution than what?  We're
talking about
> not
> > >> doing
> > >> >> > > either.
> > >> >> > > >> >
> > >> >> > > >> > -=Bill
> > >> >> > > >> >
> > >> >> > > >> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin
Sweeney <
> > >> >> kevints@apache.org <javascript:;>
> > >> >> > >
> > >> >> > > >> > wrote:
> > >> >> > > >> >
> > >> >> > > >> > > I think waiting for the first heartbeat
before taking
> any
> > >> action
> > >> >> > is
> > >> >> > > the
> > >> >> > > >> > > simpler solution here as it allows
the implementation to
> > be
> > >> >> > entirely
> > >> >> > > >> > > soft-state and still catches the
bugs I described.
> > >> >> > > >> > >
> > >> >> > > >> > > The implementation is just PulseMonitorImpl<UpdateId>
-
> > >> >> heartbeat
> > >> >> > > calls
> > >> >> > > >> > > pulse and mutation operations check
isAlive. I think the
> > code
> > >> >> > might
> > >> >> > > >> > > actually work as-is.
> > >> >> > > >> > >
> > >> >> > > >> > > On Tue, Oct 14, 2014 at 12:11 PM,
Maxim Khutornenko <
> > >> >> > > maxim@apache.org <javascript:;>>
> > >> >> > > >> > > wrote:
> > >> >> > > >> > >
> > >> >> > > >> > > > Pausing update on creation seems
like a logical
> approach
> > >> when
> > >> >> > > dealing
> > >> >> > > >> > > > with inverted dependency model.
I.e. updater is happy
> to
> > >> act
> > >> >> as
> > >> >> > > long
> > >> >> > > >> > > > as it's greenlighted by the
external signal. It's also
> > >> aligned
> > >> >> > > with a
> > >> >> > > >> > > > failover experience where coordinated
updates are
> > >> rehydrated
> > >> >> in
> > >> >> > > >> paused
> > >> >> > > >> > > > state waiting for HB awakening.
That said, I am OK
> > punting
> > >> it
> > >> >> > for
> > >> >> > > the
> > >> >> > > >> > > > sake of simplicity for now.
> > >> >> > > >> > > >
> > >> >> > > >> > > > Kevin?
> > >> >> > > >> > > >
> > >> >> > > >> > > > On Tue, Oct 14, 2014 at 12:05
PM, Bill Farner <
> > >> >> > wfarner@apache.org <javascript:;>
> > >> >> > > >
> > >> >> > > >> > > wrote:
> > >> >> > > >> > > > > If the goal is to reduce
complexity now and add
> > features
> > >> >> > later,
> > >> >> > > why
> > >> >> > > >> > not
> > >> >> > > >> > > > > nuke both for now - kick
off the update right away,
> > and
> > >> let
> > >> >> > > lack of
> > >> >> > > >> > > > > heartbeats serve as a uniform
"unknown or unhealthy"
> > >> signal?
> > >> >> > > >> > > > >
> > >> >> > > >> > > > > -=Bill
> > >> >> > > >> > > > >
> > >> >> > > >> > > > > On Mon, Oct 13, 2014 at
5:25 PM, Maxim Khutornenko <
> > >> >> > > >> maxim@apache.org <javascript:;>
> > >> >> > > >> > >
> > >> >> > > >> > > > wrote:
> > >> >> > > >> > > > >
> > >> >> > > >> > > > >> I am still +1 on the
idea to have default paused
> > state
> > >> on
> > >> >> > > >> creation.
> > >> >> > > >> > I
> > >> >> > > >> > > > >> think we could still
differentiate between
> initially
> > >> paused
> > >> >> > and
> > >> >> > > >> > timed
> > >> >> > > >> > > > >> out states internally
by looking at pause reason.
> > It's
> > >> >> quite
> > >> >> > > >> > different
> > >> >> > > >> > > > >> if we want to store
explicit NACK reasons from the
> > >> external
> > >> >> > > >> service
> > >> >> > > >> > > > >> though. That would
require persistence and a bit
> more
> > >> >> > > complicated
> > >> >> > > >> > > > >> logic.
> > >> >> > > >> > > > >>
> > >> >> > > >> > > > >> On Mon, Oct 13, 2014
at 5:15 PM, Kevin Sweeney <
> > >> >> > > >> kevints@apache.org <javascript:;>>
> > >> >> > > >> > > > wrote:
> > >> >> > > >> > > > >> > I like the idea
of implementing this
> scheduler-side
> > >> >> purely
> > >> >> > > >> through
> > >> >> > > >> > > > >> volatile
> > >> >> > > >> > > > >> > state, but the
lack of feedback (generic vs
> > specific
> > >> >> error
> > >> >> > > >> > messages
> > >> >> > > >> > > > when
> > >> >> > > >> > > > >> an
> > >> >> > > >> > > > >> > update is paused)
leaves something to be desired.
> > >> Maybe
> > >> >> we
> > >> >> > > can
> > >> >> > > >> > > address
> > >> >> > > >> > > > >> that
> > >> >> > > >> > > > >> > with a metadata
field in the initial call to
> > >> startUpdate
> > >> >> > > (with
> > >> >> > > >> an
> > >> >> > > >> > > > >> optional
> > >> >> > > >> > > > >> > link to a page
where one can get more rich
> > information
> > >> >> > about
> > >> >> > > the
> > >> >> > > >> > > > state of
> > >> >> > > >> > > > >> > the monitor sending/not
sending heartbeats).
> > >> >> > > >> > > > >> >
> > >> >> > > >> > > > >> > The main drawback
is that we may have to wait a
> > >> maximum
> > >> >> of
> > >> >> > > one
> > >> >> > > >> > > > heartbeat
> > >> >> > > >> > > > >> > interval to find
out that an update should be
> > paused.
> > >> >> > > >> > > > >> >
> > >> >> > > >> > > > >> > On Mon, Oct 13,
2014 at 4:55 PM, Maxim
> Khutornenko
> > <
> > >> >> > > >> > > maxim@apache.org <javascript:;>>
> > >> >> > > >> > > > >> wrote:
> > >> >> > > >> > > > >> >
> > >> >> > > >> > > > >> >> The main reason
I preferred the lack-of-ACK
> > approach
> > >> >> over
> > >> >> > an
> > >> >> > > >> > > explicit
> > >> >> > > >> > > > >> >> NACK one is
simplicity. As Joshua pointed out
> > there
> > >> is
> > >> >> > more
> > >> >> > > >> state
> > >> >> > > >> > > to
> > >> >> > > >> > > > >> >> handle in
that case. The lack-of-ACK model can
> be
> > >> >> > completely
> > >> >> > > >> > > > >> >> implemented
in volatile memory sidestepping the
> > >> >> persistent
> > >> >> > > >> > storage
> > >> >> > > >> > > > >> >> entirely.
With the NACK we would need to
> reliably
> > >> >> persist
> > >> >> > > >> > external
> > >> >> > > >> > > > >> >> service call
reasons to survive scheduler
> > failovers.
> > >> >> Not a
> > >> >> > > huge
> > >> >> > > >> > > > >> >> challenge
but something to keep in mind.
> > >> >> > > >> > > > >> >>
> > >> >> > > >> > > > >> >> I still think
the simplicity/reliability
> tradeoff
> > is
> > >> >> > > acceptable
> > >> >> > > >> > > here
> > >> >> > > >> > > > >> >> if we rely
on external service to abort
> > heartbeats in
> > >> >> case
> > >> >> > > of a
> > >> >> > > >> > > > health
> > >> >> > > >> > > > >> >> alert fired.
This can be explicitly documented
> as
> > an
> > >> >> > > external
> > >> >> > > >> > > > >> >> integration
requirement. However, If the
> consensus
> > >> is to
> > >> >> > go
> > >> >> > > a
> > >> >> > > >> > more
> > >> >> > > >> > > > >> >> reliable (though
more complicated) NACK route I
> am
> > >> happy
> > >> >> > to
> > >> >> > > >> > > > reconsider
> > >> >> > > >> > > > >> >> the current
proposal.
> > >> >> > > >> > > > >> >>
> > >> >> > > >> > > > >> >> On Mon, Oct
13, 2014 at 3:50 PM, Joshua Cohen <
> > >> >> > > >> > > > jcohen@twopensource.com <javascript:;>>
> > >> >> > > >> > > > >> >> wrote:
> > >> >> > > >> > > > >> >> > "The
heratbeatJobUpdate RPC serves as an ACK,
> > but
> > >> we
> > >> >> > don't
> > >> >> > > >> > have a
> > >> >> > > >> > > > >> NACK.
> > >> >> > > >> > > > >> >> If
> > >> >> > > >> > > > >> >> > we are
going to let lack-of-ACK serve as the
> > NACK,
> > >> i
> > >> >> > don't
> > >> >> > > >> > think
> > >> >> > > >> > > > it's
> > >> >> > > >> > > > >> >> safe
> > >> >> > > >> > > > >> >> > to resume
when we receive another ACK.  In
> other
> > >> >> words,
> > >> >> > a
> > >> >> > > >> > service
> > >> >> > > >> > > > >> >> toggling
> > >> >> > > >> > > > >> >> > unhealthy
might not be deemed safe to
> proceed."
> > >> >> > > >> > > > >> >> >
> > >> >> > > >> > > > >> >> > Lack-of-ACK
is the scenario where connectivity
> > >> between
> > >> >> > the
> > >> >> > > >> > > monitor
> > >> >> > > >> > > > and
> > >> >> > > >> > > > >> >> the
> > >> >> > > >> > > > >> >> > scheduler
is unavailable. Shouldn't the NACK
> > >> scenario
> > >> >> > > >> > (everything
> > >> >> > > >> > > > is
> > >> >> > > >> > > > >> not
> > >> >> > > >> > > > >> >> > ok!)
be handled by the monitoring service
> > >> triggering
> > >> >> an
> > >> >> > > >> > explicit
> > >> >> > > >> > > > >> pause?
> > >> >> > > >> > > > >> >> > I.e.
section 2 should be updated to say
> > "External
> > >> >> > service
> > >> >> > > >> > detects
> > >> >> > > >> > > > >> service
> > >> >> > > >> > > > >> >> > health
problems and pauses the update" and
> > section
> > >> 4
> > >> >> > > becomes
> > >> >> > > >> > the
> > >> >> > > >> > > > >> current
> > >> >> > > >> > > > >> >> > section
2 (i.e. "Should a heartbeat not be
> > received
> > >> >> the
> > >> >> > > >> > scheduler
> > >> >> > > >> > > > >> pauses
> > >> >> > > >> > > > >> >> > the update.").
> > >> >> > > >> > > > >> >> >
> > >> >> > > >> > > > >> >> > I agree
that it's unsafe to to resume updates
> > after
> > >> >> > > >> receiving a
> > >> >> > > >> > > > >> heartbeat
> > >> >> > > >> > > > >> >> > after
previously pausing due to a missed
> > >> heartbeat. In
> > >> >> > > that
> > >> >> > > >> > > > scenario
> > >> >> > > >> > > > >> I'd
> > >> >> > > >> > > > >> >> > think
we'd want an explicit resumeJobUpdate.
> If
> > the
> > >> >> > > scenario
> > >> >> > > >> > > we're
> > >> >> > > >> > > > >> trying
> > >> >> > > >> > > > >> >> > to handle
is *never* received a heartbeat,
> > that's a
> > >> >> > > separate
> > >> >> > > >> > > > matter,
> > >> >> > > >> > > > >> in
> > >> >> > > >> > > > >> >> > that
case unpausing upon receiving the first
> > >> heartbeat
> > >> >> > > would
> > >> >> > > >> > make
> > >> >> > > >> > > > >> sense,
> > >> >> > > >> > > > >> >> > but it
feels like that complicates things
> quite
> > a
> > >> bit
> > >> >> > > (now we
> > >> >> > > >> > > need
> > >> >> > > >> > > > to
> > >> >> > > >> > > > >> >> > differentiate
between heartbeat #1 and
> hearbeat
> > >> #N).
> > >> >> > > >> > > > >> >> >
> > >> >> > > >> > > > >> >> > On Mon,
Oct 13, 2014 at 2:50 PM, Bill Farner <
> > >> >> > > >> > wfarner@apache.org <javascript:;>
> > >> >> > > >> > > >
> > >> >> > > >> > > > >> wrote:
> > >> >> > > >> > > > >> >> >
> > >> >> > > >> > > > >> >> >> What
is the guidance for deploying while the
> > >> >> heartbeat
> > >> >> > > >> service
> > >> >> > > >> > > is
> > >> >> > > >> > > > >> >> broken?
> > >> >> > > >> > > > >> >> >> I
think i know the answer, but it's important
> > to
> > >> >> spell
> > >> >> > > out.
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> >
Create a new coordinated job update in a
> > paused
> > >> >> > > >> > > > >> (ROLL_FORWARD_PAUSED)
> > >> >> > > >> > > > >> >> >> >
state to avoid any progress until the first
> > >> >> heartbeat
> > >> >> > > call
> > >> >> > > >> > > > arrives.
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> I'm
not sold on this being ultimately
> > >> beneficial.  In
> > >> >> > the
> > >> >> > > >> > worst
> > >> >> > > >> > > > case,
> > >> >> > > >> > > > >> >> >> impact
is still limited by the health check
> > >> >> threshold.
> > >> >> > > >> Seems
> > >> >> > > >> > > like
> > >> >> > > >> > > > >> >> >> premature
optimization at best, and an odd
> one
> > if
> > >> we
> > >> >> > > proceed
> > >> >> > > >> > > > without
> > >> >> > > >> > > > >> a
> > >> >> > > >> > > > >> >> >> 'NACK'
signal via the heartbeatJobUpdate RPC.
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> Allow
resuming of the
> > paused-due-to-no-heartbeat
> > >> >> update
> > >> >> > > via
> > >> >> > > >> a
> > >> >> > > >> > > > >> >> >> >
resumeJobUpdate call.
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> Are
heartbeats required while rolling back?
> If
> > >> so,
> > >> >> > that
> > >> >> > > >> might
> > >> >> > > >> > > > impact
> > >> >> > > >> > > > >> >> the
> > >> >> > > >> > > > >> >> >> design
here and in other places.
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> Allow
resuming of the
> > paused-due-to-no-heartbeat
> > >> >> update
> > >> >> > > via
> > >> >> > > >> a
> > >> >> > > >> > > > fresh
> > >> >> > > >> > > > >> >> >> >
heartbeatJobUpdate call.
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> The
heratbeatJobUpdate RPC serves as an ACK,
> > but
> > >> we
> > >> >> > don't
> > >> >> > > >> > have a
> > >> >> > > >> > > > >> NACK.
> > >> >> > > >> > > > >> >> If
> > >> >> > > >> > > > >> >> >> we
are going to let lack-of-ACK serve as the
> > >> NACK, i
> > >> >> > > don't
> > >> >> > > >> > think
> > >> >> > > >> > > > it's
> > >> >> > > >> > > > >> >> safe
> > >> >> > > >> > > > >> >> >> to
resume when we receive another ACK.  In
> > other
> > >> >> > words, a
> > >> >> > > >> > > service
> > >> >> > > >> > > > >> >> toggling
> > >> >> > > >> > > > >> >> >> unhealthy
might not be deemed safe to
> proceed.
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> Perhaps
just sending OK (or a NOOP
> equivalent)
> > in
> > >> >> case
> > >> >> > > of a
> > >> >> > > >> > > > >> user-paused
> > >> >> > > >> > > > >> >> job
> > >> >> > > >> > > > >> >> >> >
update would make more sense as there is
> > nothing
> > >> >> > > >> monitoring
> > >> >> > > >> > > > service
> > >> >> > > >> > > > >> >> could
> > >> >> > > >> > > > >> >> >> >
do in that case. This should work fine with
> > >> >> > > pause/resume
> > >> >> > > >> > > > >> >> -aware/-agnostic
> > >> >> > > >> > > > >> >> >> >
monitoring service implementation.
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> This
seems reasonable to me - heartbeats for
> a
> > >> paused
> > >> >> > > update
> > >> >> > > >> > > > should
> > >> >> > > >> > > > >> not
> > >> >> > > >> > > > >> >> >> pose
a risk, but can be safely ignored.
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> -=Bill
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> On
Mon, Oct 13, 2014 at 12:48 PM, Maxim
> > >> Khutornenko <
> > >> >> > > >> > > > >> maxim@apache.org <javascript:;>>
> > >> >> > > >> > > > >> >> >> wrote:
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >> >> >
Agreed. That would be a logical
> > generalization
> > >> of
> > >> >> the
> > >> >> > > post
> > >> >> > > >> > > > failover
> > >> >> > > >> > > > >> >> >> >
behavior.
> > >> >> > > >> > > > >> >> >> >
> > >> >> > > >> > > > >> >> >> >
I have updated the above document with the
> > >> >> following
> > >> >> > > >> > changes:
> > >> >> > > >> > > > >> >> >> >
- Reply with PAUSED any time a job was
> > paused by
> > >> >> > user;
> > >> >> > > >> > > > >> >> >> >
- Start in paused state by default.
> > >> >> > > >> > > > >> >> >> >
> > >> >> > > >> > > > >> >> >> >
On Mon, Oct 13, 2014 at 11:32 AM, Kevin
> > Sweeney
> > >> <
> > >> >> > > >> > > > >> kevints@apache.org
<javascript:;>>
> > >> >> > > >> > > > >> >> >> >
wrote:
> > >> >> > > >> > > > >> >> >> >
> The doc mentioned that the scheduler will
> > >> start
> > >> >> an
> > >> >> > > >> update
> > >> >> > > >> > > > >> subject to
> > >> >> > > >> > > > >> >> >> the
> > >> >> > > >> > > > >> >> >> >
> heartbeat countdown, and if it doesn't
> > >> receive a
> > >> >> > > >> heartbeat
> > >> >> > > >> > > it
> > >> >> > > >> > > > >> will
> > >> >> > > >> > > > >> >> >> pause
> > >> >> > > >> > > > >> >> >> >
> the update. Why not start with the update
> > >> >> > > >> > > > >> >> paused-due-to-no-heartbeat
to
> > >> >> > > >> > > > >> >> >> >
> fail-fast any connectivity issues between
> > the
> > >> >> > service
> > >> >> > > >> > > > providing
> > >> >> > > >> > > > >> the
> > >> >> > > >> > > > >> >> >> >
> heartbeats and the scheduler?
> > >> >> > > >> > > > >> >> >> >
>
> > >> >> > > >> > > > >> >> >> >
> On Fri, Oct 10, 2014 at 12:47 PM, Maxim
> > >> >> > Khutornenko <
> > >> >> > > >> > > > >> >> maxim@apache.org
<javascript:;>>
> > >> >> > > >> > > > >> >> >> >
> wrote:
> > >> >> > > >> > > > >> >> >> >
>
> > >> >> > > >> > > > >> >> >> >
>> Hi all,
> > >> >> > > >> > > > >> >> >> >
>>
> > >> >> > > >> > > > >> >> >> >
>> We are proposing a new feature for the
> > >> scheduler
> > >> >> > > >> updater,
> > >> >> > > >> > > > which
> > >> >> > > >> > > > >> you
> > >> >> > > >> > > > >> >> >> >
>> may find helpful.
> > >> >> > > >> > > > >> >> >> >
>>
> > >> >> > > >> > > > >> >> >> >
>> I have posed a brief feature summary
> here:
> > >> >> > > >> > > > >> >> >> >
>>
> > >> >> > > >> > > > >> >> >> >
>>
> > >> >> > > >> > > > >> >> >> >
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >>
> > >> >> > > >> > > > >>
> > >> >> > > >> > > >
> > >> >> > > >> > >
> > >> >> > > >> >
> > >> >> > > >>
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> >
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > >> >> > > >> > > > >> >> >> >
>>
> > >> >> > > >> > > > >> >> >> >
>> Please, reply with your
> > >> >> > feedback/concerns/comments.
> > >> >> > > >> > > > >> >> >> >
>>
> > >> >> > > >> > > > >> >> >> >
>> Thanks,
> > >> >> > > >> > > > >> >> >> >
>> Maxim
> > >> >> > > >> > > > >> >> >> >
>>
> > >> >> > > >> > > > >> >> >> >
> > >> >> > > >> > > > >> >> >>
> > >> >> > > >> > > > >> >>
> > >> >> > > >> > > > >>
> > >> >> > > >> > > >
> > >> >> > > >> > >
> > >> >> > > >> >
> > >> >> > > >>
> > >> >> > >
> > >> >> >
> > >> >>
> > >> >>
> > >> >>
> > >> >> --
> > >> >> Kevin Sweeney
> > >> >> @kts
> > >> >>
> > >>
> >
>
>
> --
> -=Bill
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message