Mailing-List: contact dev-help@aurora.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@aurora.incubator.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAOTkfX5mSdbwhoMnbCFp4CzTM_ye1xmoBKg3YsRmchJ7-ipizQ@mail.gmail.com>
References: 
 <CAOTkfX7x2oipk4ZFysoS0uWZRizOnKJA3y15pvEW5K4YnUHw-A@mail.gmail.com>
	<CAAATh-a+zWyw+p4beS0oCvo9+URjaWSn7Bg0ZEh0vqb4Xyndhg@mail.gmail.com>
	<CAOTkfX4OiQ=FPMuHVprvdpYOn4PC3hCDv5xdeSeKEc6qCddC8Q@mail.gmail.com>
	<CAGRA8uNk-3_jBJxe2t9c0W+Ac-Xm29hqHxs9pNBsT4+xooUktA@mail.gmail.com>
	<CAMnduq9=WJhX+UGi+2Cz+xFsu40P6GCpn_wQXj4c4xUViK1Dow@mail.gmail.com>
	<CAOTkfX5mSdbwhoMnbCFp4CzTM_ye1xmoBKg3YsRmchJ7-ipizQ@mail.gmail.com>
Date: Mon, 13 Oct 2014 17:15:38 -0700
Message-ID: 
 <CAAATh-Yqt2a7yq+7+mHmrVV6jXNBrfBdB8M5WrWwUvzLVGmrnw@mail.gmail.com>
Subject: Re: Proposal: External Update Coordination
From: Kevin Sweeney <kevints@apache.org>
To: Aurora <dev@aurora.incubator.apache.org>
Content-Type: multipart/alternative; boundary=f46d0438954d355bb0050556ea57

--f46d0438954d355bb0050556ea57
Content-Type: text/plain; charset=UTF-8

I like the idea of implementing this scheduler-side purely through volatile
state, but the lack of feedback (generic vs. specific error messages when an
update is paused) leaves something to be desired. Maybe we can address that
with a metadata field in the initial call to startUpdate (with an optional
link to a page where one can get richer information about the state of
the monitor sending/not sending heartbeats).

The main drawback is that we may have to wait up to one full heartbeat
interval to find out that an update should be paused.
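
To make the trade-off concrete, here is a minimal sketch of the
lack-of-ACK model kept purely in memory. It is illustrative only:
HeartbeatTracker, monitor_url, and the method names are hypothetical and
not part of the proposal or the scheduler API, and timestamps are passed
in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CoordinatedUpdate:
    interval_secs: float                # heartbeat interval for this update
    last_heartbeat: float               # timestamp of the latest heartbeat
    monitor_url: Optional[str] = None   # hypothetical metadata link
    paused: bool = False


class HeartbeatTracker:
    """Hypothetical tracker: state lives in memory only, nothing is persisted."""

    def __init__(self) -> None:
        self._updates: Dict[str, CoordinatedUpdate] = {}

    def start_update(self, update_id: str, interval_secs: float,
                     now: float, monitor_url: Optional[str] = None) -> None:
        # The optional link is captured once, in the initial call.
        self._updates[update_id] = CoordinatedUpdate(
            interval_secs, now, monitor_url)

    def heartbeat(self, update_id: str, now: float) -> None:
        # Only records the ACK. Whether a later heartbeat should resume a
        # paused update is still under discussion, so it is not done here.
        self._updates[update_id].last_heartbeat = now

    def check(self, update_id: str, now: float) -> Optional[str]:
        """Pause the update if a heartbeat is overdue.

        Returns a pause message (with the link, if one was given), or None
        while the update may proceed.
        """
        update = self._updates[update_id]
        if not update.paused and now - update.last_heartbeat > update.interval_secs:
            update.paused = True
        if not update.paused:
            return None
        message = f"paused: no heartbeat within {update.interval_secs}s"
        if update.monitor_url:
            message += f"; see {update.monitor_url}"
        return message
```

If the monitor stops sending heartbeats right after one was recorded,
check() only reports the pause once the full interval has elapsed, which
is the delay mentioned above.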

On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko <maxim@apache.org> wrote:

> The main reason I preferred the lack-of-ACK approach over an explicit
> NACK one is simplicity. As Joshua pointed out there is more state to
> handle in that case. The lack-of-ACK model can be completely
> implemented in volatile memory sidestepping the persistent storage
> entirely. With the NACK we would need to reliably persist external
> service call reasons to survive scheduler failovers. Not a huge
> challenge but something to keep in mind.
>
> I still think the simplicity/reliability tradeoff is acceptable here
> if we rely on the external service to stop sending heartbeats when a
> health alert fires. This can be explicitly documented as an external
> integration requirement. However, if the consensus is to go the more
> reliable (though more complicated) NACK route, I am happy to reconsider
> the current proposal.
>
> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <jcohen@twopensource.com>
> wrote:
> > "The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK. If
> > we are going to let lack-of-ACK serve as the NACK, I don't think it's safe
> > to resume when we receive another ACK.  In other words, a service toggling
> > unhealthy might not be deemed safe to proceed."
> >
> > Lack-of-ACK is the scenario where connectivity between the monitor and the
> > scheduler is unavailable. Shouldn't the NACK scenario (everything is not
> > ok!) be handled by the monitoring service triggering an explicit pause?
> > I.e. section 2 should be updated to say "External service detects service
> > health problems and pauses the update" and section 4 becomes the current
> > section 2 (i.e. "Should a heartbeat not be received the scheduler pauses
> > the update.").
> >
> > I agree that it's unsafe to resume updates after receiving a heartbeat
> > after previously pausing due to a missed heartbeat. In that scenario I'd
> > think we'd want an explicit resumeJobUpdate. If the scenario we're trying
> > to handle is that a heartbeat was *never* received, that's a separate
> > matter; in that case unpausing upon receiving the first heartbeat would
> > make sense, but it feels like that complicates things quite a bit (now we
> > need to differentiate between heartbeat #1 and heartbeat #N).
> >
> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfarner@apache.org> wrote:
> >
> >> What is the guidance for deploying while the heartbeat service is broken?
> >> I think I know the answer, but it's important to spell out.
> >>
> >>
> >>
> >> > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> >> > state to avoid any progress until the first heartbeat call arrives.
> >>
> >>
> >> I'm not sold on this being ultimately beneficial.  In the worst case,
> >> impact is still limited by the health check threshold.  Seems like
> >> premature optimization at best, and an odd one if we proceed without a
> >> 'NACK' signal via the heartbeatJobUpdate RPC.
> >>
> >> > Allow resuming of the paused-due-to-no-heartbeat update via a
> >> > resumeJobUpdate call.
> >>
> >>
> >> Are heartbeats required while rolling back?  If so, that might impact the
> >> design here and in other places.
> >>
> >> > Allow resuming of the paused-due-to-no-heartbeat update via a fresh
> >> > heartbeatJobUpdate call.
> >>
> >>
> >> The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK. If
> >> we are going to let lack-of-ACK serve as the NACK, I don't think it's safe
> >> to resume when we receive another ACK.  In other words, a service toggling
> >> unhealthy might not be deemed safe to proceed.
> >>
> >> > Perhaps just sending OK (or a NOOP equivalent) in case of a user-paused
> >> > job update would make more sense as there is nothing monitoring service
> >> > could do in that case. This should work fine with pause/resume
> >> > -aware/-agnostic monitoring service implementation.
> >>
> >>
> >> This seems reasonable to me - heartbeats for a paused update should not
> >> pose a risk, but can be safely ignored.
> >>
> >>
> >>
> >> -=Bill
> >>
> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <maxim@apache.org>
> >> wrote:
> >>
> >> > Agreed. That would be a logical generalization of the post failover
> >> > behavior.
> >> >
> >> > I have updated the above document with the following changes:
> >> > - Reply with PAUSED any time a job was paused by user;
> >> > - Start in paused state by default.
> >> >
> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevints@apache.org>
> >> > wrote:
> >> > > The doc mentioned that the scheduler will start an update subject to the
> >> > > heartbeat countdown, and if it doesn't receive a heartbeat it will pause
> >> > > the update. Why not start with the update paused-due-to-no-heartbeat to
> >> > > fail-fast any connectivity issues between the service providing the
> >> > > heartbeats and the scheduler?
> >> > >
> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <maxim@apache.org>
> >> > > wrote:
> >> > >
> >> > >> Hi all,
> >> > >>
> >> > >> We are proposing a new feature for the scheduler updater, which you
> >> > >> may find helpful.
> >> > >>
> >> > >> I have posted a brief feature summary here:
> >> > >>
> >> > >>
> >> >
> >>
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> >> > >>
> >> > >> Please reply with your feedback/concerns/comments.
> >> > >>
> >> > >> Thanks,
> >> > >> Maxim
> >> > >>
> >> >
> >>
>

--f46d0438954d355bb0050556ea57--