Date: Mon, 13 Oct 2014 17:15:38 -0700
Subject: Re: Proposal: External Update Coordination
From: Kevin Sweeney
To: Aurora
Reply-To: dev@aurora.incubator.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=f46d0438954d355bb0050556ea57

--f46d0438954d355bb0050556ea57
Content-Type: text/plain; charset=UTF-8

I like the idea of implementing this scheduler-side purely through
volatile state, but the lack of feedback (generic vs specific error
messages when an update is paused) leaves something to be desired. Maybe
we can address that with a metadata field in the initial call to
startUpdate (with an optional link to a page where one can get richer
information about the state of the monitor sending/not sending
heartbeats). The main drawback is that we may have to wait up to one
heartbeat interval to find out that an update should be paused.

On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko wrote:

> The main reason I preferred the lack-of-ACK approach over an explicit
> NACK one is simplicity. As Joshua pointed out, there is more state to
> handle in that case. The lack-of-ACK model can be implemented
> completely in volatile memory, sidestepping persistent storage
> entirely. With the NACK we would need to reliably persist external
> service call reasons to survive scheduler failovers. Not a huge
> challenge but something to keep in mind.
>
> I still think the simplicity/reliability tradeoff is acceptable here
> if we rely on the external service to stop sending heartbeats when a
> health alert fires. This can be explicitly documented as an external
> integration requirement. However, if the consensus is to go the more
> reliable (though more complicated) NACK route, I am happy to reconsider
> the current proposal.
>
> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen wrote:
> > "The heartbeatJobUpdate RPC serves as an ACK, but we don't have a
> > NACK. If we are going to let lack-of-ACK serve as the NACK, I don't
> > think it's safe to resume when we receive another ACK. In other
> > words, a service toggling unhealthy might not be deemed safe to
> > proceed."
> >
> > Lack-of-ACK is the scenario where connectivity between the monitor
> > and the scheduler is unavailable. Shouldn't the NACK scenario
> > (everything is not ok!) be handled by the monitoring service
> > triggering an explicit pause? I.e. section 2 should be updated to say
> > "External service detects service health problems and pauses the
> > update" and section 4 becomes the current section 2 (i.e. "Should a
> > heartbeat not be received the scheduler pauses the update.").
> >
> > I agree that it's unsafe to resume updates after receiving a
> > heartbeat after previously pausing due to a missed heartbeat. In that
> > scenario I'd think we'd want an explicit resumeJobUpdate. If the
> > scenario we're trying to handle is one where we *never* received a
> > heartbeat, that's a separate matter; in that case unpausing upon
> > receiving the first heartbeat would make sense, but it feels like
> > that complicates things quite a bit (now we need to differentiate
> > between heartbeat #1 and heartbeat #N).
> >
> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner wrote:
> >
> >> What is the guidance for deploying while the heartbeat service is
> >> broken? I think I know the answer, but it's important to spell out.
> >>
> >> > Create a new coordinated job update in a paused
> >> > (ROLL_FORWARD_PAUSED) state to avoid any progress until the first
> >> > heartbeat call arrives.
> >>
> >> I'm not sold on this being ultimately beneficial. In the worst case,
> >> impact is still limited by the health check threshold.
> >> Seems like premature optimization at best, and an odd one if we
> >> proceed without a 'NACK' signal via the heartbeatJobUpdate RPC.
> >>
> >> > Allow resuming of the paused-due-to-no-heartbeat update via a
> >> > resumeJobUpdate call.
> >>
> >> Are heartbeats required while rolling back? If so, that might impact
> >> the design here and in other places.
> >>
> >> > Allow resuming of the paused-due-to-no-heartbeat update via a
> >> > fresh heartbeatJobUpdate call.
> >>
> >> The heartbeatJobUpdate RPC serves as an ACK, but we don't have a
> >> NACK. If we are going to let lack-of-ACK serve as the NACK, I don't
> >> think it's safe to resume when we receive another ACK. In other
> >> words, a service toggling unhealthy might not be deemed safe to
> >> proceed.
> >>
> >> > Perhaps just sending OK (or a NOOP equivalent) in case of a
> >> > user-paused job update would make more sense, as there is nothing
> >> > the monitoring service could do in that case. This should work
> >> > fine with pause/resume-aware or -agnostic monitoring service
> >> > implementations.
> >>
> >> This seems reasonable to me - heartbeats for a paused update should
> >> not pose a risk, but can be safely ignored.
> >>
> >> -=Bill
> >>
> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko wrote:
> >>
> >> > Agreed. That would be a logical generalization of the
> >> > post-failover behavior.
> >> >
> >> > I have updated the above document with the following changes:
> >> > - Reply with PAUSED any time a job was paused by user;
> >> > - Start in paused state by default.
> >> >
> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney wrote:
> >> > > The doc mentioned that the scheduler will start an update
> >> > > subject to the heartbeat countdown, and if it doesn't receive a
> >> > > heartbeat it will pause the update.
> >> > > Why not start with the update paused-due-to-no-heartbeat to
> >> > > fail-fast any connectivity issues between the service providing
> >> > > the heartbeats and the scheduler?
> >> > >
> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko
> >> > > <maxim@apache.org> wrote:
> >> > >
> >> > >> Hi all,
> >> > >>
> >> > >> We are proposing a new feature for the scheduler updater, which
> >> > >> you may find helpful.
> >> > >>
> >> > >> I have posted a brief feature summary here:
> >> > >>
> >> > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> >> > >>
> >> > >> Please reply with your feedback/concerns/comments.
> >> > >>
> >> > >> Thanks,
> >> > >> Maxim

--f46d0438954d355bb0050556ea57--
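
For readers of this thread, here is a minimal sketch of the lack-of-ACK
model as discussed, kept entirely in volatile memory. It is only an
illustration, not the scheduler's implementation: the class, state, and
method names below are hypothetical, and it assumes the variant debated
above in which the update starts paused until the first heartbeat, a
missed heartbeat pauses it, and only an explicit resume (not a later
heartbeat) lets it proceed again. Replying "PAUSED" for a
heartbeat-paused update is also an assumption; the thread only specifies
that reply for user-paused updates.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    AWAITING_FIRST_HEARTBEAT = "awaiting_first_heartbeat"
    ROLLING_FORWARD = "rolling_forward"
    PAUSED_NO_HEARTBEAT = "paused_no_heartbeat"
    PAUSED_BY_USER = "paused_by_user"


@dataclass
class CoordinatedUpdate:
    """Hypothetical in-memory model of one heartbeat-coordinated update."""

    interval_secs: float
    state: State = State.AWAITING_FIRST_HEARTBEAT
    last_heartbeat: Optional[float] = None

    def tick(self, now: float) -> None:
        # Lack-of-ACK: pause once more than one interval has elapsed
        # since the last heartbeat (or explicit resume).
        if (self.state is State.ROLLING_FORWARD
                and now - self.last_heartbeat > self.interval_secs):
            self.state = State.PAUSED_NO_HEARTBEAT

    def heartbeat(self, now: float) -> str:
        # Heartbeats for a user-paused update are acknowledged but ignored.
        if self.state is State.PAUSED_BY_USER:
            return "PAUSED"
        self.tick(now)  # apply a timeout that elapsed before this call
        self.last_heartbeat = now
        # Only the very first heartbeat unpauses the update.
        if self.state is State.AWAITING_FIRST_HEARTBEAT:
            self.state = State.ROLLING_FORWARD
        return "OK" if self.state is State.ROLLING_FORWARD else "PAUSED"

    def pause(self) -> None:
        self.state = State.PAUSED_BY_USER

    def resume(self, now: float) -> None:
        # Explicit resume restarts the heartbeat countdown from `now`.
        if self.state in (State.PAUSED_NO_HEARTBEAT, State.PAUSED_BY_USER):
            self.state = State.ROLLING_FORWARD
            self.last_heartbeat = now
```

Because nothing here is persisted, a scheduler failover would lose this
state, which is the simplicity/reliability tradeoff raised in the thread.
The sketch also shows the drawback noted at the top of this message: a
pause is only detected on the next tick after the interval expires.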