Return-Path: X-Original-To: apmail-aurora-dev-archive@minotaur.apache.org Delivered-To: apmail-aurora-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 7FDCB174B7 for ; Tue, 14 Oct 2014 19:39:00 +0000 (UTC) Received: (qmail 47500 invoked by uid 500); 14 Oct 2014 19:39:00 -0000 Delivered-To: apmail-aurora-dev-archive@aurora.apache.org Received: (qmail 47454 invoked by uid 500); 14 Oct 2014 19:39:00 -0000 Mailing-List: contact dev-help@aurora.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@aurora.incubator.apache.org Delivered-To: mailing list dev@aurora.incubator.apache.org Received: (qmail 47443 invoked by uid 99); 14 Oct 2014 19:39:00 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 14 Oct 2014 19:39:00 +0000 X-ASF-Spam-Status: No, hits=-1997.8 required=5.0 tests=ALL_TRUSTED,HTML_MESSAGE,T_RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.3] (HELO mail.apache.org) (140.211.11.3) by apache.org (qpsmtpd/0.29) with SMTP; Tue, 14 Oct 2014 19:38:34 +0000 Received: (qmail 45620 invoked by uid 99); 14 Oct 2014 19:38:31 -0000 Received: from mail-relay.apache.org (HELO mail-relay.apache.org) (140.211.11.15) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 14 Oct 2014 19:38:31 +0000 Received: from mail-wg0-f48.google.com (mail-wg0-f48.google.com [74.125.82.48]) by mail-relay.apache.org (ASF Mail Server at mail-relay.apache.org) with ESMTPSA id 0B65D1A0446 for ; Tue, 14 Oct 2014 19:38:18 +0000 (UTC) Received: by mail-wg0-f48.google.com with SMTP id k14so11611412wgh.19 for ; Tue, 14 Oct 2014 12:38:27 -0700 (PDT) X-Gm-Message-State: ALoCoQmJ10A6HiGK3EygtutNliFDIbddJKLHK+crK5g21wQrMx8uWPuk6QI4s9g9PhXXGNCgHJk4 MIME-Version: 1.0 X-Received: by 10.180.73.168 with SMTP id m8mr7718985wiv.6.1413315507693; Tue, 14 Oct 2014 12:38:27 -0700 (PDT) Received: by 10.216.113.74 with HTTP; Tue, 14 Oct 2014 12:38:27 -0700 (PDT) In-Reply-To: References: Date: Tue, 14 Oct 2014 12:38:27 -0700 Message-ID: Subject: Re: Proposal: External Update Coordination From: Kevin Sweeney To: Aurora Content-Type: multipart/alternative; boundary=f46d043c06cacab6f00505672812 X-Virus-Checked: Checked by ClamAV on apache.org --f46d043c06cacab6f00505672812 Content-Type: text/plain; charset=UTF-8 I'm +1 for using lack of heartbeats as a uniform unknown-or-unhealthy signal, and punting on a more complex NACK signal (which we'd have to reliably persist). I think the only disagreement in this thread is whether the default state for a new update should be running or waiting-for-heartbeat. I think waiting for a heartbeat is not only a more correct implementation (no risk of acting after a failover but before the heartbeat timeout) but simpler to implement (initialize the PulseMonitor data structure as empty rather than with a synthetic heartbeat). >From an API consumer perspective the sequence is: 1. API client sends a startUpdate RPC to the scheduler 2. API client receives an OK response, then arranges for something to call heartbeat with that updateId on some interval 3. Whatever is supposed to send heartbeats sends one immediately, then starts sending them on some smaller interval Waiting for the first heartbeat ensures that this sequence has been completed successfully, while not waiting for it only ensure that step 1 has happened. On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner wrote: > Wait - simpler solution than what? We're talking about not doing either. > > -=Bill > > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney > wrote: > > > I think waiting for the first heartbeat before taking any action is the > > simpler solution here as it allows the implementation to be entirely > > soft-state and still catches the bugs I described. > > > > The implementation is just PulseMonitorImpl - heartbeat calls > > pulse and mutation operations check isAlive. I think the code might > > actually work as-is. > > > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko > > wrote: > > > > > Pausing update on creation seems like a logical approach when dealing > > > with inverted dependency model. I.e. updater is happy to act as long > > > as it's greenlighted by the external signal. It's also aligned with a > > > failover experience where coordinated updates are rehydrated in paused > > > state waiting for HB awakening. That said, I am OK punting it for the > > > sake of simplicity for now. > > > > > > Kevin? > > > > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner > > wrote: > > > > If the goal is to reduce complexity now and add features later, why > not > > > > nuke both for now - kick off the update right away, and let lack of > > > > heartbeats serve as a uniform "unknown or unhealthy" signal? > > > > > > > > -=Bill > > > > > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko > > > > wrote: > > > > > > > >> I am still +1 on the idea to have default paused state on creation. > I > > > >> think we could still differentiate between initially paused and > timed > > > >> out states internally by looking at pause reason. It's quite > different > > > >> if we want to store explicit NACK reasons from the external service > > > >> though. That would require persistence and a bit more complicated > > > >> logic. > > > >> > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney > > > wrote: > > > >> > I like the idea of implementing this scheduler-side purely through > > > >> volatile > > > >> > state, but the lack of feedback (generic vs specific error > messages > > > when > > > >> an > > > >> > update is paused) leaves something to be desired. Maybe we can > > address > > > >> that > > > >> > with a metadata field in the initial call to startUpdate (with an > > > >> optional > > > >> > link to a page where one can get more rich information about the > > > state of > > > >> > the monitor sending/not sending heartbeats). > > > >> > > > > >> > The main drawback is that we may have to wait a maximum of one > > > heartbeat > > > >> > interval to find out that an update should be paused. > > > >> > > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko < > > maxim@apache.org> > > > >> wrote: > > > >> > > > > >> >> The main reason I preferred the lack-of-ACK approach over an > > explicit > > > >> >> NACK one is simplicity. As Joshua pointed out there is more state > > to > > > >> >> handle in that case. The lack-of-ACK model can be completely > > > >> >> implemented in volatile memory sidestepping the persistent > storage > > > >> >> entirely. With the NACK we would need to reliably persist > external > > > >> >> service call reasons to survive scheduler failovers. Not a huge > > > >> >> challenge but something to keep in mind. > > > >> >> > > > >> >> I still think the simplicity/reliability tradeoff is acceptable > > here > > > >> >> if we rely on external service to abort heartbeats in case of a > > > health > > > >> >> alert fired. This can be explicitly documented as an external > > > >> >> integration requirement. However, If the consensus is to go a > more > > > >> >> reliable (though more complicated) NACK route I am happy to > > > reconsider > > > >> >> the current proposal. > > > >> >> > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen < > > > jcohen@twopensource.com> > > > >> >> wrote: > > > >> >> > "The heratbeatJobUpdate RPC serves as an ACK, but we don't > have a > > > >> NACK. > > > >> >> If > > > >> >> > we are going to let lack-of-ACK serve as the NACK, i don't > think > > > it's > > > >> >> safe > > > >> >> > to resume when we receive another ACK. In other words, a > service > > > >> >> toggling > > > >> >> > unhealthy might not be deemed safe to proceed." > > > >> >> > > > > >> >> > Lack-of-ACK is the scenario where connectivity between the > > monitor > > > and > > > >> >> the > > > >> >> > scheduler is unavailable. Shouldn't the NACK scenario > (everything > > > is > > > >> not > > > >> >> > ok!) be handled by the monitoring service triggering an > explicit > > > >> pause? > > > >> >> > I.e. section 2 should be updated to say "External service > detects > > > >> service > > > >> >> > health problems and pauses the update" and section 4 becomes > the > > > >> current > > > >> >> > section 2 (i.e. "Should a heartbeat not be received the > scheduler > > > >> pauses > > > >> >> > the update."). > > > >> >> > > > > >> >> > I agree that it's unsafe to to resume updates after receiving a > > > >> heartbeat > > > >> >> > after previously pausing due to a missed heartbeat. In that > > > scenario > > > >> I'd > > > >> >> > think we'd want an explicit resumeJobUpdate. If the scenario > > we're > > > >> trying > > > >> >> > to handle is *never* received a heartbeat, that's a separate > > > matter, > > > >> in > > > >> >> > that case unpausing upon receiving the first heartbeat would > make > > > >> sense, > > > >> >> > but it feels like that complicates things quite a bit (now we > > need > > > to > > > >> >> > differentiate between heartbeat #1 and hearbeat #N). > > > >> >> > > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner < > wfarner@apache.org > > > > > > >> wrote: > > > >> >> > > > > >> >> >> What is the guidance for deploying while the heartbeat service > > is > > > >> >> broken? > > > >> >> >> I think i know the answer, but it's important to spell out. > > > >> >> >> > > > >> >> >> > > > >> >> >> > > > >> >> >> > Create a new coordinated job update in a paused > > > >> (ROLL_FORWARD_PAUSED) > > > >> >> >> > state to avoid any progress until the first heartbeat call > > > arrives. > > > >> >> >> > > > >> >> >> > > > >> >> >> I'm not sold on this being ultimately beneficial. In the > worst > > > case, > > > >> >> >> impact is still limited by the health check threshold. Seems > > like > > > >> >> >> premature optimization at best, and an odd one if we proceed > > > without > > > >> a > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate RPC. > > > >> >> >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat update via a > > > >> >> >> > resumeJobUpdate call. > > > >> >> >> > > > >> >> >> > > > >> >> >> Are heartbeats required while rolling back? If so, that might > > > impact > > > >> >> the > > > >> >> >> design here and in other places. > > > >> >> >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat update via a > > > fresh > > > >> >> >> > heartbeatJobUpdate call. > > > >> >> >> > > > >> >> >> > > > >> >> >> The heratbeatJobUpdate RPC serves as an ACK, but we don't > have a > > > >> NACK. > > > >> >> If > > > >> >> >> we are going to let lack-of-ACK serve as the NACK, i don't > think > > > it's > > > >> >> safe > > > >> >> >> to resume when we receive another ACK. In other words, a > > service > > > >> >> toggling > > > >> >> >> unhealthy might not be deemed safe to proceed. > > > >> >> >> > > > >> >> >> Perhaps just sending OK (or a NOOP equivalent) in case of a > > > >> user-paused > > > >> >> job > > > >> >> >> > update would make more sense as there is nothing monitoring > > > service > > > >> >> could > > > >> >> >> > do in that case. This should work fine with pause/resume > > > >> >> -aware/-agnostic > > > >> >> >> > monitoring service implementation. > > > >> >> >> > > > >> >> >> > > > >> >> >> This seems reasonable to me - heartbeats for a paused update > > > should > > > >> not > > > >> >> >> pose a risk, but can be safely ignored. > > > >> >> >> > > > >> >> >> > > > >> >> >> > > > >> >> >> -=Bill > > > >> >> >> > > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko < > > > >> maxim@apache.org> > > > >> >> >> wrote: > > > >> >> >> > > > >> >> >> > Agreed. That would be a logical generalization of the post > > > failover > > > >> >> >> > behavior. > > > >> >> >> > > > > >> >> >> > I have updated the above document with the following > changes: > > > >> >> >> > - Reply with PAUSED any time a job was paused by user; > > > >> >> >> > - Start in paused state by default. > > > >> >> >> > > > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney < > > > >> kevints@apache.org> > > > >> >> >> > wrote: > > > >> >> >> > > The doc mentioned that the scheduler will start an update > > > >> subject to > > > >> >> >> the > > > >> >> >> > > heartbeat countdown, and if it doesn't receive a heartbeat > > it > > > >> will > > > >> >> >> pause > > > >> >> >> > > the update. Why not start with the update > > > >> >> paused-due-to-no-heartbeat to > > > >> >> >> > > fail-fast any connectivity issues between the service > > > providing > > > >> the > > > >> >> >> > > heartbeats and the scheduler? > > > >> >> >> > > > > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko < > > > >> >> maxim@apache.org> > > > >> >> >> > > wrote: > > > >> >> >> > > > > > >> >> >> > >> Hi all, > > > >> >> >> > >> > > > >> >> >> > >> We are proposing a new feature for the scheduler updater, > > > which > > > >> you > > > >> >> >> > >> may find helpful. > > > >> >> >> > >> > > > >> >> >> > >> I have posed a brief feature summary here: > > > >> >> >> > >> > > > >> >> >> > >> > > > >> >> >> > > > > >> >> >> > > > >> >> > > > >> > > > > > > https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md > > > >> >> >> > >> > > > >> >> >> > >> Please, reply with your feedback/concerns/comments. > > > >> >> >> > >> > > > >> >> >> > >> Thanks, > > > >> >> >> > >> Maxim > > > >> >> >> > >> > > > >> >> >> > > > > >> >> >> > > > >> >> > > > >> > > > > > > --f46d043c06cacab6f00505672812--