Return-Path: X-Original-To: apmail-aurora-reviews-archive@minotaur.apache.org Delivered-To: apmail-aurora-reviews-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 444EB10671 for ; Thu, 29 Jan 2015 01:55:47 +0000 (UTC) Received: (qmail 46379 invoked by uid 500); 29 Jan 2015 01:55:42 -0000 Delivered-To: apmail-aurora-reviews-archive@aurora.apache.org Received: (qmail 46333 invoked by uid 500); 29 Jan 2015 01:55:42 -0000 Mailing-List: contact reviews-help@aurora.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: reviews@aurora.incubator.apache.org Delivered-To: mailing list reviews@aurora.incubator.apache.org Received: (qmail 46322 invoked by uid 99); 29 Jan 2015 01:55:42 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 29 Jan 2015 01:55:42 +0000 X-ASF-Spam-Status: No, hits=-1997.8 required=5.0 tests=ALL_TRUSTED,HTML_MESSAGE,T_RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.3] (HELO mail.apache.org) (140.211.11.3) by apache.org (qpsmtpd/0.29) with SMTP; Thu, 29 Jan 2015 01:55:39 +0000 Received: (qmail 46254 invoked by uid 99); 29 Jan 2015 01:55:19 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 29 Jan 2015 01:55:19 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id DDFF91D5C7C; Thu, 29 Jan 2015 01:55:14 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============5083858147215020345==" MIME-Version: 1.0 Subject: Re: Review Request 30225: Modifying update controller to support heartbeats. From: "Maxim Khutornenko" To: "Bill Farner" , "Joshua Cohen" , "David McLaughlin" Cc: "Aurora" , "Maxim Khutornenko" Date: Thu, 29 Jan 2015 01:55:14 -0000 Message-ID: <20150129015514.25680.73394@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org Auto-Submitted: auto-generated Sender: "Maxim Khutornenko" X-ReviewGroup: Aurora X-ReviewRequest-URL: https://reviews.apache.org/r/30225/ X-Sender: "Maxim Khutornenko" References: <20150128194129.25679.49329@reviews.apache.org> In-Reply-To: <20150128194129.25679.49329@reviews.apache.org> Reply-To: "Maxim Khutornenko" X-ReviewRequest-Repository: aurora X-Virus-Checked: Checked by ClamAV on apache.org --===============5083858147215020345== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit > On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote: > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, lines 259-263 > > > > > > I am unsure why this is being called inside pulse. Once pulse is activated, only the absence of a pulse can modify the update, right? We don't resume a paused update by receiving a pulse. > > > > So surely the last pulse time would be checked externally to the method that performs the pulse? > > > > If we can remove this, you can get rid of the write lock completely here, because all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates map correctly. > > Maxim Khutornenko wrote: > An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse. This covers a few important design desisions: > - An update can be created blocked by default awaiting for the first pulse to start its progress; > - An occasional network partition/delay will not require an explicit external service operation to resume; > - A scheduler restart is treated the same as initial update creation - an update is rehydrated and waits for a pulse to resume; > > More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md > > David McLaughlin wrote: > How do we show to the user (via client output or UI) that the update is currently blocked? > > Maxim Khutornenko wrote: > One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049 > > David McLaughlin wrote: > I don't think this is sufficient. We'd need auditing to explain to users why an update was paused (blocked) for a given time, not just the current status. > > Maxim Khutornenko wrote: > That would require persistance and changing the actual status of the job. I'd rather not introduce a new state that would only be applicable to specific update configurations. The more important here is the visibility into the internal "transient state" to troubleshoot a coordinated job unable to make progress. > > David McLaughlin wrote: > I don't follow the line of reasoning that 100% of updates have to use a feature for us to have a state to reflect it. I think in terms of simplicity and consistency, it makes way more sense to have an explicit UPDATE_WAITING_FOR_PULSE state that these updates are initialised in and moved to when blocked. A pulse could then have one valid state transition that would require a write: UPDATE_WAITING_FOR_PULSE -> ROLLING_FORWARD. > > The drawbacks of this compared to this approach I think would be performance in the case of the external monitoring service flapping, or in the case of a scheduler failover. To combat this, we could also add some kind of initial grace period by setting a base timestamp at the point of leader election. This timestamp could be used in the "should this be blocked?" calculation when no previous pulse exists in the on-heap data structure. The consensus we reached on the dev list [1] is to implement the first version of the feature without any persistence at all. We agreed to look into reacher status reporting later as/if needed. IMO, adding a new state is a more involved solution. We would need to revisit the update state graph, figure out and validate the new rules for user pause/resume actions wrt the new state and pulse. Not impossible but hardly a blocking requirement either. [1] - http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser - Maxim ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review70058 ----------------------------------------------------------- On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/30225/ > ----------------------------------------------------------- > > (Updated Jan. 23, 2015, 8:37 p.m.) > > > Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. > > > Bugs: AURORA-1010 > https://issues.apache.org/jira/browse/AURORA-1010 > > > Repository: aurora > > > Description > ------- > > Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). > > Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. > > > Diffs > ----- > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd > src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 > src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5 > src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 > > Diff: https://reviews.apache.org/r/30225/diff/ > > > Testing > ------- > > ./gradlew -Pq build > > > Thanks, > > Maxim Khutornenko > > --===============5083858147215020345==--