Return-Path: X-Original-To: apmail-aurora-reviews-archive@minotaur.apache.org Delivered-To: apmail-aurora-reviews-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id DB61210366 for ; Thu, 5 Feb 2015 03:34:21 +0000 (UTC) Received: (qmail 4700 invoked by uid 500); 5 Feb 2015 02:34:21 -0000 Delivered-To: apmail-aurora-reviews-archive@aurora.apache.org Received: (qmail 4651 invoked by uid 500); 5 Feb 2015 02:34:21 -0000 Mailing-List: contact reviews-help@aurora.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: reviews@aurora.incubator.apache.org Delivered-To: mailing list reviews@aurora.incubator.apache.org Received: (qmail 4352 invoked by uid 99); 5 Feb 2015 02:34:21 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 05 Feb 2015 02:34:21 +0000 X-ASF-Spam-Status: No, hits=-1997.8 required=5.0 tests=ALL_TRUSTED,HTML_MESSAGE,T_RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.3] (HELO mail.apache.org) (140.211.11.3) by apache.org (qpsmtpd/0.29) with SMTP; Thu, 05 Feb 2015 02:34:15 +0000 Received: (qmail 741 invoked by uid 99); 5 Feb 2015 02:33:55 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 05 Feb 2015 02:33:55 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id 699231CC459; Thu, 5 Feb 2015 02:33:51 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============7699754386762312738==" MIME-Version: 1.0 Subject: Re: Review Request 30225: Modifying update controller to support heartbeats. From: "Maxim Khutornenko" To: "Bill Farner" , "Joshua Cohen" , "David McLaughlin" Cc: "Aurora" , "Maxim Khutornenko" Date: Thu, 05 Feb 2015 02:33:51 -0000 Message-ID: <20150205023351.1285.61047@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org/ Auto-Submitted: auto-generated Sender: "Maxim Khutornenko" X-ReviewGroup: Aurora X-ReviewRequest-URL: https://reviews.apache.org/r/30225/ X-Sender: "Maxim Khutornenko" References: <20150204233434.1285.77850@reviews.apache.org> In-Reply-To: <20150204233434.1285.77850@reviews.apache.org> Reply-To: "Maxim Khutornenko" X-ReviewRequest-Repository: aurora X-Virus-Checked: Checked by ClamAV on apache.org --===============7699754386762312738== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > api/src/main/thrift/org/apache/aurora/gen/api.thrift, line 565 > > > > > > Maybe s/BLOCKED/AWAITING_PULSE/? > > > > That would at least self-document and avoid additional terminology. Renamed. > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java, line 49 > > > > > > s/query/fetch/ to be consistent with the method above. I started with "fetch" but that resulted in an overloaded method that I did not quite liked. Changed it back. > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 106 > > > > > > Remove Optional here, do Optional.of() at the call site. Removed. > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 122 > > > > > > TODO + ticket, this map needs to be exposed via a debug endpoint Done. > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java, line 273 > > > > > > I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration. > > > > Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible. > > > > For these reasons, i think it would be wise to decouple recording of pulses from reacting to them. It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input. > > Maxim Khutornenko wrote: > "I think avoid acquisition of a write lock here is a good goal to aim for. If, for example, we are already holding the write lock for a large snapshot, we could cause it to appear as though we have not received a pulse for an extended duration." - makes sense, I can try to split the recording part from the status update via an async action. > > "Also, given that there is likely to be an automated system on the other end, we should be somewhat defensive to a high call frequency. This suggests to me that potentially-expensive read operations should be avoided if possible." - I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK). > > > "It's also apparent that there is currently no asynchronous transition to the 'blocked' states, they're triggered either by a tardy pulse or another external action. I think this would be surprising behavior, and the scheduler should automatically transition to these states without any external input." - the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature. > > Bill Farner wrote: > > I don't see how this would be possible without breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to the pulse with proper status message (FINISHED|PAUSED|OK) > > Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? > > I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. > > > the transition to BLOCKED state happens during the evaluation loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.). Transitions to these states do not happen instantenously and are triggered by instance activities (i.e. task state change events). I don't see why BLOCKED state should be any different. The update is not technically alive until the evaluation event arrives. Changing this behavior just for heartbeats is not worth the extra complexity and will be inconsistent with the rest of the feature > > Transitions do indeed happen asynchronously when dealing with timeouts (relevant code is in `InstanceActionHandler`). These fall back to the same loop, but give the updater a time after which the state should be re-evaluated. > > Maxim Khutornenko wrote: > >Yeah, this crossed my mind. Perhaps we should poke at the API to see if we can shake those requirements out? For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller should follow up with a query to differentiate between a paused and finished update? I don't feel really strongly about this, but i think we should feel free to question the requirements at this phase. > > I don't think this would buy us much if external service starts following up each pulse that returns UPDATE_NOT_ACTIVE. It would just makes the contract harder to consume. If we are concerned about obuse I think having a rate limiter would make more sense here. Also, I still need to pull instructions to determine if the pulse is legit (i.e. the update is actually a coordinated update). Otherwise, pulse goes nowhere and external service continues to pulse the wrong update. > > Bill Farner wrote: > > Also, I still need to pull instructions to determine if the pulse is legit (i.e. the update is actually a coordinated update). Otherwise, pulse goes nowhere and external service continues to pulse the wrong update. > > That's not _necessarily_ required, you could contain it in a `Runnable` implementation that is responsible for the async transition. > > Maxim Khutornenko wrote: > Not sure I get it. If it's in async part that means the pulse response is already gone and the external service is oblivious what it actually pulsed. > > David McLaughlin wrote: > For me the logic for pulse is fine, it's just the acquisition of the write lock that is a little off. It should be this: > > pulse(JobKey key, String updateId) { > // fetch the update status > // if the update is terminal > // return GO_AWAY > // if the update is non-terminal > // update the pulse > > // if the update is currently blocked > // acquire write lock > // change status > } > > > Give or take some operations. You could remove the fetching of the update status, but then you're just asking the client to start polling the scheduler every pulse instead to make the same decision. Moved to async action to avoid delaying pulse response. > On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote: > > src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml, line 301 > > > > > > is it possible to reduce the diff a bit by moving this block to its former location, or is it needed This is diff playing tricks. I moved the filter code out to reuse it with details select. - Maxim ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30225/#review71063 ----------------------------------------------------------- On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/30225/ > ----------------------------------------------------------- > > (Updated Feb. 4, 2015, 5:24 p.m.) > > > Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner. > > > Bugs: AURORA-1010 > https://issues.apache.org/jira/browse/AURORA-1010 > > > Repository: aurora > > > Description > ------- > > Added pulsing support into the JobUpdateController. The qualified coordinated updates get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs` expires or the update reaches a terminal state (whichever comes first). > > Not particularly happy with plumbing through OneWayJobUpdater but the alternative is a state machine change, which is much hairier and will require more changes in the JobUpdaterController. Going with the minimal diff here. > > > Diffs > ----- > > api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf > src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e > src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1 > src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e > src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2 > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31 > src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173 > src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml f9c9ceddc559b43b4a5c45c745d54ff47484edde > src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde > src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72 > src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969 > src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987 > > Diff: https://reviews.apache.org/r/30225/diff/ > > > Testing > ------- > > ./gradlew -Pq build > > > Thanks, > > Maxim Khutornenko > > --===============7699754386762312738==--