aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bill Farner" <wfar...@apache.org>
Subject Re: Review Request 30225: Modifying update controller to support heartbeats.
Date Thu, 29 Jan 2015 21:32:07 GMT


> On Jan. 28, 2015, 7:41 p.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
lines 259-263
> > <https://reviews.apache.org/r/30225/diff/1/?file=832014#file832014line259>
> >
> >     I am unsure why this is being called inside pulse. Once pulse is activated,
only the absence of a pulse can modify the update, right? We don't resume a paused update
by receiving a pulse. 
> >     
> >     So surely the last pulse time would be checked externally to the method that
performs the pulse? 
> >     
> >     If we can remove this, you can get rid of the write lock completely here, because
all you need are strongly consistent reads (which we have) to accurately update the cooridinatedUpdateStates
map correctly.
> 
> Maxim Khutornenko wrote:
>     An update blocked (not PAUSED) due to a missed pulse can be unblocked by a new pulse.
This covers a few important design desisions:
>     - An update can be created blocked by default awaiting for the first pulse to start
its progress;
>     - An occasional network partition/delay will not require an explicit external service
operation to resume;
>     - A scheduler restart is treated the same as initial update creation - an update
is rehydrated and waits for a pulse to resume;
>     
>     More details and scenarios here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> 
> David McLaughlin wrote:
>     How do we show to the user (via client output or UI) that the update is currently
blocked?
> 
> Maxim Khutornenko wrote:
>     One possible way is described here: https://issues.apache.org/jira/browse/AURORA-1049
> 
> David McLaughlin wrote:
>     I don't think this is sufficient. We'd need auditing to explain to users why an update
was paused (blocked) for a given time, not just the current status.
> 
> Maxim Khutornenko wrote:
>     That would require persistance and changing the actual status of the job. I'd rather
not introduce a new state that would only be applicable to specific update configurations.
The more important here is the visibility into the internal "transient state" to troubleshoot
a coordinated job unable to make progress.
> 
> David McLaughlin wrote:
>     I don't follow the line of reasoning that 100% of updates have to use a feature for
us to have a state to reflect it. I think in terms of simplicity and consistency, it makes
way more sense to have an explicit UPDATE_WAITING_FOR_PULSE state that these updates are initialised
in and moved to when blocked. A pulse could then have one valid state transition that would
require a write: UPDATE_WAITING_FOR_PULSE -> ROLLING_FORWARD. 
>     
>     The drawbacks of this compared to this approach I think would be performance in the
case of the external monitoring service flapping, or in the case of a scheduler failover.
To combat this, we could also add some kind of initial grace period by setting a base timestamp
at the point of leader election. This timestamp could be used in the "should this be blocked?"
calculation when no previous pulse exists in the on-heap data structure.
> 
> Maxim Khutornenko wrote:
>     The consensus we reached on the dev list [1] is to implement the first version of
the feature without any persistence at all. We agreed to look into reacher status reporting
later as/if needed. 
>     
>     IMO, adding a new state is a more involved solution. We would need to revisit the
update state graph, figure out and validate the new rules for user pause/resume actions wrt
the new state and pulse. Not impossible but hardly a blocking requirement either.
>     
>     [1] - http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser
> 
> David McLaughlin wrote:
>     On that email thread, my last comment is to bring up this exact point :)
>     
>     I'm -1 on proceeding without auditing for blocked state. But if the majority disagrees,
then that's fine and I'm fine for this to proceed with another ship it.

David - I do think you raise a valid point, and we at least need to determine whether this
is a reasonable tradeoff.  Can you start a thread on dev@ to try to reach consensus?


- Bill


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review70058
-----------------------------------------------------------


On Jan. 23, 2015, 8:37 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30225/
> -----------------------------------------------------------
> 
> (Updated Jan. 23, 2015, 8:37 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
> 
> 
> Bugs: AURORA-1010
>     https://issues.apache.org/jira/browse/AURORA-1010
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Added pulsing support into the JobUpdateController. The qualified coordinated updates
get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs`
expires or the update reaches a terminal state (whichever comes first).
> 
> Not particularly happy with plumbing through OneWayJobUpdater but the alternative is
a state machine change, which is much hairier and will require more changes in the JobUpdaterController.
Going with the minimal diff here.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2

>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java a992938d4e12b20f81608be6bbdc24c0a211c3fd

>   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173

>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 4c827b183a87b4d97774edbfaa960bd1c3de72a5

>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987

> 
> Diff: https://reviews.apache.org/r/30225/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message