aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maxim Khutornenko" <ma...@apache.org>
Subject Re: Review Request 30225: Modifying update controller to support heartbeats.
Date Thu, 05 Feb 2015 00:56:19 GMT


> On Feb. 4, 2015, 11:34 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java,
line 273
> > <https://reviews.apache.org/r/30225/diff/4/?file=848246#file848246line273>
> >
> >     I think avoid acquisition of a write lock here is a good goal to aim for.  If,
for example, we are already holding the write lock for a large snapshot, we could cause it
to appear as though we have not received a pulse for an extended duration.
> >     
> >     Also, given that there is likely to be an automated system on the other end,
we should be somewhat defensive to a high call frequency.  This suggests to me that potentially-expensive
read operations should be avoided if possible.
> >     
> >     For these reasons, i think it would be wise to decouple recording of pulses
from reacting to them.  It's also apparent that there is currently no asynchronous transition
to the 'blocked' states, they're triggered either by a tardy pulse or another external action.
 I think this would be surprising behavior, and the scheduler should automatically transition
to these states without any external input.
> 
> Maxim Khutornenko wrote:
>     "I think avoid acquisition of a write lock here is a good goal to aim for.  If, for
example, we are already holding the write lock for a large snapshot, we could cause it to
appear as though we have not received a pulse for an extended duration." - makes sense, I
can try to split the recording part from the status update via an async action. 
>     
>     "Also, given that there is likely to be an automated system on the other end, we
should be somewhat defensive to a high call frequency.  This suggests to me that potentially-expensive
read operations should be avoided if possible." - I don't see how this would be possible without
breaking the RPC contract as I still have to pull JobUpdateDetails in order to respond to
the pulse with proper status message (FINISHED|PAUSED|OK).
>     
>     
>     "It's also apparent that there is currently no asynchronous transition to the 'blocked'
states, they're triggered either by a tardy pulse or another external action.  I think this
would be surprising behavior, and the scheduler should automatically transition to these states
without any external input." - the transition to BLOCKED state happens during the evaluation
loop similar to other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED
and etc.). Transitions to these states do not happen instantenously and are triggered by instance
activities (i.e. task state change events). I don't see why BLOCKED state should be any different.
The update is not technically alive until the evaluation event arrives. Changing this behavior
just for heartbeats is not worth the extra complexity and will be inconsistent with the rest
of the feature.
> 
> Bill Farner wrote:
>     > I don't see how this would be possible without breaking the RPC contract as
I still have to pull JobUpdateDetails in order to respond to the pulse with proper status
message (FINISHED|PAUSED|OK)
>     
>     Yeah, this crossed my mind.  Perhaps we should poke at the API to see if we can shake
those requirements out?  For example, we could just return OK/UPDATE_NOT_ACTIVE, and the caller
should follow up with a query to differentiate between a paused and finished update?
>     
>     I don't feel really strongly about this, but i think we should feel free to question
the requirements at this phase.
>     
>     > the transition to BLOCKED state happens during the evaluation loop similar to
other internally triggered state transitions (FAILED|ERROR|ROLL_FORWARD_COMPLETED and etc.).
Transitions to these states do not happen instantenously and are triggered by instance activities
(i.e. task state change events). I don't see why BLOCKED state should be any different. The
update is not technically alive until the evaluation event arrives. Changing this behavior
just for heartbeats is not worth the extra complexity and will be inconsistent with the rest
of the feature
>     
>     Transitions do indeed happen asynchronously when dealing with timeouts (relevant
code is in `InstanceActionHandler`).  These fall back to the same loop, but give the updater
a time after which the state should be re-evaluated.
> 
> Maxim Khutornenko wrote:
>     >Yeah, this crossed my mind.  Perhaps we should poke at the API to see if we can
shake those requirements out?  For example, we could just return OK/UPDATE_NOT_ACTIVE, and
the caller should follow up with a query to differentiate between a paused and finished update?
I don't feel really strongly about this, but i think we should feel free to question the requirements
at this phase.
>     
>     I don't think this would buy us much if external service starts following up each
pulse that returns UPDATE_NOT_ACTIVE. It would just makes the contract harder to consume.
If we are concerned about obuse I think having a rate limiter would make more sense here.
Also, I still need to pull instructions to determine if the pulse is legit (i.e. the update
is actually a coordinated update). Otherwise, pulse goes nowhere and external service continues
to pulse the wrong update.
> 
> Bill Farner wrote:
>     > Also, I still need to pull instructions to determine if the pulse is legit (i.e.
the update is actually a coordinated update). Otherwise, pulse goes nowhere and external service
continues to pulse the wrong update.
>     
>     That's not _necessarily_ required, you could contain it in a `Runnable` implementation
that is responsible for the async transition.

Not sure I get it. If it's in async part that means the pulse response is already gone and
the external service is oblivious what it actually pulsed.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30225/#review71063
-----------------------------------------------------------


On Feb. 4, 2015, 5:24 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30225/
> -----------------------------------------------------------
> 
> (Updated Feb. 4, 2015, 5:24 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Joshua Cohen, and Bill Farner.
> 
> 
> Bugs: AURORA-1010
>     https://issues.apache.org/jira/browse/AURORA-1010
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Added pulsing support into the JobUpdateController. The qualified coordinated updates
get blocked until a pulse arrives. An update then becomes active and proceeds until `blockIfNoPulsesAfterMs`
expires or the update reaches a terminal state (whichever comes first).
> 
> Not particularly happy with plumbing through OneWayJobUpdater but the alternative is
a state machine change, which is much hairier and will require more changes in the JobUpdaterController.
Going with the minimal diff here.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 4f603f9e7ed004e53937a45ea8edf7241e15f5cf

>   src/main/java/org/apache/aurora/scheduler/storage/ForwardingStore.java 7f2ec71d744d5fafac84291cc79feba3398d0e1e

>   src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java b7d8d524e15f101416889c00efc3ecf2c8d9c1a1

>   src/main/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStore.java d479d20259f284528b2291e2e486b36d8e47fe5e

>   src/main/java/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.java 60f535988a20638fb16515d5027bfa65f1afb73c

>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java d3b30d48b76d8d7c64cda006a34f7ed3296526f2

>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java ec9d1e07abca1bf30262e1c0f741a34741e96f5d

>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachine.java 76460f95e058181b711fb6869f5a34c1d5bdfe31

>   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 27a5b9026f5ac3b3bdeb32813b10435bc3dab173

>   src/main/resources/org/apache/aurora/scheduler/storage/db/JobUpdateDetailsMapper.xml
f9c9ceddc559b43b4a5c45c745d54ff47484edde 
>   src/test/java/org/apache/aurora/scheduler/storage/db/DBJobUpdateStoreTest.java ca7c0c2675477cc727ca006697665f997972dfde

>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdateStateMachineTest.java 89765ac3d34a827d3748fb96a275d78e9d1b8b72

>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 41d422939209d0808235128e4242c11e8ef25969

>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 7d0a7438b4a517e5e0d44f4e99aceb1a6d19f987

> 
> Diff: https://reviews.apache.org/r/30225/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message