aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joshua Cohen <jco...@apache.org>
Subject Re: Review Request 53590: Change job updates to rely on `health-checks` rather than on `watch_secs`.
Date Wed, 09 Nov 2016 16:31:24 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53590/#review155440
-----------------------------------------------------------




RELEASE-NOTES.md (lines 19 - 21)
<https://reviews.apache.org/r/53590/#comment225385>

    We should expand this to include the fact that `watch_secs` must be set to 0 for this
behavior to kick in?



docs/development/design-documents.md (line 11)
<https://reviews.apache.org/r/53590/#comment225386>

    There's already a link further down for the original design doc. We should probably replace
that one with the new one (and make sure the new one references the original one?)



docs/features/job-updates.md (lines 51 - 53)
<https://reviews.apache.org/r/53590/#comment225387>

    Oof, this is woefully out of date (the client doesn't do any of this anymore, the scheduler
does).



docs/features/job-updates.md (line 56)
<https://reviews.apache.org/r/53590/#comment225388>

    should detail where this value is set (i.e. "by introducing a `min_consecutive_successess`
parameter on the `HealthCheckConfig` object")



docs/features/job-updates.md (line 57)
<https://reviews.apache.org/r/53590/#comment225389>

    "the `RUNNING` state"



docs/features/job-updates.md (line 58)
<https://reviews.apache.org/r/53590/#comment225390>

    same here for `RUNNING` and `FAILED`



docs/features/job-updates.md (line 60)
<https://reviews.apache.org/r/53590/#comment225391>

    s/Inorder/In order



docs/features/job-updates.md (line 65)
<https://reviews.apache.org/r/53590/#comment225392>

    copy editor nit: I think if you replace "purely based" with "based only" this reads a
little bit better.



docs/reference/configuration.md (line 382)
<https://reviews.apache.org/r/53590/#comment225393>

    s/health-check failures are ignore/during which health-check failures are ignored



src/main/python/apache/aurora/client/config.py (lines 112 - 125)
<https://reviews.apache.org/r/53590/#comment225394>

    for k, v in [('watch_secs', watch_secs), (...)]:
          die(INVALID_VALUE_ERROR_FORMAT % (k, v))
    
    is a little less repetitive?



src/main/python/apache/aurora/executor/aurora_executor.py (line 165)
<https://reviews.apache.org/r/53590/#comment225414>

    let's give the reason arg a name rather than accessing it out of `*args`?



src/main/python/apache/aurora/executor/aurora_executor.py (lines 173 - 178)
<https://reviews.apache.org/r/53590/#comment225396>

    Why not just change the signature of `from_assigned_task` on the `StatusCheckerProvider`
interface so we don't need to perform this `instanceof` check? Non-health-check providers
can just ignore the callback param.



src/main/python/apache/aurora/executor/aurora_executor.py (lines 174 - 175)
<https://reviews.apache.org/r/53590/#comment225397>

    style nit: one arg per line for continuation indents



src/main/python/apache/aurora/executor/aurora_executor.py (line 178)
<https://reviews.apache.org/r/53590/#comment225398>

    ditto here, re: continuation



src/main/python/apache/aurora/executor/common/health_checker.py (line 97)
<https://reviews.apache.org/r/53590/#comment225399>

    rename this to `start_time`?



src/main/python/apache/aurora/executor/common/health_checker.py (line 98)
<https://reviews.apache.org/r/53590/#comment225400>

    inline this below?



src/main/python/apache/aurora/executor/common/health_checker.py (lines 98 - 99)
<https://reviews.apache.org/r/53590/#comment225405>

    If we have a single failed health check after the grace period has expired, it will never
be possible for the task to be considered healthy and thus will always fail, right?
    
    Imagine `initial_interval_secs` set to 60, `interval_secs` set to 10, `min_consecutive_successes`
set to 3, and `max_consecutive_failures` set to 2.
    
    So, the deadline is 90 seconds.
    
    First 60 seconds we ignore health check failures. At t70 we perform a health check and
it fails. We increment `current_consecutive_failures` to 1 but since we haven't exceeded `max_consecutive_failures`
we keep waiting. Now at t80 we perform another health check and it succeeds. We increment
`current_consecutive_successes` to 1 and reset `current_consecutive_failures` to 0. `current_consecutive_successes`
is < 3, so we keep going. Repeat at t90, so `current_consecutive_successes` is now 2. At
t100 we check and see that we've exceeded the deadline `current_consecutive_successes` is
2 which is less than `min_consecutive_successes` so we fail the task. After that first failure
we never could have succeeded.
    
    Is this the desired behavior? If so, should we short circuit on the first failure outside
of the grace period regardless of the value of  `max_consecutive_failures`?
    
    Also, even if we pass all health checks, if we don't start passing health checks until
after the grace period and there's even a millisecond of delay at any time in the health check
loop, since the final deadline expiration is checked before the health check is run, even
if we're firing the third health check at ts90001 (for a 90 second deadline), we'll consider
the deadline expired and thus the task failed before get a chance to run the third health
check. Again, is that the desired outcome (a case can be made that your grace period should
account for this...).



src/main/python/apache/aurora/executor/common/health_checker.py (line 145)
<https://reviews.apache.org/r/53590/#comment225401>

    Set this after the callback is dispatched? I doubt an error invoking the callback is recoverable,
but it seems safer/cleaner to only set this after we know for certain that the callback has
invoked successfully.



src/main/python/apache/aurora/executor/common/health_checker.py (lines 164 - 168)
<https://reviews.apache.org/r/53590/#comment225403>

    Can you add a comment here explaining why this is safe vis-a-vis not having a potentially
dangling health check outside of the interval that could satisfy the min consecutive? I.e.
the deadline is calculated based on the initial interval + one interval for each required
health check?


- Joshua Cohen


On Nov. 9, 2016, 8:29 a.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53590/
> -----------------------------------------------------------
> 
> (Updated Nov. 9, 2016, 8:29 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Bugs: AURORA-1225
>     https://issues.apache.org/jira/browse/AURORA-1225
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Make RUNNING a first class state to indicate that the task is running
> and is healthy. It is achieved by introducing a new configuration
> parameter `min_consecutive_successes`, which will dictate when to move
> a task into RUNNING state.
> 
> With this change, it is possible to set the `watch_secs` to 0, so that
> updates are purely based on the task's health, rather than relying on
> watching the task to in RUNNING state for a pre-determined timeout.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 392496284b7f046a5dd29a5a3f175305c28a5afe 
>   docs/development/design-documents.md 6bfc679c841ef1b0861758ccc12a5150f1bdb5e3 
>   docs/features/job-updates.md 792f2ae5fd14b1ea9af8be000629ce5a7fc2fe8f 
>   docs/reference/configuration.md f2a0b1873f31e91f3bf0cac6f8448e8130fae688 
>   src/main/python/apache/aurora/client/api/updater_util.py c649316edb876565c92cc90c9f030e153c008924

>   src/main/python/apache/aurora/client/config.py 0186af52f0d7d7e3981ec59bf6a01aafee2bcfb1

>   src/main/python/apache/aurora/config/schema/base.py 845163043b0b7b2f9e7aca14677ca9f094658551

>   src/main/python/apache/aurora/executor/aurora_executor.py aee5e56ae66ec7a6acbe97d94d6187fd8646ec9a

>   src/main/python/apache/aurora/executor/common/health_checker.py 3c7c09d173c4adbd89eddaee22cffcdd8d268378

>   src/test/python/apache/aurora/client/test_config.py 5cf68a5145ddf9478baa30453c0bcb73136fa7eb

>   src/test/python/apache/aurora/executor/common/test_health_checker.py da0c56ca084b65427a6122f22f251af8637772d1

>   src/test/python/apache/aurora/executor/test_thermos_executor.py 3f82165333665d127e1ace765a7dada7348a4b91

> 
> Diff: https://reviews.apache.org/r/53590/diff/
> 
> 
> Testing
> -------
> 
> buils-support/jenkins/build.sh
> sh ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message