aurora-reviews mailing list archives

From Aurora ReviewBot <wfar...@apache.org>
Subject Re: Review Request 52766: Fix a bug in insufficient successes during initial_interval_secs
Date Wed, 12 Oct 2016 05:26:37 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52766/#review152269
-----------------------------------------------------------


Ship it!




Master (e9abb22) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On Oct. 12, 2016, 5:01 a.m., Kai Huang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52766/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2016, 5:01 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Zameer Manji.
> 
> 
> Bugs: AURORA-1791
>     https://issues.apache.org/jira/browse/AURORA-1791
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Fix a bug in commit ca683cb. The commit is related to this review: https://reviews.apache.org/r/51876/.
> Please see it for more details and background.
> 
> Currently, health checks are performed during a grace period called initial_interval_secs.
> It is likely that HealthChecker fails to see a sufficient number of successes before
> initial_interval_secs expires. For example, consider a task with
> HealthCheckConfig(initial_interval_secs=15, interval_secs=10, min_consecutive_successes=1).
> If the task sleeps during the first 12 seconds and becomes healthy afterwards, the health
> checker will report the task status as "TASK_FAILED" and miss the "healthy" status between
> seconds 12 and 15. This is because only one health check is performed, at second 10, before
> initial_interval_secs expires. This is an implementation flaw that breaks backward compatibility.

> 
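The timing in the example above can be reproduced with a minimal sketch. It is not the code in health_checker.py; the helper names are hypothetical, and it assumes the first check runs at t = interval_secs and then repeats every interval_secs.

```python
def checks_before_expiry(initial_interval_secs, interval_secs):
    """Times (seconds since task start) of checks that run before the grace period ends.

    Assumption: the first check runs at t = interval_secs, then every interval_secs.
    """
    return list(range(interval_secs, initial_interval_secs, interval_secs))


def successes_before_expiry(initial_interval_secs, interval_secs, healthy_from):
    """Count the checks inside the grace period that observe a healthy task."""
    times = checks_before_expiry(initial_interval_secs, interval_secs)
    return sum(1 for t in times if t >= healthy_from)


# Values from the description: the task becomes healthy at second 12.
check_times = checks_before_expiry(initial_interval_secs=15, interval_secs=10)
successes = successes_before_expiry(15, 10, healthy_from=12)
print(check_times, successes)
```

For this configuration the sketch finds a single check at second 10 and zero successes, which is below min_consecutive_successes=1 when the grace period expires.
>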
> To address this problem, I rewrote the function that is responsible for updating the
> failure counts and the healthy status. The expected behavior is that, for the task described
> above, the health checker will perform a health check after initial_interval_secs expires
> and set the health check status to healthy. Please see this review for more details.
> 
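The expected behavior can be illustrated with a simplified model. This is a sketch under stated assumptions, not the rewritten function from the patch: the function name and the status strings are hypothetical, checks are assumed to run every interval_secs starting at t = interval_secs, and max_consecutive_failures is left out, so any failed check after the grace period is treated as decisive.

```python
def simulate(initial_interval_secs, interval_secs, min_consecutive_successes,
             healthy_from, horizon_secs):
    """Return (status, time) of the first decisive check, or ("PENDING", None)."""
    consecutive_successes = 0
    for t in range(interval_secs, horizon_secs + 1, interval_secs):
        if t >= healthy_from:
            # Successful check: it counts even after the grace period has expired.
            consecutive_successes += 1
            if consecutive_successes >= min_consecutive_successes:
                return ("HEALTHY", t)
        else:
            # Failed check: reset the streak; ignore it while still in the grace period.
            consecutive_successes = 0
            if t > initial_interval_secs:
                return ("FAILED", t)
    return ("PENDING", None)


# Task from the description: healthy from second 12 onwards.
result = simulate(15, 10, 1, healthy_from=12, horizon_secs=60)
print(result)
```

In this model the check at second 10 fails inside the grace period and is ignored, and the check at second 20 succeeds and marks the task healthy.
>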
> Will add some more tests, since the current e2e tests do not include the above test
> case.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/common/health_checker.py 1e0be108b49480d57c5ab94b1d2903bb57bae20a
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 28769dca68a6353fc1283a8bb279fae05173aaac
> 
> Diff: https://reviews.apache.org/r/52766/diff/
> 
> 
> Testing
> -------
> 
> ./build-support/jenkins/build.sh
> 
> ./pants test.pytest src/test/python/apache/aurora/executor::
> 
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> Modified the test in http_example.py to let the HTTP server sleep for the first 10 seconds.
> 
> Launched a job that contains the task with the default HealthCheckConfig(initial_interval_secs=15,
> interval_secs=10, min_consecutive_successes=1) in the Vagrant Aurora cluster. The task transitioned
> to the TASK_RUNNING state after ~20 seconds.
> 
> 
> File Attachments
> ----------------
> 
> Task with default Health Check Config
>   https://reviews.apache.org/media/uploaded/files/2016/10/12/64cf6610-9294-46cb-b159-6e5721da5fff__Screen_Shot_2016-10-11_at_6.17.00_PM.png
> 
> 
> Thanks,
> 
> Kai Huang
> 
>

