aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kai Huang <>
Subject Review Request 52766: Fix a bug in insufficient successes during initial_interval_secs
Date Wed, 12 Oct 2016 01:18:32 GMT

This is an automatically generated e-mail. To reply, visit:

Review request for Aurora, Joshua Cohen and Zameer Manji.

Bugs: AURORA-1791

Repository: aurora


Fix a bug in commit ca683cb. The commit is related to this review
Please see it for more details and backgrounds.

Currently, health checks are performed during a grace period called initial_interval_secs.
It is likely that HealthChecker fails to see sufficient number of successes before the intitial_interval_secs
expires. For example, for a task with HealthCheckConfig(initital_interval_secs=15, interval_secs=10,
min_consecutive_successes=1). If the task sleeps during the first 12 seconds and becomes healthy
afterwards, the health checker will report the task status as "TASK_FAILED" and miss the "healthy"
status between second 12-15. This is because only one health check is performed at second
10 before the initial_interval_secs expires. This is an implementation flaw that breaks backward-compatability.

To address this problem, I rewrite the function that is responsible for updating the failure
counts and the healthy status. The expected behavior is that for the task described above,
the health checker will performs a health check after the initial_interval_secs expires and
sets the health check status to be healthy. Please see this review for more details.

Will add some more tests since the current e2e tests does not include the above test case.


  src/main/python/apache/aurora/executor/common/ 1e0be108b49480d57c5ab94b1d2903bb57bae20a

  src/test/python/apache/aurora/executor/common/ 28769dca68a6353fc1283a8bb279fae05173aaac




./pants test.pytest src/test/python/apache/aurora/executor::


Modified the test in Let the http server sleep for the first 10 seconds.

Launch a job that contains the task with Default HealthCheckConfig(initial_interval_secs=15,
interval_secs=10, min_consecutive_successes=1) in vagrant aurora cluster. The task transitions
to TASK_RUNNING state after ~20 seconds.

File Attachments

Task with default Health Check Config


Kai Huang

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message