aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kai Huang <texasred2...@hotmail.com>
Subject Review Request 52766: Fix a bug in insufficient successes during initial_interval_secs
Date Wed, 12 Oct 2016 01:18:32 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52766/
-----------------------------------------------------------

Review request for Aurora, Joshua Cohen and Zameer Manji.


Bugs: AURORA-1791
    https://issues.apache.org/jira/browse/AURORA-1791


Repository: aurora


Description
-------

Fix a bug in commit ca683cb. The commit is related to this review https://reviews.apache.org/r/51876/.
Please see it for more details and backgrounds.

Currently, health checks are performed during a grace period called initial_interval_secs.
It is likely that HealthChecker fails to see sufficient number of successes before the intitial_interval_secs
expires. For example, for a task with HealthCheckConfig(initital_interval_secs=15, interval_secs=10,
min_consecutive_successes=1). If the task sleeps during the first 12 seconds and becomes healthy
afterwards, the health checker will report the task status as "TASK_FAILED" and miss the "healthy"
status between second 12-15. This is because only one health check is performed at second
10 before the initial_interval_secs expires. This is an implementation flaw that breaks backward-compatability.


To address this problem, I rewrite the function that is responsible for updating the failure
counts and the healthy status. The expected behavior is that for the task described above,
the health checker will performs a health check after the initial_interval_secs expires and
sets the health check status to be healthy. Please see this review for more details.

Will add some more tests since the current e2e tests does not include the above test case.


Diffs
-----

  src/main/python/apache/aurora/executor/common/health_checker.py 1e0be108b49480d57c5ab94b1d2903bb57bae20a

  src/test/python/apache/aurora/executor/common/test_health_checker.py 28769dca68a6353fc1283a8bb279fae05173aaac


Diff: https://reviews.apache.org/r/52766/diff/


Testing
-------

./build-support/jenkins/build.sh

./pants test.pytest src/test/python/apache/aurora/executor::

./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Modified the test in http_example.py. Let the http server sleep for the first 10 seconds.

Launch a job that contains the task with Default HealthCheckConfig(initial_interval_secs=15,
interval_secs=10, min_consecutive_successes=1) in vagrant aurora cluster. The task transitions
to TASK_RUNNING state after ~20 seconds.


File Attachments
----------------

Task with default Health Check Config
  https://reviews.apache.org/media/uploaded/files/2016/10/12/64cf6610-9294-46cb-b159-6e5721da5fff__Screen_Shot_2016-10-11_at_6.17.00_PM.png


Thanks,

Kai Huang


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message