aurora-issues mailing list archives

From "Kai Huang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (AURORA-1791) Commit ca683 is not backwards compatible.
Date Wed, 12 Oct 2016 06:43:20 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567768#comment-15567768 ]

Kai Huang edited comment on AURORA-1791 at 10/12/16 6:43 AM:
-------------------------------------------------------------

To sum up, the issue is caused by failing to reach min_consecutive_successes, not by exceeding
max_consecutive_failures.

In commit ca683, I kept updating the failure counter but ignored it until initial_interval_secs
expired. This does not cause any problems, but the behavior is not obvious to readers. I've
changed it so that the failure counter is updated only after initial_interval_secs expires.
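
As a rough illustration of the changed counter behavior, here is a minimal sketch. It is not the
actual health_checker.py code: the class name, the method names and the elapsed_secs argument are
hypothetical. It assumes the interval counts as expired once elapsed_secs reaches
initial_interval_secs, and that a task fails once consecutive failures exceed
max_consecutive_failures.

```python
class FailureCounter:
    """Hypothetical model: count consecutive failures only after the initial interval."""

    def __init__(self, initial_interval_secs, max_consecutive_failures):
        self.initial_interval_secs = initial_interval_secs
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, elapsed_secs, check_ok):
        """Record one health check result; return True if the task should be failed."""
        if elapsed_secs < self.initial_interval_secs:
            # Assumption: inside the initial interval the counter is left untouched.
            return False
        if check_ok:
            # A success resets the run of consecutive failures.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        # Assumption: "exceeding" means strictly greater than the configured maximum.
        return self.consecutive_failures > self.max_consecutive_failures


counter = FailureCounter(initial_interval_secs=10, max_consecutive_failures=1)
early = counter.record(elapsed_secs=1, check_ok=False)    # ignored, still in the initial interval
first = counter.record(elapsed_secs=10, check_ok=False)   # counted, 1 failure does not exceed 1
second = counter.record(elapsed_secs=15, check_ok=False)  # counted, 2 failures exceed 1
```

With these assumed semantics, a failure during the initial interval neither fails the task nor
contributes to the count that is evaluated later.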

For the root cause of the issue, min_consecutive_successes, we have two options here:

(a) Do health checks periodically as defined. Even if initial_interval_secs expires and
min_consecutive_successes has not been reached (because periodic checks will miss some
successes), we do not fail the health check right away. Instead, we rely on the latest health
check to confirm that the task is already in a healthy state.

(b) Do an additional health check when initial_interval_secs expires.

In my recent review request, I implemented (a). This is based on the assumption that if a
task responds OK before initial_interval_secs expires, it will still respond OK at the next
health check. However, it's likely that the task fails to respond OK until we perform this
additional health check. It's highly likely that the instance will be healthy afterwards, but
should we fail the health check according to the definition?
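
To make the difference between the two options concrete, here is a minimal sketch of one possible
reading of each. It is not the code from the review request: the function names, the list of
boolean check results, and the extra_check callable are all hypothetical, and it assumes that in
option (b) the additional check result is counted together with the periodic results.

```python
def trailing_successes(results):
    """Count how many of the most recent results are consecutive successes."""
    streak = 0
    for ok in reversed(results):
        if not ok:
            break
        streak += 1
    return streak


def healthy_at_expiry_option_a(results, min_consecutive_successes):
    """Option (a): if min successes were not reached, trust the latest periodic check."""
    if trailing_successes(results) >= min_consecutive_successes:
        return True
    # Fall back to the most recent periodic check, if any check ran at all.
    return bool(results) and results[-1]


def healthy_at_expiry_option_b(results, min_consecutive_successes, extra_check):
    """Option (b): run one additional check at expiry and count it with the others."""
    # Build a new list so the caller's results are not modified.
    all_results = list(results) + [bool(extra_check())]
    return trailing_successes(all_results) >= min_consecutive_successes


# One failed check followed by one successful check, with two successes required.
periodic = [False, True]
a_passes = healthy_at_expiry_option_a(periodic, 2)
b_passes_if_ok = healthy_at_expiry_option_b(periodic, 2, lambda: True)
b_passes_if_bad = healthy_at_expiry_option_b(periodic, 2, lambda: False)
```

Under this reading, option (a) accepts the task on the strength of its last periodic result,
while option (b) accepts it only if the extra check at expiry also succeeds.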


> Commit ca683 is not backwards compatible.
> -----------------------------------------
>
>                 Key: AURORA-1791
>                 URL: https://issues.apache.org/jira/browse/AURORA-1791
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Kai Huang
>            Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
> is not backwards compatible. The last section of the commit
> {quote}
> 4. Modified the Health Checker and redefined the meaning initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>       initial_interval_secs: 10
>       interval_secs: 5
>       max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 seconds. Here
> the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at the first
> second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to initial_interval_secs
> and have the task transition into RUNNING when {{min_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task failed health
> checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
