Mailing-List: contact issues-help@aurora.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@aurora.apache.org
Date: Wed, 12 Oct 2016 06:43:20 +0000 (UTC)
From: "Kai Huang (JIRA)" <jira@apache.org>
To: issues@aurora.apache.org
Message-ID: <JIRA.13011500.1476220841000.801888.1476254600488@Atlassian.JIRA>
In-Reply-To: <JIRA.13011500.1476220841000@Atlassian.JIRA>
References: <JIRA.13011500.1476220841000@Atlassian.JIRA> <JIRA.13011500.1476220841461@arcas>
Subject: [jira] [Comment Edited] (AURORA-1791) Commit ca683 is not backwards
 compatible.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Wed, 12 Oct 2016 06:43:22 -0000


    [ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567768#comment-15567768 ] 

Kai Huang edited comment on AURORA-1791 at 10/12/16 6:43 AM:
-------------------------------------------------------------

To sum up, the issue is caused by failing to reach min_consecutive_successes, not by exceeding max_consecutive_failures.

In commit ca683, I kept updating the failure counter but ignored it until initial_interval_secs expired. This does not cause any problems, but it was not clear to readers. I've changed it so that the failure counter is updated only after initial_interval_secs expires.
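
A minimal sketch of the changed counter behaviour, for illustration only. The class and method names are hypothetical and are not taken from health_checker.py; the "greater than" threshold comparison is also an assumption.

```python
class FailureCounter:
    """Hypothetical helper; not the actual executor code."""

    def __init__(self, initial_interval_secs, max_consecutive_failures):
        self.initial_interval_secs = initial_interval_secs
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, healthy, elapsed_secs):
        """Record one check result; return True if the task should be failed."""
        if healthy:
            # Any success resets the run of consecutive failures.
            self.consecutive_failures = 0
            return False
        if elapsed_secs < self.initial_interval_secs:
            # Failures inside the initial interval are not counted at all.
            return False
        self.consecutive_failures += 1
        # Assumption: the task fails once the tolerated count is exceeded.
        return self.consecutive_failures > self.max_consecutive_failures


counter = FailureCounter(initial_interval_secs=10, max_consecutive_failures=1)
early = counter.record(healthy=False, elapsed_secs=2)    # inside the interval
first = counter.record(healthy=False, elapsed_secs=12)   # first counted failure
second = counter.record(healthy=False, elapsed_secs=17)  # exceeds the limit
```

With these inputs, the failure at 2 seconds leaves the counter untouched, and only the failures after the 10-second mark count toward the limit.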

For the root cause of the issue, min_consecutive_successes, we have two options here:

(a) Do health checks periodically as defined. Even if initial_interval_secs expires and the minimum number of successes has not been reached (because periodic checks will miss some successes), we do not fail the health check right away. Instead, we rely on the latest health check to confirm that the task is already in a healthy state.

(b) Do an additional health check when initial_interval_secs expires.

In my recent review request, I implemented (a). This is based on the assumption that if a task responds OK before initial_interval_secs expires, it will still respond OK at the next health check. However, it is likely that the task will not respond OK until we perform this additional health check. It's highly likely the instance will be healthy afterwards, but should we fail the health check according to the definition?
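
The two options can be compared with a rough sketch of the decision made when initial_interval_secs expires. The function names and arguments are hypothetical and this is not the code in the review request. In particular, counting the extra check in (b) as one more success toward the minimum is an assumption.

```python
def option_a(consecutive_successes, min_consecutive_successes, latest_check_ok):
    """(a): if the minimum was not reached, rely on the latest periodic check."""
    if consecutive_successes >= min_consecutive_successes:
        return True
    # Do not fail right away; trust the most recent periodic result instead.
    return latest_check_ok


def option_b(consecutive_successes, min_consecutive_successes, run_check):
    """(b): perform one additional check at the moment the interval expires."""
    if consecutive_successes >= min_consecutive_successes:
        return True
    # Assumption: the extra check counts as one more sample toward the minimum.
    if run_check():
        consecutive_successes += 1
    return consecutive_successes >= min_consecutive_successes
```

Under (a), a task with zero recorded successes still passes if its latest check was OK. Under (b) as sketched, the same task passes only if the extra check succeeds and that one success is enough to reach min_consecutive_successes.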



> Commit ca683 is not backwards compatible.
> -----------------------------------------
>
>                 Key: AURORA-1791
>                 URL: https://issues.apache.org/jira/browse/AURORA-1791
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Kai Huang
>            Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>       initial_interval_secs: 10
>       interval_secs: 5
>       max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the change in the meaning of initial_interval_secs and have the task transition into RUNNING when {{min_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes.
> {noformat}


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)