Date: Wed, 12 Oct 2016 06:43:20 +0000 (UTC)
From: "Kai Huang (JIRA)"
To: issues@aurora.apache.org
Reply-To: dev@aurora.apache.org
Subject: [jira] [Comment Edited] (AURORA-1791) Commit ca683 is not backwards compatible.

    [ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567768#comment-15567768 ]

Kai Huang edited comment on AURORA-1791 at 10/12/16 6:43 AM:
-------------------------------------------------------------

To sum up, the issue is caused by failing to reach min_consecutive_successes, not by exceeding max_consecutive_failures.

In commit ca683, I keep updating the failure counter but ignore it until initial_interval_secs expires. This does not cause any problems, but it is not clear to people. I've changed it to update the failure counter only after initial_interval_secs expires.

For the root cause of the issue, min_consecutive_successes, we have two options here:

(a) Do health checks periodically as defined. Even if initial_interval_secs expires and min_consecutive_successes has not been reached (because periodic checks can miss some successes), we do not fail the health check right away. Instead, we rely on the latest health check to confirm that the task is already in a healthy state.

(b) Do an additional health check whenever initial_interval_secs expires.

In my recent review request, I implemented (a). This is based on the assumption that if a task responds OK before initial_interval_secs expires, it will still respond OK at the next health check. However, the task may well fail to respond OK until we perform this additional health check. The instance will very likely be healthy afterwards, but should we still fail the health check, according to the definition?


was (Author: kaih):
To sum up, the issue is caused by failing to reach min_consecutive_successes, not by exceeding max_consecutive_failures.

In commit ca683, I keep updating the failure counter but only ignore it until initial_interval_secs expires. This does not cause any problems, but it is not clear to people. I've changed it to update the failure counter only after initial_interval_secs expires.

For the root cause of the issue, min_consecutive_successes, we have two options here:

(a) Do health checks periodically as defined. Even if initial_interval_secs expires and min_consecutive_successes has not been reached (because periodic checks can miss some successes), we do not fail the health check right away. Instead, we rely on the latest health check to confirm that the task is already in a healthy state.

(b) Do an additional health check whenever initial_interval_secs expires.

In my recent review request, I implemented (a). This is based on the assumption that if a task responds OK before initial_interval_secs expires, it will still respond OK at the next health check. However, the task may well fail to respond OK until we perform this additional health check. The instance will very likely be healthy afterwards, but should we still fail the health check, according to the definition?


> Commit ca683 is not backwards compatible.
> -----------------------------------------
>
>                 Key: AURORA-1791
>                 URL: https://issues.apache.org/jira/browse/AURORA-1791
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Kai Huang
>            Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] is not backwards compatible. The last section of the commit
> {quote}
> 4. Modified the Health Checker and redefined the meaning initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
> initial_interval_secs: 10
> interval_secs: 5
> max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 seconds. Here the earliest a task can fail is at the 10th second.
> On master, health checking starts right away, which means the task can fail in the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to {{initial_interval_secs}} and have the task transition into RUNNING when {{min_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes.
> {noformat}
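
For illustration, one possible reading of option (a) from the comment can be sketched in Python. This is a minimal sketch, not the code in health_checker.py: the HealthCheckTracker class, its record() method, and the running/failed flags are hypothetical names. It assumes that failures are only counted once initial_interval_secs has expired, that max_consecutive_failures is the number of failures tolerated before the task is failed, and that once the initial interval has expired a successful latest check is enough to treat the task as healthy even if min_consecutive_successes was not reached.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckTracker:
    """Hypothetical tracker for option (a); not the Aurora executor code."""

    initial_interval_secs: float
    max_consecutive_failures: int
    min_consecutive_successes: int
    consecutive_failures: int = 0
    consecutive_successes: int = 0
    running: bool = False  # task is considered healthy
    failed: bool = False   # task has failed health checking (terminal)

    def record(self, elapsed_secs: float, healthy: bool) -> None:
        """Record one periodic check taken elapsed_secs after task start."""
        if self.failed:
            return
        in_initial_interval = elapsed_secs < self.initial_interval_secs

        if healthy:
            self.consecutive_failures = 0
            self.consecutive_successes += 1
            reached_min = (
                self.consecutive_successes >= self.min_consecutive_successes
            )
            # Assumption for option (a): after the initial interval expires,
            # trust the latest successful check instead of failing the task
            # for not having reached min_consecutive_successes.
            if reached_min or not in_initial_interval:
                self.running = True
            return

        self.consecutive_successes = 0
        if in_initial_interval:
            # Failures are not counted until initial_interval_secs expires.
            return
        self.consecutive_failures += 1
        if self.consecutive_failures > self.max_consecutive_failures:
            self.failed = True


# Usage: checks every 5 seconds, with a failure at start and two successes.
tracker = HealthCheckTracker(
    initial_interval_secs=10,
    max_consecutive_failures=1,
    min_consecutive_successes=3,
)
for elapsed, ok in [(0, False), (5, True), (10, True)]:
    tracker.record(elapsed, ok)
print(tracker.running, tracker.failed)
```

In this sketch the task above is treated as healthy at the 10th second even though only two of the three required successes were observed, which is the trade-off the comment asks about. Option (b) would instead perform one extra check at the moment initial_interval_secs expires.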