aurora-issues mailing list archives

From "David McLaughlin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
Date Wed, 12 Oct 2016 01:12:20 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567175#comment-15567175 ]

David McLaughlin commented on AURORA-1791:
------------------------------------------

This was the proposal in Maxim's design doc:

{quote}
The new approach has an inherent risk of an update getting “stuck” due to an instance
persistently failing health checks (e.g. due to deadlock during application startup or failure
to bind to health port). To mitigate this, the initial_interval_secs will be repurposed to
serve as a grace interval for failing health checks. *Specifically, any health check failures
will be ignored during the grace interval*. There are 2 possible cases given the above:
- Instance responds OK before initial_interval_secs expires → task moves into RUNNING.
- Instance fails to respond OK and the initial_interval_secs expires → task moves into FAILED.
{quote}

Source: https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit#
(under Proposed Flow Overview)
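
For illustration, the two cases from the design doc can be sketched as below. This is only a minimal reading of the quoted text, not the executor's code: the function name resolve_startup_state, the (elapsed_secs, passed) tuple format, and treating the interval as expired once elapsed_secs >= initial_interval_secs are all assumptions.

```python
def resolve_startup_state(checks, initial_interval_secs):
    """Return "RUNNING" or "FAILED" per the two cases in the design doc quote.

    checks: chronological list of (elapsed_secs, passed) tuples, assumed to
    cover the whole grace interval. Hypothetical format, for illustration only.
    """
    for elapsed_secs, passed in checks:
        if elapsed_secs >= initial_interval_secs:
            # Grace interval expired without a successful check.
            break
        if passed:
            # First OK response inside the grace interval.
            return "RUNNING"
        # A failure inside the grace interval is ignored.
    return "FAILED"


# Fails at 1s, responds OK at 6s, grace interval of 10s.
print(resolve_startup_state([(1, False), (6, True)], 10))
# Never responds OK before the 10s grace interval expires.
print(resolve_startup_state([(1, False), (6, False), (11, True)], 10))
```

Under this reading, a failed check inside the grace interval never moves the task to FAILED by itself; only expiry of the interval without a success does.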

> Commit ca683 is not backwards compatible.
> -----------------------------------------
>
>                 Key: AURORA-1791
>                 URL: https://issues.apache.org/jira/browse/AURORA-1791
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Kai Huang
>            Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] is not backwards compatible. The last section of the commit message
> {quote}
> 4. Modified the Health Checker and redefined the meaning initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>       initial_interval_secs: 10
>       interval_secs: 5
>       max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking occurs for the first 10 seconds, so the earliest a task can fail a health check is at the 10th second.
> On master, health checking starts right away, which means the task can fail in the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the change in meaning of {{initial_interval_secs}} and have the task transition into RUNNING when {{min_consecutive_successes}} is met.
> An investigation shows that {{initial_interval_secs}} was set to 5, but the task failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes.
> {noformat}
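
The difference the issue describes between the 0.16.0 executor and master can be sketched with a small simulation. This is a hedged illustration, not the executor's logic: the function first_failure_time and its parameters ready_at_secs and skip_initial_interval are hypothetical, every check before ready_at_secs is assumed to fail, and the task is assumed to be marked failed once consecutive failures reach max_consecutive_failures, which is how the issue text reads (the real executor may compare differently).

```python
def first_failure_time(ready_at_secs, initial_interval_secs, interval_secs,
                       max_consecutive_failures, skip_initial_interval):
    """Return the time in seconds at which the task is marked failed,
    or None if it becomes healthy first.

    skip_initial_interval=True models the 0.16.0 behaviour described in the
    issue (no checks before initial_interval_secs); False models checks
    starting right away, as the issue says happens on master.
    """
    t = initial_interval_secs if skip_initial_interval else 0
    consecutive_failures = 0
    while t < ready_at_secs:
        # Any check performed before the task is ready counts as a failure.
        consecutive_failures += 1
        if consecutive_failures >= max_consecutive_failures:
            return t
        t += interval_secs
    return None


# Config from the issue: initial_interval_secs=10, interval_secs=5,
# max_consecutive_failures=1. The task needs 8 seconds to become healthy.
print(first_failure_time(8, 10, 5, 1, skip_initial_interval=True))
print(first_failure_time(8, 10, 5, 1, skip_initial_interval=False))
```

With these assumptions, a task that needs 8 seconds to start survives when the first 10 seconds are skipped but is failed on the very first check when checking starts immediately.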



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
