aurora-issues mailing list archives

From "Maxim Khutornenko (JIRA)" <>
Subject [jira] [Commented] (AURORA-1041) Allow job uptime stats to control scheduler updater pace
Date Wed, 21 Jan 2015 20:37:34 GMT


Maxim Khutornenko commented on AURORA-1041:

Absolutely not. This is an optional strategy to be used for large(r) jobs where service stability
is paramount. It can be used by itself as well as in combination with {{batch_size}}.

> Allow job uptime stats to control scheduler updater pace 
> ---------------------------------------------------------
>                 Key: AURORA-1041
>                 URL:
>             Project: Aurora
>          Issue Type: Task
>          Components: Client, Scheduler
>            Reporter: Maxim Khutornenko
>            Assignee: Maxim Khutornenko
> The current implementation of the scheduler updater relies on a user-defined {{batch_size}}
value to determine how many instances can be updated simultaneously. While this approach is
well understood and battle tested, it comes with its own risks/inefficiencies:
> - No knowledge of job health outside of an active batch. Once an instance passes the
{{watch_secs}} interval it's considered "healthy" and is never looked at by the updater again. Even
if updated instances start flapping later, the updater keeps on going;
> - The {{batch_size}} fixed value may artificially slow down the updater progress as it's
usually chosen conservatively as the max number of instances a service can tolerate at any
given moment and may not reflect the actual job restart pace (see related AURORA-894).
> - Instances are evaluated/updated in an ordered fashion, so in an update that both updates
existing instances and adds new ones, any new instances come up at the very end of the
update sequence.
> The proposed solution will capitalize on the concept of *job uptime* introduced in AURORA-290
and will allow the scheduler updater to proceed as long as the "X% of instances up over Y interval"
job invariant is met.
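
For illustration only, the "X% of instances up over Y interval" invariant could be checked roughly as sketched below. This is not the Aurora scheduler's code: the names UptimeInvariant, invariant_met and max_instances_to_update are hypothetical, and the sketch assumes each instance's continuous uptime in seconds is already known (0 for an instance that is down).

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class UptimeInvariant:
    """Hypothetical container for the 'X% up over Y interval' rule."""
    min_percent_up: float  # X: percentage of instances that must be up
    interval_secs: float   # Y: how long an instance must have been up


def _count_up(uptimes_secs: Sequence[float], inv: UptimeInvariant) -> int:
    # An instance counts as "up" only if it has been running continuously
    # for at least the whole interval.
    return sum(1 for u in uptimes_secs if u >= inv.interval_secs)


def invariant_met(uptimes_secs: Sequence[float], inv: UptimeInvariant) -> bool:
    """True if at least X% of instances have been up for Y seconds."""
    if not uptimes_secs:
        return False
    up = _count_up(uptimes_secs, inv)
    return 100.0 * up / len(uptimes_secs) >= inv.min_percent_up


def max_instances_to_update(uptimes_secs: Sequence[float],
                            inv: UptimeInvariant) -> int:
    """How many 'up' instances could be restarted now without breaking
    the invariant (assumed pacing rule, not taken from the ticket)."""
    total = len(uptimes_secs)
    if total == 0:
        return 0
    required = math.ceil(total * inv.min_percent_up / 100.0)
    return max(0, _count_up(uptimes_secs, inv) - required)


if __name__ == "__main__":
    inv = UptimeInvariant(min_percent_up=90, interval_secs=300)
    healthy = [600.0] * 10
    print(invariant_met(healthy, inv), max_instances_to_update(healthy, inv))
```

Under this assumed rule, the updater's pace follows the job's observed health rather than a fixed {{batch_size}}: the more instances satisfy the uptime requirement, the more can be updated at once, and the update stalls while too many instances are flapping.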

