aurora-issues mailing list archives

From "Chris Lambert (JIRA)" <>
Subject [jira] [Updated] (AURORA-1096) Scheduler updater should limit the number of job/instance events
Date Mon, 06 Jul 2015 20:43:05 GMT


Chris Lambert updated AURORA-1096:
    Sprint: Twitter Aurora Q2'15 Sprint 7

> Scheduler updater should limit the number of job/instance events
> ----------------------------------------------------------------
>                 Key: AURORA-1096
>                 URL:
>             Project: Aurora
>          Issue Type: Story
>          Components: Scheduler
>            Reporter: Maxim Khutornenko
>            Assignee: Joe Smith
> Large/flapping scheduler job updates may generate too many events in the update store.
> The update settings are fully controlled by the user, and there is a potential for a misconfigured
> job update to completely overwhelm our in-memory DB storage with job update instance events.

> For example, a large flapping update with {{max_per_shard_failures}} and {{max_total_failures}}
> set to max INT can, when left unattended, quickly consume all available RAM and kill the scheduler.
> A manual cleanup of the scheduler log would be needed to bring the scheduler back up.
> This can be especially relevant with the introduction of update heartbeats (AURORA-690),
> which can further exacerbate the problem (e.g. when {{blockIfNoPulseAfterMs}} is set too low
> relative to the external service pulse rate).
> We need to cap the max per-job lifetime count of {{JobUpdateEvent}} and {{JobInstanceUpdateEvent}}
> instances. A nice bonus would be providing a hint in the UI when the event sequence is cut
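
The cap described above could look roughly like the following. This is only a minimal sketch, not Aurora's actual storage code: the class name {{CappedEventStore}}, its methods, and the string job key are all hypothetical, and evicting the oldest event once the limit is reached is an assumption (the ticket does not say whether old or new events should be dropped). The {{truncated}} flag is one way a UI could learn that the event sequence was cut.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: retains at most maxEventsPerJob events for each job key. */
final class CappedEventStore<E> {
  /** Events retained for one job, plus a flag recording whether any were dropped. */
  private static final class Entry<T> {
    final Deque<T> events = new ArrayDeque<>();
    boolean truncated;
  }

  private final int maxEventsPerJob;
  private final Map<String, Entry<E>> byJob = new HashMap<>();

  CappedEventStore(int maxEventsPerJob) {
    if (maxEventsPerJob <= 0) {
      throw new IllegalArgumentException("maxEventsPerJob must be positive");
    }
    this.maxEventsPerJob = maxEventsPerJob;
  }

  /** Appends an event; once the cap is reached, the oldest event is evicted (assumed policy). */
  void add(String jobKey, E event) {
    Entry<E> entry = byJob.computeIfAbsent(jobKey, k -> new Entry<>());
    if (entry.events.size() == maxEventsPerJob) {
      entry.events.removeFirst();
      entry.truncated = true;
    }
    entry.events.addLast(event);
  }

  /** Returns a copy of the retained events for the job, oldest first. */
  List<E> events(String jobKey) {
    Entry<E> entry = byJob.get(jobKey);
    return entry == null ? new ArrayList<>() : new ArrayList<>(entry.events);
  }

  /** True if events were dropped for this job, so a UI could show a "sequence cut" hint. */
  boolean isTruncated(String jobKey) {
    Entry<E> entry = byJob.get(jobKey);
    return entry != null && entry.truncated;
  }
}
```

With a bound like this, memory used by events grows with the number of jobs times the cap rather than with the number of failures a flapping update produces. A real implementation would also need to keep the persisted scheduler log within the same bound, which this in-memory sketch does not address.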

This message was sent by Atlassian JIRA
