aurora-issues mailing list archives

From "Maxim Khutornenko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1096) Scheduler updater should limit the number of job/instance events
Date Wed, 04 Feb 2015 02:24:34 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304513#comment-14304513 ]

Maxim Khutornenko commented on AURORA-1096:
-------------------------------------------

Right. Hence the second part. If we are to apply failure settings to every instance, we may
penalize large services by not allowing higher failure tolerances. Also, if {{rollback_on_failure}}
is True, should we also account for it in the cap? Perhaps just letting the update proceed
with a warning like "Your update may fail due to exceeding the allowed event cap" could be a
better alternative to rejecting it outright.

> Scheduler updater should limit the number of job/instance events
> ----------------------------------------------------------------
>
>                 Key: AURORA-1096
>                 URL: https://issues.apache.org/jira/browse/AURORA-1096
>             Project: Aurora
>          Issue Type: Story
>          Components: Scheduler
>            Reporter: Maxim Khutornenko
>
> Large/flapping scheduler job updates may generate too many events in the update store.
> The update settings are fully controlled by the user, so a misconfigured job update can
> completely overwhelm our in-memory DB storage with job update instance events.
>
> For example, a large flapping update with {{max_per_shard_failures}} and {{max_total_failures}}
> set to max INT, when left unattended, can quickly consume all available RAM and kill the scheduler.
> A manual cleanup of the scheduler log would be needed to bring the scheduler back up.
>
> This can be especially relevant with the introduction of update heartbeats (AURORA-690),
> which can further exacerbate the problem (e.g. when {{blockIfNoPulseAfterMs}} is set too low
> relative to the external service pulse rate).
>
> We need to cap the max per-job lifetime count of {{JobUpdateEvent}} and {{JobInstanceUpdateEvent}}
> instances. A nice bonus would be providing a hint in the UI when the event sequence is cut
> off.
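
To make the proposed cap concrete, here is a minimal sketch of a per-job event cap with a cut-off flag that a UI could use for the hint. It is not Aurora's implementation: the class name CappedEventStore, its methods, and the string job keys are all hypothetical, and events are plain strings standing in for {{JobUpdateEvent}} / {{JobInstanceUpdateEvent}} records.

```python
from dataclasses import dataclass, field


@dataclass
class CappedEventStore:
    """Hypothetical in-memory store that caps the events kept per job."""

    max_events_per_job: int
    _events: dict = field(default_factory=dict)   # job key -> stored events
    _dropped: dict = field(default_factory=dict)  # job key -> rejected count

    def record(self, job_key: str, event: str) -> bool:
        """Store the event unless the job has already reached its cap.

        Events are never deleted here, so the stored count equals the
        lifetime count for the job. Returns False when the event is dropped.
        """
        stored = self._events.setdefault(job_key, [])
        if len(stored) >= self.max_events_per_job:
            self._dropped[job_key] = self._dropped.get(job_key, 0) + 1
            return False
        stored.append(event)
        return True

    def events(self, job_key: str) -> list:
        """Return a copy of the events kept for the job."""
        return list(self._events.get(job_key, []))

    def truncated(self, job_key: str) -> bool:
        """True when the event sequence was cut off (basis for a UI hint)."""
        return self._dropped.get(job_key, 0) > 0


# Usage: a flapping update emits five events against a cap of three.
store = CappedEventStore(max_events_per_job=3)
results = [store.record("job/a", f"event-{i}") for i in range(5)]
```

Dropping events past the cap, rather than rejecting the update up front, matches the "proceed with a warning" alternative raised in the comment above; whether the real cap should also account for rollback events is left open here.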



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
