sling-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stefan Egli (JIRA)" <>
Subject [jira] [Updated] (SLING-7778) race-condition in scheduled job creation might keep scheduled job in-memory, and cause a duplicate scheduled job execution
Date Mon, 16 Jul 2018 11:59:00 GMT


Stefan Egli updated SLING-7778:
    Attachment: SLING-7778.patch

> race-condition in scheduled job creation might keep scheduled job in-memory, and cause
a duplicate scheduled job execution
> --------------------------------------------------------------------------------------------------------------------------
>                 Key: SLING-7778
>                 URL:
>             Project: Sling
>          Issue Type: Bug
>          Components: Event
>    Affects Versions: Event 4.2.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Major
>             Fix For: Event 4.2.12
>         Attachments: SLING-7778.patch
> h2. case 1 : (extremely unlikely) observation faster than addOrUpdateJob can even finish
> The creation of a scheduled job is handled in [{{JobSchedulerImpl.addScheduledJob}}|]
which in turn [calls {{scheduledJobHandler.addOrUpdateJob}}|].
There, the job is first persisted to the repository via [calling {{writeScheduledJob}}|].
> Normally the sequence of events is then as follows:
>  * [A] the [synchronized block in {{addOrUpdateJob}}|]
is entered and the job is scheduled with quartz.
>  * [B] then shortly afterwards, observation is triggered and {{handleAddOrUpdate}} is
called. This will, as a *side-effect* [add the scheduled job to the {{scheduledJobs}} map|]
>  * the result of this being that the scheduled job is now in {{ScheduledJobHandler.scheduledJobs}}
>  * later on, when the scheduled job is finished, {{handleRemove()}} (or similar) is called,
and it checks if it finds the scheduled job in {{ScheduledJobHandler.scheduledJobs}}. if it
does, all good. And here we're in the all good case, so all is fine
> However, there's the following race-condition:
>  * first [B] happens, ie first the observation event is triggered, so {{handleAddOrUpdate}}
is called, and again the side-effect hits: the scheduled job is added to the {{ScheduledJobHandler.scheduledJobs}}.
all good so far
>  * then [A] continues into the synchronized block. *But* now, the [remove of the scheduled
actually does damage: the scheduled job ends up *not* being in the {{ScheduledJobHandler.scheduledJobs}}
>  * so in this scenario, when a later {{handleRemove()}} (or similar) happens, it can't
find the scheduled job (see [here|]
h being null)
>  * the result being that [{{JobScheduler.unscheduleJob}}|]
is *not invoked*, *leaving the scheduled job with the JobSchedulerImpl*.
> And the result of this race-condition is that {{JobSchedulerImpl.scheduledJobs}} map
has a zombie scheduled job, which it shouldn't actually have. Note *this situation can be
verified in the /system/console/slingevent webconsole* under "Scheduled Jobs".
> So later on for example, a topology event happens, such as a mere PROPERTIES_CHANGED_EVENT.
That in turn will trigger [{{JobSchedulerImpl.configurationChanged()}}|]
which will now *re-schedule the previously scheduled job*. And we have a duplicate scheduled
job execution.
> h2. case 2: (likely) scheduled job finishes before observation event triggered
> Besides case 1, there's also the *more likely* case 2: where a scheduled job finishes
(thus calls {{ScheduledJobHandler.remove()}}) before the observation event for {{ADDED}} was
ever called (leading to {{ScheduledJobHandler.handleAddOrUpdate()}}).
> In this case, the {{remove()}} fails to delete the scheduled job from the (parent) {{JobSchedulerImpl}},
as it doesn't have it in the {{scheduledJob}} list yet.
> Which results - same as case 1 - in an inconsistency leading to a stale {{ScheduledJobInfoImpl}}
in {{JobSchedulerImpl.scheduledJobs}} map - which when a PROPERTIES_CHANGED_EVENT is called
leads to duplicate scheduled job execution.
> Note that this is only likely to happen if the schedule is set for 'very soon now', eg
{{at(NOW+1000)}} or something along those lines. If the schedule is couple minutes in the
future, the likelyhood of observation not triggering before drops towards zero.
> So overall, this scenario is likely to happen only with such 'immediate schedules' and
could be mitigated by increasing the delay to eg 1 or a couple of minutes.

This message was sent by Atlassian JIRA

View raw message