sling-dev mailing list archives

From "Stefan Egli (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SLING-5965) Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
Date Tue, 04 Jul 2017 15:37:00 GMT

     [ https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Egli updated SLING-5965:
-------------------------------
    Attachment: SLING-5965.v3.patch.txt
                numRunningJobs.tiff
                oldestRunningJob.tiff
                timers.tiff
                SchedulerHealthCheck.tiff

Attached [^SLING-5965.v3.patch.txt] 
h4. metrics
* the following metrics exist:
** number of currently running jobs
** oldest currently running job: if a job has been running for longer than a threshold (1000ms by default), a temporary gauge is created for just that slow job, indicating its name
** timers over all jobs
* all of the above are reported
** grouped by thread pool name
** grouped by a configurable filter (to separate certain known slow or frequent jobs for example)
** grouped by slow jobs (auto-detected and auto-created when hit)
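
To make the metrics listed above more concrete, here is a minimal sketch in plain Java of how the number of running jobs, the age of the oldest running job, and the set of slow jobs could be derived from job start times. It is not taken from the attached patch: the class {{RunningJobTracker}}, its method names, and the injected clock are all hypothetical, and it assumes at most one running instance per job name.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical illustration only; this is not the class from the attached patch.
final class RunningJobTracker {
    // job name -> start time in millis (assumes at most one running instance per name)
    private final Map<String, Long> startTimes = new ConcurrentHashMap<>();
    private final LongSupplier clock;          // e.g. System::currentTimeMillis
    private final long slowThresholdMillis;    // 1000ms by default per the message

    RunningJobTracker(LongSupplier clock, long slowThresholdMillis) {
        this.clock = clock;
        this.slowThresholdMillis = slowThresholdMillis;
    }

    void jobStarted(String jobName) {
        startTimes.put(jobName, clock.getAsLong());
    }

    void jobFinished(String jobName) {
        startTimes.remove(jobName);
    }

    // Value for a "number of currently running jobs" gauge.
    int numRunningJobs() {
        return startTimes.size();
    }

    // Value for an "oldest currently running job" gauge; 0 when nothing is running.
    long oldestRunningJobMillis() {
        long now = clock.getAsLong();
        return startTimes.values().stream()
                .mapToLong(start -> now - start)
                .max()
                .orElse(0L);
    }

    // Jobs running longer than the threshold, mapped to their current age in millis.
    // Each entry corresponds to one temporary per-job "slow" gauge.
    Map<String, Long> slowJobs() {
        long now = clock.getAsLong();
        Map<String, Long> slow = new TreeMap<>();
        startTimes.forEach((name, start) -> {
            long age = now - start;
            if (age > slowThresholdMillis) {
                slow.put(name, age);
            }
        });
        return slow;
    }
}
```

Passing the clock in as a {{LongSupplier}} keeps the sketch testable without sleeping; a real integration would more likely hook into the scheduler's job listener callbacks and register the values as gauges with sling.commons.metrics.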

h4. number of running jobs metrics example
!numRunningJobs.tiff|thumbnail!

h4. oldest running job metrics example
!oldestRunningJob.tiff|thumbnail!

h4. timers metrics example
!timers.tiff|thumbnail!

h4. Scheduler Health Check
There's a scheduler health check which does the following:
* if there are 0 running jobs it's all green
* if there are 1 or more running jobs it checks how old the oldest one is
* if the oldest is older than the configured limit (60000ms by default), the health check turns red and tries to extract more information about which job is slow. It does that by listing all {{sling:commons.scheduler.oldest.running.job.millis.slow.}} gauges and showing the age of each (these {{slow}} gauges are auto-created when any of the other {{sling:commons.scheduler.oldest.running.job.millis.}} gauges is accessed).

!SchedulerHealthCheck.tiff|thumbnail!
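
The decision rule of the health check described above can be sketched as follows. This is an illustration under assumptions, not the implementation from the patch: the class {{SchedulerHealthSketch}}, the {{Status}} enum, and the idea of passing gauge values in as a plain map are hypothetical; only the gauge name prefix and the 60000ms default come from the message.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the health check from the attached patch.
final class SchedulerHealthSketch {
    enum Status { OK, CRITICAL }

    // Gauge name prefix as quoted in the message.
    static final String SLOW_GAUGE_PREFIX =
            "sling:commons.scheduler.oldest.running.job.millis.slow.";

    // Green when nothing is running, or when the oldest running job is within the limit;
    // red when the oldest running job is older than the limit (60000ms by default).
    static Status evaluate(int numRunningJobs, long oldestRunningJobMillis, long maxAgeMillis) {
        if (numRunningJobs == 0) {
            return Status.OK;
        }
        return oldestRunningJobMillis > maxAgeMillis ? Status.CRITICAL : Status.OK;
    }

    // From a snapshot of gauge name -> value, keep only the "slow" gauges and
    // return them keyed by the part of the name after the prefix.
    static Map<String, Long> slowGauges(Map<String, Long> gaugeValues) {
        Map<String, Long> slow = new TreeMap<>();
        for (Map.Entry<String, Long> e : gaugeValues.entrySet()) {
            if (e.getKey().startsWith(SLOW_GAUGE_PREFIX)) {
                slow.put(e.getKey().substring(SLOW_GAUGE_PREFIX.length()), e.getValue());
            }
        }
        return slow;
    }
}
```

When {{evaluate}} returns {{CRITICAL}}, the entries from {{slowGauges}} are what would be listed in the health check output to point at the offending jobs.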

reviews very welcome, /cc [~chetanm], [~cziegeler]

> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
>                 Key: SLING-5965
>                 URL: https://issues.apache.org/jira/browse/SLING-5965
>             Project: Sling
>          Issue Type: New Feature
>          Components: Commons
>    Affects Versions: Commons Scheduler 2.5.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Commons Scheduler 2.6.4
>
>         Attachments: numRunningJobs.tiff, oldestRunningJob.tiff, SchedulerHealthCheck.tiff, SLING-5965.patch, SLING-5965.v2.patch.txt, SLING-5965.v3.patch.txt, timers.tiff
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast-running jobs. They are served from a thread-pool and should occupy a thread only for a short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they occupy threads from that thread-pool and thus affect the performance of other scheduled/quartz-jobs.
> We should have metrics (using [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html]) that provide information about the internals of Sling Scheduler, such as the average and max duration of scheduled jobs, how many jobs are currently running, and how long the oldest running job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and flag {{critical}} when, e.g., the oldest job is older than {{60'000ms}} (the default, configurable).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
