sling-dev mailing list archives

From "Carsten Ziegeler (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (SLING-5965) Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
Date Fri, 11 Aug 2017 11:20:00 GMT

     [ https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carsten Ziegeler reopened SLING-5965:
-------------------------------------

Sorry, I couldn't have a look earlier, but I think we should rethink some of the metrics
stuff.
The metrics themselves seem to work fine and use Sling's commons metrics bundle.
However, the SchedulerHealthCheck requires classes from com.codahale.metrics, which breaks the
abstraction of commons metrics. Interestingly, commons metrics is registering the MetricRegistry
as a service, which allows this strange dependency and totally breaks the isolation of commons
metrics.
So I think we need to fix commons metrics wrt the registry and create our own version of
that interface. Then we can use it here.
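
A rough sketch of what "our own version of that interface" could look like, purely as an illustration. All names here (GaugeReader, InternalRegistry, facadeFor) are hypothetical and not part of commons metrics; InternalRegistry is only a stand-in for the third-party registry that would stay private to the bundle. The point is that a consumer such as a health check would depend only on the bundle-owned interface.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class RegistryFacadeSketch {

    /** Hypothetical bundle-owned interface; it exposes no third-party types. */
    public interface GaugeReader {
        Optional<Long> readGauge(String name);
    }

    /** Stand-in (assumption) for the third-party registry kept private to the bundle. */
    public static final class InternalRegistry {
        private final Map<String, Supplier<Long>> gauges = new ConcurrentHashMap<>();

        public void register(String name, Supplier<Long> gauge) {
            gauges.put(name, gauge);
        }

        Supplier<Long> lookup(String name) {
            return gauges.get(name);
        }
    }

    /** Wraps the internal registry so that callers only ever see GaugeReader. */
    public static GaugeReader facadeFor(InternalRegistry registry) {
        return name -> Optional.ofNullable(registry.lookup(name)).map(Supplier::get);
    }

    public static void main(String[] args) {
        InternalRegistry registry = new InternalRegistry();
        // Hypothetical gauge name, used only for this example.
        registry.register("oldestRunningJobMillis", () -> 70_000L);

        GaugeReader reader = facadeFor(registry);
        System.out.println(reader.readGauge("oldestRunningJobMillis")); // prints Optional[70000]
        System.out.println(reader.readGauge("unknown"));                // prints Optional.empty
    }
}
```

With something along these lines, only the facade would be registered as a service, and the health check would no longer need to import com.codahale.metrics.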


> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
>                 Key: SLING-5965
>                 URL: https://issues.apache.org/jira/browse/SLING-5965
>             Project: Sling
>          Issue Type: New Feature
>          Components: Commons
>    Affects Versions: Commons Scheduler 2.5.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Commons Scheduler 2.6.4
>
>         Attachments: numRunningJobs.jpg, oldestRunningJob.jpg, SchedulerHealthCheck.jpg,
> SLING-5965.patch, SLING-5965.v2.patch.txt, SLING-5965.v3.patch.txt, SLING-5965.v4.patch.txt,
> SLING-5965.v5.patch.txt, timers.jpg
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast-running jobs. They are
> served from a thread-pool and should occupy that thread only for a short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they start to occupy
> threads from that thread-pool and thus affect the performance of other scheduled/quartz-jobs.
> We should have metrics (using [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html])
> that provide information about the internals of Sling Scheduler, such as the average, max, etc.
> duration of scheduled jobs, as well as how many jobs are currently running and how long the
> oldest job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and flag {{critical}}
> when, e.g., the oldest job is older than {{60'000ms}} (configurable; this is the default).
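
The threshold rule described in the issue can be sketched in plain Java as below. This is a minimal illustration under assumptions, not the actual SchedulerHealthCheck: the class, the Status enum, and the method names are hypothetical, and job start times are passed in directly instead of being read from a metric. Only the 60'000ms default comes from the issue text.

```java
import java.util.Collection;
import java.util.List;

public class OldestJobCheck {

    /** Hypothetical result type; the real health check API is not modelled here. */
    public enum Status { OK, CRITICAL }

    /** Default from the issue description: 60'000ms, meant to be configurable. */
    public static final long DEFAULT_MAX_AGE_MS = 60_000L;

    /** Returns how long the oldest running job has been running, or 0 if none are running. */
    public static long oldestRunningMillis(Collection<Long> startTimesMs, long nowMs) {
        long oldest = 0L;
        for (long start : startTimesMs) {
            oldest = Math.max(oldest, nowMs - start);
        }
        return oldest;
    }

    /** Flags CRITICAL only when the oldest job is strictly older than the threshold. */
    public static Status evaluate(long oldestMs, long thresholdMs) {
        return oldestMs > thresholdMs ? Status.CRITICAL : Status.OK;
    }

    public static void main(String[] args) {
        long now = 100_000L;
        // Two running jobs, started 5s and 70s before "now" (made-up values).
        List<Long> starts = List.of(95_000L, 30_000L);
        long oldest = oldestRunningMillis(starts, now);
        System.out.println(oldest + " ms -> " + evaluate(oldest, DEFAULT_MAX_AGE_MS));
        // prints "70000 ms -> CRITICAL"
    }
}
```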



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
