David Robinson created AURORA-493:
-------------------------------------
Summary: expose accurate metrics of state transitions
Key: AURORA-493
URL: https://issues.apache.org/jira/browse/AURORA-493
Project: Aurora
Issue Type: Task
Components: Scheduler
Reporter: David Robinson
Priority: Minor
The task store metrics (task_store_*) exposed via http://localhost:8081/vars aren't accurate
enough to be used for alerting purposes. At first glance the task_store_* metrics look like
they could be used to alert on, among other things, an increase in LOST tasks (task_store_LOST),
but the values also decrease as tasks are pruned. When a task becomes lost, task_store_LOST
is incremented, and when that lost task is pruned it is decremented again. If both the increment
and the decrement occur within an alerting system's polling interval, the lost task(s) will
not be captured.
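The missed-alert scenario can be illustrated with a minimal sketch. The class and method names here (TaskStoreGauge, task_lost, task_pruned) are hypothetical and are not taken from the Aurora code base; the sketch only models a value that goes up when a task is lost and down when it is pruned.

```python
# Hypothetical model of a gauge-style metric such as task_store_LOST:
# it reflects how many LOST tasks are currently in the task store.
class TaskStoreGauge:
    def __init__(self):
        self.lost = 0

    def task_lost(self):
        # A task transitions to LOST.
        self.lost += 1

    def task_pruned(self):
        # A LOST task is pruned from the store.
        self.lost -= 1


gauge = TaskStoreGauge()
samples = [gauge.lost]  # first poll by the alerting system

# Between two polls, a task is lost and then pruned.
gauge.task_lost()
gauge.task_pruned()

samples.append(gauge.lost)  # second poll

# Both polls observe the same value, so an alert on "value increased"
# has nothing to trigger on.
missed = samples[0] == samples[1]
```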
Consider adding counters of task state transitions that aren't touched when tasks are pruned.
They should show the total number of tasks that have transitioned through, or terminated
in, each state.
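One possible shape for such counters is sketched below, assuming a simple in-memory model. TransitionCounters, record_transition, and on_prune are hypothetical names used for illustration only and do not describe the scheduler's actual implementation.

```python
from collections import Counter


class TransitionCounters:
    """Hypothetical per-state counters that only ever increase."""

    def __init__(self):
        self._counts = Counter()

    def record_transition(self, state):
        # Called once each time a task enters the given state.
        self._counts[state] += 1

    def on_prune(self, state):
        # Pruning deliberately leaves the counters untouched.
        pass

    def value(self, state):
        return self._counts[state]


counters = TransitionCounters()
before = counters.value("LOST")  # first poll

# A task is lost and pruned within a single polling interval.
counters.record_transition("LOST")
counters.on_prune("LOST")

after = counters.value("LOST")  # second poll
delta = after - before          # the lost task is still visible to the poller
```

Because the value never decreases, an alerting system can compare consecutive samples and alert on any positive difference, regardless of when pruning happens.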
--
This message was sent by Atlassian JIRA
(v6.2#6252)