aurora-issues mailing list archives

From "Isaac Councill (JIRA)" <>
Subject [jira] [Commented] (AURORA-634) Add a monitoring guide
Date Wed, 01 Oct 2014 17:09:34 GMT


Isaac Councill commented on AURORA-634:

+1 for the usefulness of this documentation. Got this from Bill Farner on the dev list:

There's a wealth of instrumentation exposed at /vars on the scheduler.  To
rattle off a few that are a good fit for monitoring:

If this value is increasing at a high rate, it's a sign of trouble.  Note:
this one is not monotonically increasing; it will decrease when old
terminated tasks are GCed.

Must be increasing, rate will depend on cluster size and behavior of other

Detecting resets on this value will tell you that the scheduler is failing
to stay alive.

If no schedulers have a '1' on this, then Aurora is not registered with

This gives you a moving window of log append latency.  A hike in this value
suggests disk IOP contention.

An increase in this value indicates that Aurora is moving tasks into transient
states (e.g. ASSIGNED, KILLING), but not hearing back from Mesos promptly.

A high sustained value here suggests that the machine may be over-utilized.

An increase here indicates internal server errors responding to RPCs and
web UI loading.
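
Several of the checks above come down to comparing two samples of the same
variable: rate of increase, and detecting a reset. A minimal sketch of that
comparison follows, assuming the /vars page is plain text with one
"name value" pair per line (an assumption about the output format, not
confirmed in this thread). The variable names in the sample are hypothetical
placeholders, not real Aurora stat names.

```python
def parse_vars(text):
    """Parse 'name value' lines into a dict of floats.

    Assumes one whitespace-separated name/value pair per line.
    Lines that do not fit, or whose value is not numeric, are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, raw = parts
        try:
            stats[name] = float(raw)
        except ValueError:
            continue  # non-numeric value, e.g. a build label
    return stats


def rate_per_second(prev, curr, elapsed_secs):
    """Average change per second between two samples of one variable."""
    if elapsed_secs <= 0:
        raise ValueError("elapsed_secs must be positive")
    return (curr - prev) / elapsed_secs


def was_reset(prev, curr):
    """True if a value expected to only grow has gone backwards."""
    return curr < prev


# Hypothetical samples taken 60 seconds apart; names are placeholders.
earlier = parse_vars("example_counter 10\nexample_uptime 120\nbuild_label abc\n")
later = parse_vars("example_counter 40\nexample_uptime 5\n")

counter_rate = rate_per_second(
    earlier["example_counter"], later["example_counter"], 60
)
restarted = was_reset(earlier["example_uptime"], later["example_uptime"])
```

In practice the text would come from an HTTP GET of /vars on each scheduler,
and the thresholds for "high rate" would be tuned per cluster.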

> Add a monitoring guide
> ----------------------
>                 Key: AURORA-634
>                 URL:
>             Project: Aurora
>          Issue Type: Story
>          Components: Documentation
>            Reporter: Bill Farner
> Aurora provides a wealth of undocumented telemetry that is useful in monitoring a cluster.
> Add documentation about some of the recommended variables to use for monitoring and alerting.
