aurora-issues mailing list archives

From "Isaac Councill (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-634) Add a monitoring guide
Date Wed, 01 Oct 2014 17:09:34 GMT

    [ https://issues.apache.org/jira/browse/AURORA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155130#comment-14155130 ]

Isaac Councill commented on AURORA-634:
---------------------------------------

+1 for the usefulness of this documentation. I got this from Bill Farner on the dev list:

There's a wealth of instrumentation exposed at /vars on the scheduler.  To
rattle off a few that are a good fit for monitoring:

task_store_LOST
If this value is increasing at a high rate, it's a sign of trouble.  Note:
this one is not monotonically increasing; it will decrease when old
terminated tasks are GCed.

scheduler_resource_offers
Must be increasing; the rate will depend on cluster size and the behavior
of other frameworks.

jvm_uptime_secs
Detecting resets on this value will tell you that the scheduler is failing
to stay alive.

framework_registered
If no schedulers have a '1' on this, then Aurora is not registered with
Mesos.

rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)
This gives you a moving window of log append latency.  A hike in this value
suggests disk IOPS contention.

timed_out_tasks
An increase in this value indicates that Aurora is moving tasks into transient
states (e.g. ASSIGNED, KILLING), but not hearing back from Mesos promptly.

system_load_avg
A high sustained value here suggests that the machine may be over-utilized.

http_500_responses_events
An increase here indicates internal server errors when responding to RPCs
and loading the web UI.
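
As a rough sketch of how a few of the checks above could be computed from two
successive samples of /vars: the Python below assumes the endpoint returns
plain text with one "name value" pair per line, and the helper names
(parse_vars, rate, was_reset, append_latency_nanos) are made up for
illustration and are not part of Aurora. Fetching the samples and choosing
alert thresholds are left out.

```python
from typing import Dict, Optional


def parse_vars(text: str) -> Dict[str, float]:
    """Parse 'name value' lines (assumed /vars format) into a dict.

    Lines whose value is not numeric are skipped.
    """
    stats: Dict[str, float] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, raw = parts
        try:
            stats[name] = float(raw)
        except ValueError:
            continue  # e.g. version strings; not useful for these checks
    return stats


def rate(prev: Dict[str, float], curr: Dict[str, float],
         name: str, interval_secs: float) -> float:
    """Per-second change of one variable between two samples."""
    return (curr[name] - prev[name]) / interval_secs


def was_reset(prev: Dict[str, float], curr: Dict[str, float]) -> bool:
    """True if jvm_uptime_secs went backwards, i.e. the scheduler restarted."""
    return curr["jvm_uptime_secs"] < prev["jvm_uptime_secs"]


def append_latency_nanos(prev: Dict[str, float],
                         curr: Dict[str, float]) -> Optional[float]:
    """Mean log append latency over the sampling window, in nanoseconds.

    Returns None when no append events happened in the window.
    """
    events = (curr["scheduler_log_native_append_events"]
              - prev["scheduler_log_native_append_events"])
    if events <= 0:
        return None
    nanos = (curr["scheduler_log_native_append_nanos_total"]
             - prev["scheduler_log_native_append_nanos_total"])
    return nanos / events
```

Because both rates in the latency expression are taken over the same window,
the interval cancels out and the ratio reduces to the change in nanos_total
divided by the change in events. The same rate helper can be pointed at
task_store_LOST, scheduler_resource_offers, timed_out_tasks or
http_500_responses_events.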


> Add a monitoring guide
> ----------------------
>
>                 Key: AURORA-634
>                 URL: https://issues.apache.org/jira/browse/AURORA-634
>             Project: Aurora
>          Issue Type: Story
>          Components: Documentation
>            Reporter: Bill Farner
>
> Aurora provides a wealth of undocumented telemetry that is useful in monitoring a cluster.
> Add documentation about some of the recommended variables to use for monitoring and alerting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
