aurora-dev mailing list archives

From Bill Farner <>
Subject Re: monitoring aurora scheduler
Date Tue, 30 Sep 2014 21:33:24 GMT
Firstly, please chime in on AURORA-634 to nudge us to formally document this.

There's a wealth of instrumentation exposed at /vars on the scheduler.  To
rattle off a few that are a good fit for monitoring:

If this value is increasing at a high rate, it's a sign of trouble.  Note:
this one is not monotonically increasing; it will decrease when old
terminated tasks are GCed.
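A minimal sketch of how you might alert on a counter like this one, which rises under trouble but can legitimately drop when old tasks are GCed. The function name, threshold, and sampling interval are illustrative assumptions, not anything from this email:

```python
# Sketch: flag a stat that is rising quickly, while tolerating the
# occasional decrease caused by GC of old terminated tasks.
# Names and thresholds here are illustrative assumptions.

def rising_too_fast(prev, curr, interval_secs, max_rate_per_sec):
    """Return True if the counter grew faster than max_rate_per_sec.

    Decreases (e.g. after a GC pass) are ignored rather than being
    treated as negative growth.
    """
    delta = curr - prev
    if delta <= 0:
        return False  # the store shrank (GC); not a growth signal
    return delta / interval_secs > max_rate_per_sec

# Example: 300 new entries over a 60s window vs. a 1/sec budget.
print(rising_too_fast(1000, 1300, 60, 1.0))  # True
print(rising_too_fast(1300, 1200, 60, 1.0))  # False (GC shrank it)
```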

Must be increasing; the rate will depend on cluster size and the behavior of
other frameworks.

Detecting resets on this value will tell you that the scheduler is failing
to stay alive.
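One way to detect such resets is to watch successive samples of a monotonically increasing stat and count the drops. A sketch, with the polling setup assumed rather than taken from the email:

```python
# Sketch: detect scheduler restarts by watching for resets in a
# monotonically increasing stat (e.g. an uptime counter).
# The sample stream below is an illustrative assumption.

def detect_resets(samples):
    """Count how many times the value dropped between samples.

    A drop means the process restarted and the counter started over;
    repeated resets mean the scheduler is failing to stay alive.
    """
    resets = 0
    for prev, curr in zip(samples, samples[1:]):
        if curr < prev:
            resets += 1
    return resets

# Two restarts hidden in this sample stream:
print(detect_resets([10, 70, 130, 5, 65, 3]))  # 2
```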

If no schedulers have a '1' on this, then Aurora is not registered with
mesos.
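A sketch of that check across replicas. The email only says that zero '1's means Aurora is unregistered; requiring exactly one '1' (only the leading scheduler registers) is an added assumption about leader election:

```python
# Sketch: given this stat's value from each scheduler replica,
# alert unless exactly one replica reports 1.
# Requiring *exactly* one is an assumption beyond the email, which
# only calls out the all-zeros case.

def registered_ok(values_per_scheduler):
    """True iff exactly one replica reports a value of 1."""
    return sum(1 for v in values_per_scheduler if v == 1) == 1

print(registered_ok([0, 1, 0]))  # True: one leader is registered
print(registered_ok([0, 0, 0]))  # False: not registered with mesos
```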

This gives you a moving window of log append latency.  A hike in this value
suggests disk IOP contention.
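To illustrate the moving-window idea, here is a minimal fixed-size window over latency samples; a hike shows up as a jump in the window average. Window size and units are illustrative assumptions:

```python
from collections import deque

# Sketch: a fixed-size moving window over latency samples, so a
# latency hike (possible disk IOP contention) shows up as a jump
# in the window average. Size and units are illustrative.

class MovingWindow:
    def __init__(self, size):
        self.samples = deque(maxlen=size)  # old samples fall off

    def add(self, value):
        self.samples.append(value)

    def average(self):
        return sum(self.samples) / len(self.samples)

w = MovingWindow(3)
for latency_ms in (5, 5, 5, 50):  # the last append is the hike
    w.add(latency_ms)
print(w.average())  # 20.0 -- window now holds 5, 5, 50
```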

Increase in this value indicates that Aurora is moving tasks into transient
states (e.g. ASSIGNED, KILLING), but not hearing back from mesos promptly.

A high sustained value here suggests that the machine may be over-utilized.

An increase here indicates internal server errors when responding to RPCs and
serving the web UI.
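All of the checks above reduce to sampling /vars and comparing successive values. A minimal polling sketch, assuming the plain-text "name value" line format for /vars and the port (8081) mentioned later in the thread; the stat names in the example are illustrative:

```python
import urllib.request

# Sketch: pull the scheduler's /vars page and parse it into a dict.
# Assumes the plain-text "name value" line format; host/port and the
# example stat names are illustrative assumptions.

def fetch_vars(host="localhost", port=8081):
    url = "http://%s:%d/vars" % (host, port)
    with urllib.request.urlopen(url) as resp:
        return parse_vars(resp.read().decode("utf-8"))

def parse_vars(text):
    """Parse 'name value' lines; keep numeric values as floats."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if not name:
            continue  # skip blank lines
        try:
            stats[name] = float(value)
        except ValueError:
            stats[name] = value  # some vars are strings
    return stats

# Parsing only (no live scheduler needed):
print(parse_vars("jvm_uptime_secs 4242\nbuild_branch master"))
```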

I'd love to know more about the specific issue you encountered.  Do the
scheduler logs indicate anything unusual during the period of downtime?


On Tue, Sep 30, 2014 at 1:59 PM, Isaac Councill <> wrote:

> I've been having a bad time with the great AWS Xen reboot, and thought it
> would be a good time to revamp monitoring among other things.
> Do you have any recommendations for monitoring scheduler health? I've got
> my own ideas, but am more interested in learning about twitter prod
> monitoring.
> For context, last night's failure:
> Running aurora-scheduler from head, cut last week. Could find the exact
> commit if interesting. Triple scheduler replication.
> 1) All cluster machines (mesos, aurora, zk) rebooted at once. Single AZ for
> this cluster.
> 2) mesos, zk came back online ok but aurora did not.
> 3) scheduler process and UI started but scheduler was unhealthy. Current
> monitoring cleared the down event because the processes were alive and
> answering 8081.
> 4) recovery was not possible until I downgraded to 0.5.0-incubating, at
> which point full recovery was made.
