aurora-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Isaac Councill <>
Subject monitoring aurora scheduler
Date Tue, 30 Sep 2014 20:59:56 GMT
I've been having a bad time with the great AWS Xen reboot, and thought it
would be a good time to revamp monitoring among other things.

Do you have any recommendations for monitoring scheduler health? I've got
my own ideas, but am more interested in learning about twitter prod

For context, last night's failure:

Running aurora-scheduler from head, cut last week. Could find the exact
commit if interesting. Triple scheduler replication.

1) All cluster machines (mesos, aurora, zk) rebooted at once. Single AZ for
this cluster.
2) mesos, zk came back online ok but aurora did not.
3) scheduler process and UI started but scheduler was unhealthy. Current
monitoring cleared the down event because the processes were alive and
answering 8081.
4) recovery was not possible until I downgraded to 0.5.0-incubating, at
which point full recovery was made.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message