aurora-dev mailing list archives

From Isaac Councill <is...@hioscar.com>
Subject Re: monitoring aurora scheduler
Date Wed, 01 Oct 2014 18:04:08 GMT
Thanks! Comment dropped on AURORA-634.

As for the error I encountered, I saw "Storage is not READY" exceptions on
all scheduler instances, and no leader was elected. Nothing else jumped out
as unusual in the logs: no ZK_* warnings/errors, etc.

Aurora came up before zookeeper, but it polled until zk was available.
Aurora also came up before a mesos master was available and committed
suicide on registration failure; Monit eventually restarted the service, so
that shouldn't have been a problem.

Sadly, I've had to abandon full diagnosis due to time constraints.

On Tue, Sep 30, 2014 at 5:33 PM, Bill Farner <wfarner@apache.org> wrote:

> Firstly, please chime in on AURORA-634 to nudge us to formally document
> this.
>
> There's a wealth of instrumentation exposed at /vars on the scheduler.  To
> rattle off a few that are a good fit for monitoring:
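>
> Several of these are paired below with quick Python sketches to
> illustrate. They all build on this helper (fetch_vars is just an
> illustrative name; it assumes the scheduler answers HTTP on 8081, as
> elsewhere in this thread, and that /vars emits one "name value" pair
> per line):
>
>     import urllib.request
>
>     def fetch_vars(host="localhost", port=8081):
>         # Scrape /vars into a {metric name: value string} dict,
>         # assuming one whitespace-separated pair per line.
>         body = urllib.request.urlopen(
>             "http://%s:%d/vars" % (host, port)).read().decode()
>         stats = {}
>         for line in body.splitlines():
>             name, _, value = line.partition(" ")
>             stats[name] = value
>         return stats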
>
> task_store_LOST
> If this value is increasing at a high rate, it's a sign of trouble.  Note:
> this one is not monotonically increasing; it will decrease when old
> terminated tasks are GCed.
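>
> A quick sketch of that check, reusing the fetch_vars helper above (the
> sampling interval and alert threshold are up to you):
>
>     import time
>
>     def lost_task_rate(interval_secs=60):
>         # Sample task_store_LOST twice and report the increase per
>         # minute; decreases (GC of old terminated tasks) are ignored.
>         before = int(fetch_vars()["task_store_LOST"])
>         time.sleep(interval_secs)
>         after = int(fetch_vars()["task_store_LOST"])
>         return max(0, after - before) * 60.0 / interval_secs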
>
> scheduler_resource_offers
> Must be increasing; the rate will depend on cluster size and the behavior
> of other frameworks.
>
> jvm_uptime_secs
> Detecting resets on this value will tell you that the scheduler is failing
> to stay alive.
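>
> For example, keeping the previous sample around and flagging any drop
> (illustrative, reusing fetch_vars):
>
>     def uptime_reset(prev_uptime_secs):
>         # A current uptime lower than the previous sample means the
>         # JVM restarted in between.
>         current = int(fetch_vars()["jvm_uptime_secs"])
>         return current < prev_uptime_secs, current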
>
> framework_registered
> If no scheduler reports a '1' for this, then Aurora is not registered with
> mesos.
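>
> A sketch of that check across all your scheduler hosts (again reusing
> fetch_vars; the host list is whatever your deployment uses):
>
>     def any_scheduler_registered(hosts):
>         # True if at least one scheduler reports framework_registered=1.
>         return any(
>             fetch_vars(host=h).get("framework_registered") == "1"
>             for h in hosts)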
>
> rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)
> This gives you a moving window of log append latency. A spike in this value
> suggests disk IOP contention.
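>
> One way to compute that window by hand, sampling both counters via
> fetch_vars and dividing the deltas:
>
>     import time
>
>     def append_latency_ms(interval_secs=60):
>         # Average log append latency over the window, in milliseconds.
>         a = fetch_vars()
>         time.sleep(interval_secs)
>         b = fetch_vars()
>         nanos = (int(b["scheduler_log_native_append_nanos_total"]) -
>                  int(a["scheduler_log_native_append_nanos_total"]))
>         events = (int(b["scheduler_log_native_append_events"]) -
>                   int(a["scheduler_log_native_append_events"]))
>         return (nanos / events) / 1e6 if events else 0.0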
>
> timed_out_tasks
> An increase in this value indicates that Aurora is moving tasks into
> transient states (e.g. ASSIGNED, KILLING), but not hearing back from mesos
> promptly.
>
> system_load_avg
> A high sustained value here suggests that the machine may be over-utilized.
>
> http_500_responses_events
> An increase here indicates internal server errors while serving RPCs and
> web UI requests.
>
> I'd love to know more about the specific issue you encountered.  Do the
> scheduler logs indicate anything unusual during the period of downtime?
>
>
> -=Bill
>
> On Tue, Sep 30, 2014 at 1:59 PM, Isaac Councill <isaac@hioscar.com> wrote:
>
> > I've been having a bad time with the great AWS Xen reboot, and thought it
> > would be a good time to revamp monitoring among other things.
> >
> > Do you have any recommendations for monitoring scheduler health? I've got
> > my own ideas, but am more interested in learning about Twitter prod
> > monitoring.
> >
> >
> > For context, last night's failure:
> >
> > Running aurora-scheduler from head, cut last week; I could find the exact
> > commit if it's of interest. Triple scheduler replication.
> >
> > 1) All cluster machines (mesos, aurora, zk) rebooted at once. Single AZ for
> > this cluster.
> > 2) mesos, zk came back online ok but aurora did not.
> > 3) scheduler process and UI started but scheduler was unhealthy. Current
> > monitoring cleared the down event because the processes were alive and
> > answering 8081.
> > 4) recovery was not possible until I downgraded to 0.5.0-incubating, at
> > which point full recovery was made.
> >
>
