aurora-issues mailing list archives

From "Justin Pinkul (JIRA)" <>
Subject [jira] [Commented] (AURORA-1602) Add aurora_admin command to trigger reconciliation
Date Wed, 31 Aug 2016 23:37:20 GMT


Justin Pinkul commented on AURORA-1602:

We recently hit a failure scenario in Mesos where this tool would have helped. Our cluster
experienced a full power outage, and the Aurora scheduler recovered faster than
all of the Mesos agents. When Aurora sent its explicit task reconciliation request to
the Mesos master, the master had to drop the request because nodes were in a transitory
state. That state lasted only a couple of minutes longer, but Aurora did not attempt reconciliation
again for another reconciliation_explicit_interval (in our case, the default of 60 minutes). As a
result, it took the Aurora scheduler an extra hour to reschedule existing jobs that
were lost in the power outage. If there had been a tool to trigger this reconciliation manually,
the cluster could have recovered faster.
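
To make the timing concrete, here is a rough sketch of the scenario as a toy timeline model. It is not Aurora or Mesos code: the function name and its parameters are hypothetical, the 60-minute interval is the default mentioned above, and the 2-minute master recovery time is an assumed stand-in for "a couple of minutes".

```python
from typing import Optional


def first_accepted_reconciliation(
    interval_min: int,
    master_ready_at_min: int,
    manual_trigger_min: Optional[int] = None,
) -> int:
    """Return the minute at which a reconciliation request is first accepted.

    Toy model (hypothetical, not the scheduler's real logic):
    - scheduled attempts happen at minute 0, interval_min, 2 * interval_min, ...
    - the master drops every request sent before master_ready_at_min
    - an optional manual trigger adds one extra attempt at manual_trigger_min
    """
    # Walk the fixed schedule until an attempt lands after the master is ready.
    scheduled = 0
    while scheduled < master_ready_at_min:
        scheduled += interval_min

    # A manual trigger only helps if it is sent once the master accepts requests.
    if manual_trigger_min is not None and manual_trigger_min >= master_ready_at_min:
        return min(scheduled, manual_trigger_min)
    return scheduled


# Fixed schedule only: the attempt at minute 0 is dropped, the next one is at minute 60.
print(first_accepted_reconciliation(interval_min=60, master_ready_at_min=2))

# With an operator-triggered attempt at minute 3, recovery no longer waits for the interval.
print(first_accepted_reconciliation(interval_min=60, master_ready_at_min=2, manual_trigger_min=3))
```

Under these assumptions the fixed schedule alone delays reconciliation until minute 60, while a manual trigger shortly after the master stabilizes brings it down to a few minutes.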

> Add aurora_admin command to trigger reconciliation 
> ---------------------------------------------------
>                 Key: AURORA-1602
>                 URL:
>             Project: Aurora
>          Issue Type: Task
>          Components: Client
>            Reporter: Zameer Manji
> Currently reconciliation runs on a fixed schedule. Adding an admin RPC to trigger it
> is useful for operators who want to speed up cluster recovery.

