aurora-commits mailing list archives

Subject git commit: Documenting SLA stats.
Date Wed, 25 Jun 2014 19:31:21 GMT
Repository: incubator-aurora
Updated Branches:
  refs/heads/master e374b7107 -> 9d211f620

Documenting SLA stats.

Bugs closed: AURORA-528

Reviewed at


Branch: refs/heads/master
Commit: 9d211f620d29a2331193d39faa55e46e5c42257f
Parents: e374b71
Author: Maxim Khutornenko <>
Authored: Wed Jun 25 12:31:11 2014 -0700
Committer: Maxim Khutornenko <>
Committed: Wed Jun 25 12:31:11 2014 -0700

 docs/ | 176 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 176 insertions(+)
diff --git a/docs/ b/docs/
new file mode 100644
index 0000000..14e9108
--- /dev/null
+++ b/docs/
@@ -0,0 +1,176 @@
+Aurora SLA Measurement
+- [Overview](#overview)
+- [Metric Details](#metric-details)
+  - [Platform Uptime](#platform-uptime)
+  - [Job Uptime](#job-uptime)
+  - [Median Time To Assigned (MTTA)](#median-time-to-assigned-\(mtta\))
+  - [Median Time To Running (MTTR)](#median-time-to-running-\(mttr\))
+- [Limitations](#limitations)
+## Overview
+The primary goal of the feature is collection and monitoring of Aurora job SLA (Service Level
+Agreements) metrics that define a contractual relationship between the Aurora/Mesos platform
+and hosted services.
+The Aurora SLA feature currently supports stat collection only for service (non-cron)
+production jobs (`"production = True"` in your `.aurora` config).
+Counters that track SLA measurements are computed periodically within the scheduler.
+The individual instance metrics are refreshed every minute (configurable via
+`sla_stat_refresh_interval`). The instance counters are subsequently aggregated by
+relevant grouping types before being exported to the scheduler `/vars` endpoint (when using
+`vagrant` that would be ``).
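As a rough illustration of consuming these counters, the sketch below filters SLA stats out of a `/vars` dump. It assumes the endpoint returns plain text with one `name value` pair per line; the `parse_sla_vars` helper and the sample text are hypothetical and not part of Aurora.

```python
def parse_sla_vars(text):
    """Return {name: float} for every stat whose name starts with 'sla_'.

    Assumption: each line of the /vars output looks like "name value".
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip lines that do not match the assumed two-field format.
        if len(parts) != 2 or not parts[0].startswith("sla_"):
            continue
        try:
            stats[parts[0]] = float(parts[1])
        except ValueError:
            continue  # non-numeric value, ignore it
    return stats


# Hypothetical sample output, for illustration only.
sample = """\
jvm_uptime_secs 1234
sla_cluster_platform_uptime_percent 99.95
sla_cluster_mtta_ms 850
"""

if __name__ == "__main__":
    for name, value in sorted(parse_sla_vars(sample).items()):
        print(name, value)
```

In practice the text would come from an HTTP GET against the scheduler, and the result could be forwarded to whatever monitoring system is in use.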
+## Metric Details
+### Platform Uptime
+*Aggregate amount of time a job spends in a non-runnable state due to platform unavailability
+or scheduling delays. This metric tracks Aurora/Mesos uptime performance and reflects
+system-caused downtime events (tasks LOST or DRAINED). Any user-initiated task kills/restarts
+will not degrade this metric.*
+**Collection scope:**
+* Per job - `sla_<job_key>_platform_uptime_percent`
+* Per cluster - `sla_cluster_platform_uptime_percent`
+**Units:** percent
+A fault in the task environment may cause Aurora and Mesos to have different views on the
+task state
+or lose track of the task existence. In such cases, the service task is marked as LOST and
+rescheduled by Aurora. For example, this may happen when the task stays in ASSIGNED or STARTING
+for too long or the Mesos slave becomes unhealthy (or disappears completely). The time between
+task entering LOST and its replacement reaching RUNNING state is counted towards platform
+downtime.
+Another example of a platform downtime event is the administrator-requested task rescheduling.
+This happens during planned Mesos slave maintenance when all slave tasks are marked as DRAINED
+and rescheduled elsewhere.
+To accurately calculate Platform Uptime, we must separate platform-incurred downtime from
+user actions that put a service instance in a non-operational state. It is simpler to isolate
+user-incurred downtime and treat all other downtime as platform-incurred.
+Currently, a user can cause a healthy service (task) downtime in only two ways: via `killTasks`
+or `restartShards` RPCs. For both, their affected tasks leave an audit state transition trail
+relevant to uptime calculations. By applying a special "SLA meaning" to exposed task state
+transition records, we can build a deterministic downtime trace for every given service instance.
+A task going through a state transition carries one of three possible SLA meanings
+(see [](../src/main/java/org/apache/aurora/scheduler/sla/
+sla-to-task-state mapping):
+* Task is UP: starts a period where the task is considered to be up and running from the
+  platform standpoint.
+* Task is DOWN: starts a period where the task cannot reach the UP state for some
+  non-user-related reason. Counts towards instance downtime.
+* Task is REMOVED from SLA: starts a period where the task is not expected to be UP due to
+  user-initiated action or failure. We ignore this period for uptime calculation purposes.
+This metric is recalculated over the last sampling period (last minute) to account for
+any UP/DOWN/REMOVED events. It ignores any UP/DOWN events not immediately adjacent to the
+sampling interval as well as adjacent REMOVED events.
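The three SLA meanings can be turned into an uptime figure roughly as follows. This is a simplified sketch, not the scheduler's algorithm: it assumes each instance has a time-sorted list of `(timestamp, meaning)` pairs, counts UP and DOWN time inside the sampling window, and leaves REMOVED time out of both numerator and denominator. The `platform_uptime_percent` function and its inputs are hypothetical.

```python
def platform_uptime_percent(events, start, end):
    """Percent of counted time spent UP inside the window [start, end].

    events: list of (timestamp_sec, meaning) sorted by timestamp, where
    meaning is "UP", "DOWN" or "REMOVED" (assumed representation).
    """
    up = down = 0.0
    state = None      # meaning in effect at the cursor position
    cursor = start

    def account(until):
        nonlocal up, down
        if state == "UP":
            up += until - cursor
        elif state == "DOWN":
            down += until - cursor
        # REMOVED (or unknown) time is ignored entirely.

    for ts, meaning in events:
        if ts <= start:
            state = meaning   # latest event before the window sets the initial state
            continue
        if ts >= end:
            break
        account(ts)
        cursor, state = ts, meaning

    account(end)              # tail of the window after the last event
    counted = up + down
    # With no UP or DOWN time to count, report no downtime.
    return 100.0 if counted == 0 else 100.0 * up / counted


if __name__ == "__main__":
    trace = [(0, "UP"), (70, "DOWN"), (100, "UP")]
    print(platform_uptime_percent(trace, 60, 120))
```

Treating an empty window as 100% is a design choice of this sketch; a real implementation might skip the sample instead.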
+### Job Uptime
+*Percentage of the job instances considered to be in RUNNING state for the specified duration
+relative to request time. This is a purely application-side metric that considers the aggregate
+uptime of all RUNNING instances. Any user- or platform-initiated restarts directly affect
+this metric.*
+**Collection scope:** We currently expose job uptime values at 5 pre-defined
+percentiles (50th, 75th, 90th, 95th and 99th):
+* `sla_<job_key>_job_uptime_50_00_sec`
+* `sla_<job_key>_job_uptime_75_00_sec`
+* `sla_<job_key>_job_uptime_90_00_sec`
+* `sla_<job_key>_job_uptime_95_00_sec`
+* `sla_<job_key>_job_uptime_99_00_sec`
+**Units:** seconds
+You can also get customized real-time stats from the Aurora client. See `aurora sla -h` for
+more details.
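One way to read a value such as `sla_<job_key>_job_uptime_75_00_sec` is "75% of the job's instances have been RUNNING for at least this many seconds". The sketch below computes that figure under this interpretation, which is an assumption rather than a statement of how the scheduler does it. `job_uptime_at_percentile` is a hypothetical helper and expects an integer percentile.

```python
def job_uptime_at_percentile(uptimes_sec, percentile):
    """Largest uptime that at least `percentile` percent of instances have reached.

    uptimes_sec: current uptime in seconds of each RUNNING instance.
    percentile: integer in 1..100 (assumption of this sketch).
    """
    if not uptimes_sec:
        return 0
    ordered = sorted(uptimes_sec, reverse=True)  # longest-running first
    n = len(ordered)
    # Integer ceiling of percentile * n / 100: how many instances must qualify.
    needed = max(1, (percentile * n + 99) // 100)
    return ordered[needed - 1]


if __name__ == "__main__":
    uptimes = [100, 200, 300, 400]
    for p in (50, 75, 90, 95, 99):
        print(p, job_uptime_at_percentile(uptimes, p))
```

Under this reading a higher percentile can only produce an equal or smaller uptime value, since more instances have to satisfy it.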
+### Median Time To Assigned (MTTA)
+*Median time a job spends waiting for its tasks to be assigned to a host. This is a combined
+metric that helps track the dependency of scheduling performance on the requested resources
+(user scope) as well as the internal scheduler bin-packing algorithm efficiency (platform
+scope).*
+**Collection scope:**
+* Per job - `sla_<job_key>_mtta_ms`
+* Per cluster - `sla_cluster_mtta_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
+  * By CPU:
+    * `sla_cpu_small_mtta_ms`
+    * `sla_cpu_medium_mtta_ms`
+    * `sla_cpu_large_mtta_ms`
+    * `sla_cpu_xlarge_mtta_ms`
+    * `sla_cpu_xxlarge_mtta_ms`
+  * By RAM:
+    * `sla_ram_small_mtta_ms`
+    * `sla_ram_medium_mtta_ms`
+    * `sla_ram_large_mtta_ms`
+    * `sla_ram_xlarge_mtta_ms`
+    * `sla_ram_xxlarge_mtta_ms`
+  * By DISK:
+    * `sla_disk_small_mtta_ms`
+    * `sla_disk_medium_mtta_ms`
+    * `sla_disk_large_mtta_ms`
+    * `sla_disk_xlarge_mtta_ms`
+    * `sla_disk_xxlarge_mtta_ms`
+**Units:** milliseconds
+MTTA only considers instances that have already reached ASSIGNED state and ignores those
+that are still PENDING. This ensures straggler instances (e.g. with unreasonable resource
+constraints) do not affect metric curves.
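The rule of skipping PENDING instances can be sketched as a median over only those instances that already have an assignment time. The record layout (`pending_ms`, `assigned_ms`) and the function name are hypothetical, chosen for illustration.

```python
from statistics import median


def median_time_to_assigned_ms(instances):
    """Median wait between PENDING and ASSIGNED, in milliseconds.

    instances: list of dicts with a 'pending_ms' timestamp and, once the
    instance has been assigned, an 'assigned_ms' timestamp (assumed layout).
    Instances that are still PENDING are left out of the calculation.
    """
    waits = [
        inst["assigned_ms"] - inst["pending_ms"]
        for inst in instances
        if inst.get("assigned_ms") is not None
    ]
    return median(waits) if waits else None


if __name__ == "__main__":
    job = [
        {"pending_ms": 0, "assigned_ms": 100},
        {"pending_ms": 0, "assigned_ms": 300},
        {"pending_ms": 0},  # still PENDING, ignored
    ]
    print(median_time_to_assigned_ms(job))
```

Returning `None` when nothing has been assigned yet avoids reporting a misleading zero; that choice belongs to this sketch only.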
+### Median Time To Running (MTTR)
+*Median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric
+reflecting the overall time it takes for Aurora/Mesos to start executing user content.*
+**Collection scope:**
+* Per job - `sla_<job_key>_mttr_ms`
+* Per cluster - `sla_cluster_mttr_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
+  * By CPU:
+    * `sla_cpu_small_mttr_ms`
+    * `sla_cpu_medium_mttr_ms`
+    * `sla_cpu_large_mttr_ms`
+    * `sla_cpu_xlarge_mttr_ms`
+    * `sla_cpu_xxlarge_mttr_ms`
+  * By RAM:
+    * `sla_ram_small_mttr_ms`
+    * `sla_ram_medium_mttr_ms`
+    * `sla_ram_large_mttr_ms`
+    * `sla_ram_xlarge_mttr_ms`
+    * `sla_ram_xxlarge_mttr_ms`
+  * By DISK:
+    * `sla_disk_small_mttr_ms`
+    * `sla_disk_medium_mttr_ms`
+    * `sla_disk_large_mttr_ms`
+    * `sla_disk_xlarge_mttr_ms`
+    * `sla_disk_xxlarge_mttr_ms`
+**Units:** milliseconds
+MTTR only considers instances in RUNNING state. This ensures straggler instances (e.g. with
+unreasonable resource constraints) do not affect metric curves.
+## Limitations
+* The availability of Aurora SLA metrics is bound by the scheduler availability.
+* All metrics are calculated at a pre-defined interval (currently set at 1 minute).
+  Scheduler restarts may result in missed collections.
\ No newline at end of file
