aurora-commits mailing list archives

From zma...@apache.org
Subject aurora git commit: Add MEDIAN_TIME_TO_STARTING as a new metric.
Date Tue, 06 Sep 2016 19:26:00 GMT
Repository: aurora
Updated Branches:
  refs/heads/master 059b08621 -> 0c90c862a


Add MEDIAN_TIME_TO_STARTING as a new metric.

A new MTTS (Median Time To Starting) metric is added to the sla module in
addition to MTTA and MTTR.

This review request is related to my previous review request:
https://reviews.apache.org/r/51536

In the new implementation, the executor starts health checks at STARTING. If a
successful health check is performed before initial_interval_sec expires, the
task transitions into RUNNING state. Therefore, MTTS gives us an idea of how
long it takes for a task to become active, whereas the difference between MTTR
and MTTS represents the warm-up period for a task.
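
As a rough illustration of the warm-up estimate described above, the sketch
below subtracts the cluster-level MTTS from the cluster-level MTTR. The metric
names come from the docs change in this commit; the class name, the map of
stats, and the sample values are hypothetical and are not part of Aurora.

```java
import java.util.Map;

final class WarmUpSketch {
  // Warm-up estimate in milliseconds: MTTR minus MTTS.
  // The stats map is a hypothetical snapshot keyed by metric name.
  static long warmUpMs(Map<String, Long> stats) {
    return stats.get("sla_cluster_mttr_ms") - stats.get("sla_cluster_mtts_ms");
  }

  public static void main(String[] args) {
    // Made-up sample values, not measurements from a real cluster.
    Map<String, Long> stats = Map.of(
        "sla_cluster_mtts_ms", 200L,
        "sla_cluster_mttr_ms", 300L);
    System.out.println(warmUpMs(stats)); // prints 100
  }
}
```

Note that this is a difference of two medians, so it is only an estimate of
the typical warm-up time rather than the median of per-task warm-up times.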

See the following issues for more background:

https://issues.apache.org/jira/browse/AURORA-1221

https://issues.apache.org/jira/browse/AURORA-1222

The new metric represents the median time spent waiting for a set of tasks to
reach STARTING status within a time frame (including tasks that move on to
RUNNING state within the time frame).
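
The calculation can be sketched in a few lines of standalone Java. This is a
simplified illustration, not the MedianAlgorithm class from the scheduler: the
Status enum, the class, and the method names are hypothetical, KILLED stands in
for every terminal state, and the lower-median and empty-sample behaviour are
assumptions inferred from the expectations in the tests added by this commit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MttsSketch {
  // Hypothetical stand-in for org.apache.aurora.gen.ScheduleStatus.
  enum Status { PENDING, ASSIGNED, STARTING, RUNNING, KILLED }

  // PENDING -> STARTING wait for one task, or -1 if the task is terminal
  // or did not reach STARTING inside the half-open frame [from, to).
  static long timeToStarting(Map<Long, Status> events, long from, long to) {
    TreeMap<Long, Status> ordered = new TreeMap<>(events); // sort by timestamp
    if (ordered.isEmpty() || ordered.lastEntry().getValue() == Status.KILLED) {
      return -1; // terminal tasks are ignored (KILLED stands in for all of them)
    }
    long pendingTs = -1;
    for (Map.Entry<Long, Status> e : ordered.entrySet()) {
      if (e.getValue() == Status.PENDING) {
        pendingTs = e.getKey();
      } else if (e.getValue() == Status.STARTING
          && pendingTs >= 0 && e.getKey() >= from && e.getKey() < to) {
        return e.getKey() - pendingTs;
      }
    }
    return -1;
  }

  // Lower median of the sample; 0 when the sample is empty (assumed behaviour).
  static long median(List<Long> waits) {
    if (waits.isEmpty()) {
      return 0L;
    }
    List<Long> sorted = new ArrayList<>(waits);
    Collections.sort(sorted);
    return sorted.get((sorted.size() - 1) / 2);
  }

  // Median time to STARTING over all tasks that qualify.
  static long mtts(List<Map<Long, Status>> tasks, long from, long to) {
    List<Long> waits = new ArrayList<>();
    for (Map<Long, Status> task : tasks) {
      long wait = timeToStarting(task, from, to);
      if (wait >= 0) {
        waits.add(wait);
      }
    }
    return median(waits);
  }

  public static void main(String[] args) {
    List<Map<Long, Status>> tasks = List.of(
        Map.of(50L, Status.PENDING), // never reached STARTING
        Map.of(50L, Status.PENDING, 100L, Status.ASSIGNED, 150L, Status.STARTING),
        Map.of(100L, Status.PENDING, 200L, Status.ASSIGNED,
            300L, Status.STARTING, 400L, Status.RUNNING),
        Map.of(100L, Status.PENDING, 200L, Status.ASSIGNED,
            300L, Status.STARTING, 400L, Status.KILLED)); // terminal, ignored
    System.out.println(mtts(tasks, 0L, 500L)); // prints 100
  }
}
```

The sample tasks mirror the fixtures in testMedianTimeToStartingEven: the two
qualifying wait times are 100 and 200, and the lower of the two is reported.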

Here I regard STARTING as an active state. However, STARTING state is account
for platform and job uptime calculations.

Testing Done:
./gradlew build

./gradlew :test

./build-support/jenkins/build.sh

Reviewed at https://reviews.apache.org/r/51580/


Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/0c90c862
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/0c90c862
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/0c90c862

Branch: refs/heads/master
Commit: 0c90c862a14c3a5efe0fdf0f30ee41c01b96b434
Parents: 059b086
Author: Kai Huang <texasred2013@hotmail.com>
Authored: Tue Sep 6 12:26:13 2016 -0700
Committer: Zameer Manji <zmanji@apache.org>
Committed: Tue Sep 6 12:26:13 2016 -0700

----------------------------------------------------------------------
 docs/features/sla-metrics.md                    | 41 +++++++++++++-
 .../aurora/scheduler/sla/MetricCalculator.java  |  2 +
 .../aurora/scheduler/sla/SlaAlgorithm.java      |  2 +
 .../aurora/scheduler/sla/SlaAlgorithmTest.java  | 57 ++++++++++++++++++++
 4 files changed, 100 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/aurora/blob/0c90c862/docs/features/sla-metrics.md
----------------------------------------------------------------------
diff --git a/docs/features/sla-metrics.md b/docs/features/sla-metrics.md
index 932b5dc..bca2ebf 100644
--- a/docs/features/sla-metrics.md
+++ b/docs/features/sla-metrics.md
@@ -6,6 +6,7 @@ Aurora SLA Measurement
   - [Platform Uptime](#platform-uptime)
   - [Job Uptime](#job-uptime)
   - [Median Time To Assigned (MTTA)](#median-time-to-assigned-\(mtta\))
+  - [Median Time To Starting (MTTS)](#median-time-to-starting-\(mtts\))
   - [Median Time To Running (MTTR)](#median-time-to-running-\(mttr\))
 - [Limitations](#limitations)
 
@@ -109,7 +110,7 @@ metric that helps track the dependency of scheduling performance on the requeste
 * Per job - `sla_<job_key>_mtta_ms`
 * Per cluster - `sla_cluster_mtta_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are defined in:
-[ResourceAggregates.java](../../src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java)
+[ResourceBag.java](../../src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mtta_ms`
     * `sla_cpu_medium_mtta_ms`
@@ -135,6 +136,42 @@ MTTA only considers instances that have already reached ASSIGNED state and ignor
 that are still PENDING. This ensures straggler instances (e.g. with unreasonable resource
 constraints) do not affect metric curves.
 
+### Median Time To Starting (MTTS)
+
+*Median time a job waits for its tasks to reach STARTING state. This is a comprehensive metric
+reflecting on the overall time it takes for the Aurora/Mesos to start initializing the sandbox
+for a task.*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_mtts_ms`
+* Per cluster - `sla_cluster_mtts_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Size are defined in:
+[ResourceBag.java](../../src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+  * By CPU:
+    * `sla_cpu_small_mtts_ms`
+    * `sla_cpu_medium_mtts_ms`
+    * `sla_cpu_large_mtts_ms`
+    * `sla_cpu_xlarge_mtts_ms`
+    * `sla_cpu_xxlarge_mtts_ms`
+  * By RAM:
+    * `sla_ram_small_mtts_ms`
+    * `sla_ram_medium_mtts_ms`
+    * `sla_ram_large_mtts_ms`
+    * `sla_ram_xlarge_mtts_ms`
+    * `sla_ram_xxlarge_mtts_ms`
+  * By DISK:
+    * `sla_disk_small_mtts_ms`
+    * `sla_disk_medium_mtts_ms`
+    * `sla_disk_large_mtts_ms`
+    * `sla_disk_xlarge_mtts_ms`
+    * `sla_disk_xxlarge_mtts_ms`
+
+**Units:** milliseconds
+
+MTTS only considers instances in STARTING state. This ensures straggler instances (e.g. with
+unreasonable resource constraints) do not affect metric curves.
+
 ### Median Time To Running (MTTR)
 
 *Median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric
@@ -145,7 +182,7 @@ reflecting on the overall time it takes for the Aurora/Mesos to start executing
 * Per job - `sla_<job_key>_mttr_ms`
 * Per cluster - `sla_cluster_mttr_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are defined in:
-[ResourceAggregates.java](../../src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java)
+[ResourceBag.java](../../src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mttr_ms`
     * `sla_cpu_medium_mttr_ms`

http://git-wip-us.apache.org/repos/asf/aurora/blob/0c90c862/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java
----------------------------------------------------------------------
diff --git a/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java b/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java
index 3ddac8b..9a56cda 100644
--- a/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java
+++ b/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java
@@ -54,6 +54,7 @@ import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.JOB_UPT
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.JOB_UPTIME_99;
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_ASSIGNED;
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_RUNNING;
+import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_STARTING;
 import static org.apache.aurora.scheduler.sla.SlaGroup.GroupType.CLUSTER;
 import static org.apache.aurora.scheduler.sla.SlaGroup.GroupType.JOB;
 import static org.apache.aurora.scheduler.sla.SlaGroup.GroupType.RESOURCE_CPU;
@@ -88,6 +89,7 @@ class MetricCalculator implements Runnable {
         .build()),
     MEDIANS(ImmutableMultimap.<AlgorithmType, GroupType>builder()
         .putAll(MEDIAN_TIME_TO_ASSIGNED, JOB, CLUSTER, RESOURCE_CPU, RESOURCE_RAM, RESOURCE_DISK)
+        .putAll(MEDIAN_TIME_TO_STARTING, JOB, CLUSTER, RESOURCE_CPU, RESOURCE_RAM, RESOURCE_DISK)
         .putAll(MEDIAN_TIME_TO_RUNNING, JOB, CLUSTER, RESOURCE_CPU, RESOURCE_RAM, RESOURCE_DISK)
         .build());
 

http://git-wip-us.apache.org/repos/asf/aurora/blob/0c90c862/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java
----------------------------------------------------------------------
diff --git a/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java b/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java
index 4f243aa..263647e 100644
--- a/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java
+++ b/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java
@@ -43,6 +43,7 @@ import static java.util.Objects.requireNonNull;
 import static org.apache.aurora.gen.ScheduleStatus.ASSIGNED;
 import static org.apache.aurora.gen.ScheduleStatus.PENDING;
 import static org.apache.aurora.gen.ScheduleStatus.RUNNING;
+import static org.apache.aurora.gen.ScheduleStatus.STARTING;
 
 /**
  * Defines an SLA algorithm to be applied to a {@link IScheduledTask}
@@ -72,6 +73,7 @@ interface SlaAlgorithm {
     JOB_UPTIME_50(new JobUptime(50f), String.format(JobUptime.NAME_FORMAT, 50f)),
     AGGREGATE_PLATFORM_UPTIME(new AggregatePlatformUptime(), "platform_uptime_percent"),
     MEDIAN_TIME_TO_ASSIGNED(new MedianAlgorithm(ASSIGNED), "mtta_ms"),
+    MEDIAN_TIME_TO_STARTING(new MedianAlgorithm(STARTING), "mtts_ms"),
     MEDIAN_TIME_TO_RUNNING(new MedianAlgorithm(RUNNING), "mttr_ms");
 
     private final SlaAlgorithm algorithm;

http://git-wip-us.apache.org/repos/asf/aurora/blob/0c90c862/src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java
----------------------------------------------------------------------
diff --git a/src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java b/src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java
index 90ea3a1..eca1bee 100644
--- a/src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java
+++ b/src/test/java/org/apache/aurora/scheduler/sla/SlaAlgorithmTest.java
@@ -43,6 +43,7 @@ import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.JOB_UPT
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.JOB_UPTIME_99;
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_ASSIGNED;
 import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_RUNNING;
+import static org.apache.aurora.scheduler.sla.SlaAlgorithm.AlgorithmType.MEDIAN_TIME_TO_STARTING;
 import static org.junit.Assert.assertEquals;
 
 public class SlaAlgorithmTest {
@@ -98,6 +99,62 @@ public class SlaAlgorithmTest {
   }
 
   @Test
+  public void testMedianTimeToStartingEven() {
+    Number actual = MEDIAN_TIME_TO_STARTING.getAlgorithm().calculate(
+        ImmutableSet.of(
+            makeTask(ImmutableMap.of(50L, PENDING)), // Ignored as not RUNNING
+            makeTask(ImmutableMap.of(50L, PENDING, 100L, ASSIGNED, 150L, STARTING)),
+            makeTask(ImmutableMap.of(100L, PENDING, 200L, ASSIGNED, 300L, STARTING, 400L, RUNNING)),
+            makeTask(ImmutableMap.of(
+                100L, PENDING,
+                200L, ASSIGNED,
+                300L, STARTING,
+                400L, KILLED)), // Ignored due to being terminal.
+            makeTask(ImmutableMap.of(
+                50L, PENDING,
+                100L, ASSIGNED,
+                150L, STARTING,
+                200L, RUNNING,
+                300L, KILLED))), // Ignored due to being terminal.
+        Range.closedOpen(0L, 500L));
+    assertEquals(100L, actual);
+  }
+
+  @Test
+  public void testMedianTimeToStartingOdd() {
+    Number actual = MEDIAN_TIME_TO_STARTING.getAlgorithm().calculate(
+        ImmutableSet.of(
+            makeTask(ImmutableMap.of(50L, PENDING)), // Ignored as not RUNNING
+            makeTask(ImmutableMap.of(50L, PENDING, 100L, ASSIGNED, 150L, STARTING)),
+            makeTask(ImmutableMap.of(100L, PENDING, 200L, ASSIGNED, 300L, STARTING, 400L, RUNNING)),
+            makeTask(ImmutableMap.of(50L, PENDING, 100L, ASSIGNED, 350L, STARTING)),
+            makeTask(ImmutableMap.of(
+                100L, PENDING,
+                200L, ASSIGNED,
+                300L, STARTING,
+                400L, KILLED)), // Ignored due to being terminal.
+            makeTask(ImmutableMap.of(
+                50L, PENDING,
+                100L, ASSIGNED,
+                150L, STARTING,
+                200L, RUNNING,
+                300L, KILLED))), // Ignored due to being terminal.
+        Range.closedOpen(0L, 500L));
+    assertEquals(200L, actual);
+  }
+
+  @Test
+  public void testMedianTimeToStartingZero() {
+    Number actual = MEDIAN_TIME_TO_STARTING.getAlgorithm().calculate(
+        ImmutableSet.of(
+            makeTask(ImmutableMap.of(50L, PENDING)),
+            makeTask(ImmutableMap.of(50L, PENDING, 100L, STARTING, 200L, RUNNING, 300L, KILLED)),
+            makeTask(ImmutableMap.of(50L, PENDING, 100L, STARTING, 200L, KILLED))),
+        Range.closedOpen(0L, 500L));
+    assertEquals(0L, actual);
+  }
+
+  @Test
   public void testMedianTimeToRunningEven() {
     Number actual = MEDIAN_TIME_TO_RUNNING.getAlgorithm().calculate(
         ImmutableSet.of(

