aurora-commits mailing list archives

From jco...@apache.org
Subject svn commit: r1762695 [16/19] - in /aurora/site: data/ publish/ publish/documentation/0.10.0/ publish/documentation/0.10.0/build-system/ publish/documentation/0.10.0/client-cluster-configuration/ publish/documentation/0.10.0/client-commands/ publish/doc...
Date Wed, 28 Sep 2016 18:23:59 GMT
Added: aurora/site/source/documentation/0.16.0/features/resource-isolation.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/features/resource-isolation.md?rev=1762695&view=auto
==============================================================================
--- aurora/site/source/documentation/0.16.0/features/resource-isolation.md (added)
+++ aurora/site/source/documentation/0.16.0/features/resource-isolation.md Wed Sep 28 18:23:53 2016
@@ -0,0 +1,181 @@
+Resource Isolation and Sizing
+==============================
+
+This document assumes Aurora and Mesos have been configured
+using our [recommended resource isolation settings](../../operations/configuration/#resource-isolation).
+
+- [Isolation](#isolation)
+- [Sizing](#sizing)
+- [Oversubscription](#oversubscription)
+
+
+Isolation
+---------
+
+Aurora is a multi-tenant system; a single software instance runs on a
+server, serving multiple clients/tenants. To share resources among
+tenants, it leverages Mesos for isolation of:
+
+* CPU
+* GPU
+* memory
+* disk space
+* ports
+
+CPU is a soft limit and is handled differently from memory and disk space.
+Too low a CPU value results in your application being throttled and
+slowed down. Memory and disk space are both hard limits; when your
+application exceeds these values, it is killed.
+
+### CPU Isolation
+
+Mesos can be configured to use a quota-based CPU scheduler (the *Completely*
+*Fair Scheduler*) to provide consistent and predictable performance.
+This is effectively a guarantee of resources: you receive at least what
+you requested, but also no more than you requested.
+
+The scheduler gives applications a CPU quota for every 100 ms interval.
+When an application uses its quota for an interval, it is throttled for
+the rest of the 100 ms. Usage resets for each interval and unused
+quota does not carry over.
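The per-interval accounting can be modelled in a few lines. This is a simplified sketch, not Mesos or kernel code: it assumes a fixed 100 ms period, takes a hypothetical list of per-interval CPU demand in CPU-milliseconds, and only reports the demand that would be throttled (it does not model throttled work spilling into later intervals). The name `simulate_quota` is illustrative.

```python
PERIOD_MS = 100  # enforcement interval assumed in the text above


def simulate_quota(cpus, demand_per_interval):
    """Return (used_ms, throttled_ms) per interval for a task requesting `cpus`.

    Each interval grants cpus * PERIOD_MS CPU-milliseconds. Demand above that
    quota is throttled, and quota left unused in one interval is discarded
    rather than carried into the next one.
    """
    quota_ms = cpus * PERIOD_MS
    results = []
    for demand_ms in demand_per_interval:
        used_ms = min(demand_ms, quota_ms)  # never more than this interval's quota
        results.append((used_ms, demand_ms - used_ms))  # the excess is throttled
    return results
```

For a 4.0 CPU task, `simulate_quota(4.0, [200, 600, 400])` shows that the 200 ms left idle in the first interval does not help the second one: the second interval is still capped at 400 ms, with 200 ms throttled.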
+
+For example, an application specifying 4.0 CPU has access to 400 ms of
+CPU time every 100 ms. This CPU quota can be used in different ways,
+depending on the application and available resources. Consider the
+scenarios shown in this diagram.
+
+![CPU Availability](../images/CPUavailability.png)
+
+* *Scenario A*: the application can use up to 4 cores continuously for
+every 100 ms interval. It is never throttled and starts processing
+new requests immediately.
+
+* *Scenario B*: the application uses up to 8 cores (depending on
+availability) but is throttled after 50 ms. The CPU quota resets at the
+start of each new 100 ms interval.
+
+* *Scenario C*: like Scenario A, but there is a garbage collection
+event in the second interval that consumes all CPU quota. The
+application is throttled for the remaining 75 ms of that interval and
+cannot service requests until the next interval. In this example, the
+garbage collection finishes in one interval but, depending on how much
+garbage needs collecting, it may take more than one interval and further
+delay the servicing of requests.
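The point at which each scenario is throttled follows from dividing the quota by the number of cores kept busy. A minimal sketch; the 16-core figure used for Scenario C's garbage collector is an assumption chosen to match the 75 ms of throttling described above, not a number from the diagram.

```python
def throttle_point_ms(cpus, cores_busy, period_ms=100):
    """Milliseconds into an interval at which a task gets throttled.

    The quota of cpus * period_ms CPU-milliseconds drains at `cores_busy`
    CPU-milliseconds per wall-clock millisecond. Returns None when the quota
    lasts for the whole interval, i.e. the task is never throttled.
    """
    quota_ms = cpus * period_ms
    elapsed_ms = quota_ms / cores_busy
    return None if elapsed_ms >= period_ms else elapsed_ms
```

With a 4.0 CPU request: 4 busy cores (Scenario A) are never throttled, 8 cores (Scenario B) are throttled after 50 ms, and a collector assumed to keep 16 cores busy (Scenario C) exhausts the quota after 25 ms, leaving the task throttled for the remaining 75 ms.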
+
+*Technical Note*: Mesos considers logical cores, also known as
+hyperthreading or SMT cores, as the unit of CPU.
+
+### Memory Isolation
+
+Mesos uses dedicated memory allocation. Your application always has
+access to the amount of memory specified in your configuration. The
+application's memory use is defined as the sum of the resident set size
+(RSS) of all processes in a shard. Each shard is considered
+independently.
+
+In other words, say you specified a memory size of 10GB. Each shard
+would receive 10GB of memory. If an individual shard's memory demands
+exceed 10GB, that shard is killed, but the other shards continue
+working.
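The per-shard rule can be expressed as a small check. This sketch only illustrates the accounting described above (sum of RSS per shard, compared against the per-shard limit); the input mapping of shard id to process RSS values is hypothetical, not an Aurora or Mesos data structure.

```python
def shards_to_kill(rss_by_shard_mb, limit_mb):
    """Return the ids of shards whose summed process RSS exceeds the limit.

    rss_by_shard_mb maps shard id -> list of RSS values (MB), one per process.
    Each shard is judged independently against the same per-shard limit.
    """
    return sorted(
        shard
        for shard, rss_values in rss_by_shard_mb.items()
        if sum(rss_values) > limit_mb
    )
```

With a 10GB (10240 MB) limit, a shard whose processes use 8000 MB and 3000 MB is over the limit, while a shard using exactly 10240 MB is not, and the other shards keep running.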
+
+*Technical note*: Total memory size is not enforced at allocation time,
+so your application can request more than its allocation without getting
+an ENOMEM. However, it will be killed shortly after.
+
+### Disk Space
+
+Disk space used by your application is defined as the sum of the files'
+disk space in your application's directory, including the `stdout` and
+`stderr` logged from your application. Each shard is considered
+independently. You should use off-node storage for your application's
+data whenever possible.
+
+In other words, say you specified disk space size of 100MB. Each shard
+would receive 100MB of disk space. If an individual shard's disk space
+demands exceed 100MB, that shard is killed, but the other shards
+continue working.
+
+After your application finishes running, its allocated disk space is
+reclaimed. Thus, your job's final action should move any disk content
+that you want to keep, such as logs, to your home file system or other
+less transitory storage. Disk reclamation takes place an undefined
+period after the application finish time; until then, the disk contents
+are still available but you shouldn't count on them being so.
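A minimal sketch of such a final action in the Aurora DSL, assuming the `final` attribute of `Process` from the configuration reference (final processes run after the regular processes finish or the task is killed). The archive path `/mnt/archive` and `run_my_app.sh` are hypothetical placeholders, and `.logs` is assumed to be where Thermos keeps process stdout/stderr inside the sandbox.

```python
# Hypothetical example: copy sandbox logs to longer-lived storage before
# the sandbox is reclaimed. Paths and script names are placeholders.
main = Process(
  name = 'main',
  cmdline = './run_my_app.sh')

save_logs = Process(
  name = 'save_logs',
  cmdline = 'mkdir -p /mnt/archive/{{mesos.instance}} && '
            'cp -r .logs /mnt/archive/{{mesos.instance}}/',
  final = True)  # runs during finalization, after 'main' has ended

task = Task(
  processes = [main, save_logs],
  resources = Resources(cpu = 1.0, ram = 128*MB, disk = 128*MB))
```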
+
+*Technical note*: Disk space is not enforced at write time, so your
+application can write above its quota without getting an ENOSPC, but it
+will be killed shortly after. This is subject to change.
+
+### GPU Isolation
+
+GPU isolation is supported for NVIDIA devices starting with Mesos 1.0.
+Access to the allocated units is exclusive, with no sharing between tasks
+allowed (i.e., no fractional GPU allocation). For more details, see the
+[Mesos design document](https://docs.google.com/document/d/10GJ1A80x4nIEo8kfdeo9B11PIbS1xJrrB4Z373Ifkpo/edit#heading=h.w84lz7p4eexl)
+and the [Mesos agent configuration](http://mesos.apache.org/documentation/latest/configuration/).
+
+### Other Resources
+
+Other resources, such as network bandwidth, do not have any performance
+guarantees. For some resources, such as memory bandwidth, there are no
+practical sharing methods, so some combinations of applications colocated on
+the same host may cause contention.
+
+
+Sizing
+-------
+
+### CPU Sizing
+
+To correctly size Aurora-run Mesos tasks, specify a per-shard CPU value
+that lets the task run at its desired performance when peak load is
+distributed across all shards. Include reserve capacity of at least 50%,
+possibly more, depending on how critical your service is (or how
+confident you are about your original estimate :-)), ideally by
+increasing the number of shards to also improve resiliency. When running
+your application, observe its CPU stats over time. If they are consistently at or
+near your quota during peak load, consider increasing either
+per-shard CPU or the number of shards.
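The 50% reserve translates into simple arithmetic. A sketch under the assumption that you have measured the total CPU the service needs at peak across all shards; the 12-core figure in the example is hypothetical.

```python
import math


def per_shard_cpu(peak_total_cpus, shards, reserve=0.5):
    """Per-shard CPU request covering peak load plus a reserve fraction."""
    return peak_total_cpus * (1 + reserve) / shards


def shards_needed(peak_total_cpus, cpu_per_shard, reserve=0.5):
    """Alternative: keep per-shard CPU fixed and add shards for the reserve."""
    return math.ceil(peak_total_cpus * (1 + reserve) / cpu_per_shard)
```

For a service that needs 12 cores at peak, 6 shards would each request 3.0 CPUs; keeping shards at 2.0 CPUs instead calls for 9 shards, which is the approach the text prefers because extra shards also improve resiliency.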
+
+### Memory Sizing
+
+Size for your application's peak requirement. Observe the per-instance
+memory statistics over time, as memory requirements can vary over
+different periods. Remember that if your application exceeds its memory
+value, it will be killed, so you should also add a safety margin of
+around 10-20%. If you can, you may also want to set up alerts on
+per-instance memory usage.
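Applying the safety margin to an observed peak is a one-liner; this sketch uses an integer percentage to avoid floating-point rounding surprises, and the 20% default is simply the upper end of the range suggested above.

```python
import math


def ram_request_mb(observed_peak_mb, margin_pct=20):
    """Observed per-instance peak plus a safety margin, rounded up to whole MB."""
    return math.ceil(observed_peak_mb * (100 + margin_pct) / 100)
```

An instance observed peaking at 1000 MB would request 1200 MB with a 20% margin, or 1100 MB with 10%; the result can then be written as `ram = 1200*MB` in the task's `Resources`.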
+
+### Disk Space Sizing
+
+Size for your application's peak requirement. Rotate and discard log
+files as needed to stay within your quota. When running a Java process,
+add the maximum size of the Java heap to your disk space requirement, in
+order to account for an out of memory error dumping the heap
+into the application's sandbox space.
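For Java services, the heap allowance can be derived from the JVM's `-Xmx` flag. A minimal sketch, assuming the usual `k`/`m`/`g`/`t` suffixes (a bare number means bytes) and that the last `-Xmx` on the command line is the one that takes effect; the function names are illustrative.

```python
import math
import re

_MB_PER_UNIT = {'k': 1 / 1024, 'm': 1, 'g': 1024, 't': 1024 * 1024}


def xmx_to_mb(flag):
    """Convert a flag such as '-Xmx2g' to megabytes."""
    match = re.fullmatch(r'-Xmx(\d+)([kKmMgGtT]?)', flag)
    if match is None:
        raise ValueError(f'not an -Xmx flag: {flag!r}')
    value, unit = int(match.group(1)), match.group(2).lower()
    if not unit:  # no suffix: the value is in bytes
        return value / (1024 * 1024)
    return value * _MB_PER_UNIT[unit]


def disk_request_mb(peak_sandbox_mb, jvm_flags):
    """Peak sandbox usage plus room for a heap dump of the maximum heap size."""
    heap_mb = 0
    for flag in jvm_flags:
        if flag.startswith('-Xmx'):
            heap_mb = xmx_to_mb(flag)  # a later -Xmx overrides an earlier one
    return math.ceil(peak_sandbox_mb + heap_mb)
```

A service whose sandbox peaks at 512 MB and runs with `-Xms1g -Xmx2g` would request 2560 MB of disk.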
+
+### GPU Sizing
+
+GPU sizing is highly dependent on your application's requirements and is limited only
+by the number of physical GPU units available on a target host.
+
+
+Oversubscription
+----------------
+
+Mesos supports [oversubscription of machine resources](http://mesos.apache.org/documentation/latest/oversubscription/)
+via the concept of revocable tasks. In contrast to non-revocable tasks, revocable tasks are best-effort.
+Mesos reserves the right to throttle or even kill them if they might affect existing high-priority
+user-facing services.
+
+As of today, the only revocable resources supported by Aurora are CPU and RAM. A job can
+opt in to use them by specifying the `revocable` [Configuration Tier](../../features/multitenancy/#configuration-tiers).
+A revocable job will only be scheduled using revocable resources, even if there are plenty of
+non-revocable resources available.
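A sketch of what opting in might look like in a `.aurora` file, assuming the `tier` attribute on the job and a tier named `revocable` in the cluster's tier configuration; the job name and command are placeholders.

```python
# Sketch: a service scheduled only on revocable CPU/RAM offers.
# Assumes the cluster defines a tier named 'revocable'.
hello = Process(
  name = 'hello',
  cmdline = 'while true; do echo hello; sleep 10; done')

jobs = [
  Service(
    task = SequentialTask(
      processes = [hello],
      resources = Resources(cpu = 1.0, ram = 128*MB, disk = 128*MB)),
    cluster = 'devcluster',
    role = 'www-data',
    environment = 'devel',
    name = 'hello_revocable',
    tier = 'revocable')
]
```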
+
+The Aurora scheduler must be [configured to receive revocable offers](../../operations/configuration/#resource-isolation)
+from Mesos and accept revocable jobs. If it is not configured properly, revocable tasks will never get
+assigned to hosts and will stay in `PENDING`.
+
+For details on how to mark a job as being revocable, see the
+[Configuration Reference](../../reference/configuration/).

Added: aurora/site/source/documentation/0.16.0/features/service-discovery.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/features/service-discovery.md?rev=1762695&view=auto
==============================================================================
--- aurora/site/source/documentation/0.16.0/features/service-discovery.md (added)
+++ aurora/site/source/documentation/0.16.0/features/service-discovery.md Wed Sep 28 18:23:53 2016
@@ -0,0 +1,42 @@
+Service Discovery
+=================
+
+It is possible for the Aurora executor to announce tasks into ServerSets for
+the purpose of service discovery. ServerSets use the ZooKeeper [group membership pattern](http://zookeeper.apache.org/doc/trunk/recipes.html#sc_outOfTheBox),
+of which there are several reference implementations:
+
+  - [C++](https://github.com/apache/mesos/blob/master/src/zookeeper/group.cpp)
+  - [Java](https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/zookeeper/ServerSetImpl.java#L221)
+  - [Python](https://github.com/twitter/commons/blob/master/src/python/twitter/common/zookeeper/serverset/serverset.py#L51)
+
+These can also be used natively in Finagle using the [ZookeeperServerSetCluster](https://github.com/twitter/finagle/blob/master/finagle-serversets/src/main/scala/com/twitter/finagle/zookeeper/ZookeeperServerSetCluster.scala).
+
+For more information about how to configure announcing, see the [Configuration Reference](../../reference/configuration/).
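As a starting point, an announcing service might look like the sketch below. It assumes the `Announcer` object and its `primary_port`/`portmap` fields from the configuration reference; `task` is assumed to be defined earlier in the same `.aurora` file with a port named `http`.

```python
# Sketch: register the 'http' port in a ServerSet and expose an alias for it.
jobs = [
  Service(
    task = task,  # assumed to be defined earlier and to request port 'http'
    cluster = 'devcluster',
    role = 'www-data',
    environment = 'prod',
    name = 'hello',
    announce = Announcer(
      primary_port = 'http',        # port registered as the primary endpoint
      portmap = {'alias': 'http'})  # 'alias' resolves to the same port
  )
]
```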
+
+Using Mesos DiscoveryInfo
+-------------------------
+Aurora has experimental support for populating DiscoveryInfo in Mesos. This can be used to build
+a custom service discovery system that does not use ZooKeeper. Please see the `Service Discovery` section in the
+[Mesos Framework Development guide](http://mesos.apache.org/documentation/latest/app-framework-development-guide/) for an
+explanation of the protobuf message in Mesos.
+
+To use this feature, enable the `--populate_discovery_info` flag on the scheduler. All jobs started by the scheduler
+afterwards will have their portmap populated to Mesos and will be discoverable via the `/state` endpoint of the Mesos master and agent.
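A custom discovery client could read this information back from the master. The sketch below assumes the JSON layout `frameworks[].tasks[].discovery` with ports nested as `ports.ports[]` (each entry carrying `name` and `number`), which mirrors the DiscoveryInfo protobuf; check it against your Mesos version before relying on it. The function names are hypothetical.

```python
import json
import urllib.request


def discovered_ports(state):
    """Map each DiscoveryInfo name to a list of {port_name: port_number}, one per task."""
    result = {}
    for framework in state.get('frameworks', []):
        for task in framework.get('tasks', []):
            discovery = task.get('discovery')
            if not discovery:
                continue  # task was launched without DiscoveryInfo
            ports = discovery.get('ports', {}).get('ports', [])
            result.setdefault(discovery.get('name'), []).append(
                {port.get('name'): port.get('number') for port in ports})
    return result


def fetch_state(master_url):
    """Fetch the master's /state JSON, e.g. from 'http://<master>:5050'."""
    with urllib.request.urlopen(master_url + '/state', timeout=5) as resp:
        return json.load(resp)
```

Instances of the same job share one discovery name, which is why each name maps to a list with one port mapping per task.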
+
+### Using Mesos DNS
+One example is [Mesos-DNS](https://github.com/mesosphere/mesos-dns), which can generate multiple DNS
+records. With the current implementation, the example job with key `devcluster/vagrant/test/http_example` generates at
+least the following:
+
+1. An A record for `http_example.test.vagrant.aurora.mesos` (which only includes the IP address);
+2. An [SRV record](https://en.wikipedia.org/wiki/SRV_record) for
+   `_http_example.test.vagrant._tcp.aurora.mesos`, which includes the IP address and every port. This should only
+   be used if the service has one port.
+3. An SRV record `_{port-name}._http_example.test.vagrant._tcp.aurora.mesos` for each port name
+   defined. This should be used when the service has multiple ports.
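The naming pattern in the list above can be reproduced from a job key. This sketch only rebuilds the record names as listed here, assuming the framework label `aurora` and the default `mesos` domain; it does not query DNS and is not Mesos-DNS code.

```python
def mesos_dns_names(job_key, port_names, framework='aurora', domain='mesos'):
    """Build the DNS names listed above for a job key 'cluster/role/env/name'."""
    _cluster, role, environment, name = job_key.split('/')
    service = f'{name}.{environment}.{role}'
    names = {
        'A': f'{service}.{framework}.{domain}',
        'SRV': f'_{service}._tcp.{framework}.{domain}',
    }
    for port in port_names:  # one SRV record per named port
        names[f'SRV:{port}'] = f'_{port}._{service}._tcp.{framework}.{domain}'
    return names
```

For `devcluster/vagrant/test/http_example` with ports `http` and `tcp`, a client would resolve `_http._http_example.test.vagrant._tcp.aurora.mesos` to find the HTTP port specifically.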
+
+Things to note:
+
+1. The domain part (".mesos" in the example above) can be configured in [Mesos DNS](http://mesosphere.github.io/mesos-dns/docs/configuration-parameters.html);
+2. Right now, the portmap and port aliases in the announcer object are not reflected in DiscoveryInfo, and therefore are not visible in
+   Mesos DNS records either. This is because they are only resolved in the Thermos executor.

Added: aurora/site/source/documentation/0.16.0/features/services.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/features/services.md?rev=1762695&view=auto
==============================================================================
--- aurora/site/source/documentation/0.16.0/features/services.md (added)
+++ aurora/site/source/documentation/0.16.0/features/services.md Wed Sep 28 18:23:53 2016
@@ -0,0 +1,99 @@
+Long-running Services
+=====================
+
+Jobs that always restart on completion, whether successful or unsuccessful,
+are called services. This is useful for long-running processes
+such as webservices that should always be running, unless stopped explicitly.
+
+
+Service Specification
+---------------------
+
+A job is identified as a service by the presence of the flag
+`service=True` in the [`Job`](../../reference/configuration/#job-objects) object.
+The `Service` alias can be used as shorthand for `Job` with `service=True`.
+
+Example (available in the [Vagrant environment](../../getting-started/vagrant/)):
+
+    $ cat /vagrant/examples/jobs/hello_world.aurora
+    hello = Process(
+      name = 'hello',
+      cmdline = """
+        while true; do
+          echo hello world
+          sleep 10
+        done
+      """)
+
+    task = SequentialTask(
+      processes = [hello],
+      resources = Resources(cpu = 1.0, ram = 128*MB, disk = 128*MB)
+    )
+
+    jobs = [
+      Service(
+        task = task,
+        cluster = 'devcluster',
+        role = 'www-data',
+        environment = 'prod',
+        name = 'hello'
+      )
+    ]
+
+
+Jobs without the service bit set only restart up to `max_task_failures` times and only if they
+terminated unsuccessfully either due to human error or machine failure (see the
+[`Job`](../../reference/configuration/#job-objects) object for details).
+
+
+Ports
+-----
+
+In order to be useful, most services have to bind to one or more ports. Aurora enables this
+use case via the [`thermos.ports` namespace](../../reference/configuration/#thermos-namespace), which
+allows you to request arbitrarily named ports:
+
+
+    nginx = Process(
+      name = 'nginx',
+      cmdline = './run_nginx.sh -port {{thermos.ports[http]}}'
+    )
+
+
+When this process is included in a job, the job will be allocated a port, and the command line
+will be replaced with something like:
+
+    ./run_nginx.sh -port 42816
+
+Where 42816 happens to be the allocated port.
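The substitution step can be illustrated with a regular expression. Thermos actually binds these references through its Pystachio templating, so this is only a simplified model of the effect; `bind_ports` is a hypothetical name.

```python
import re

_PORT_REF = re.compile(r'\{\{thermos\.ports\[(\w+)\]\}\}')


def bind_ports(cmdline, assigned_ports):
    """Replace {{thermos.ports[name]}} references with allocated port numbers.

    Raises KeyError if the command line refers to a port that was not assigned.
    """
    return _PORT_REF.sub(lambda m: str(assigned_ports[m.group(1)]), cmdline)
```

With `{'http': 42816}` as the allocation, `./run_nginx.sh -port {{thermos.ports[http]}}` becomes `./run_nginx.sh -port 42816`.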
+
+For details on how to enable clients to discover this dynamically assigned port, see our
+[Service Discovery](../service-discovery/) documentation.
+
+
+Health Checking
+---------------
+
+Typically, the Thermos executor monitors processes within a task only by liveness of the forked
+process. In addition to that, Aurora has support for rudimentary health checking, either via HTTP
+or via custom shell scripts.
+
+For example, simply by requesting a `health` port, a process can request to be health checked
+via repeated calls to the `/health` endpoint:
+
+    nginx = Process(
+      name = 'nginx',
+      cmdline = './run_nginx.sh -port {{thermos.ports[health]}}'
+    )
+
+Please see the
+[configuration reference](../../reference/configuration/#healthcheckconfig-objects)
+for configuration options for this feature.
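On the application side, the process only needs to answer `GET /health` on the `health` port. A minimal standard-library sketch, assuming the checker expects a 200 response with the body `ok` (the default expected response in the configuration reference); adjust it if your `HealthCheckConfig` says otherwise.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 'ok'; every other path gets a 404."""

    def do_GET(self):
        if self.path == '/health':
            status, body = 200, b'ok'
        else:
            status, body = 404, b'not found'
        self.send_response(status)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep per-request logging out of the sandbox stderr


def serve(port):
    """Block forever, serving health checks on the given port."""
    HTTPServer(('', port), HealthHandler).serve_forever()
```

A process would call `serve()` with the port from `{{thermos.ports[health]}}`, for example by passing it on the command line as in the `nginx` example.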
+
+You can pause health checking by touching a file inside of your sandbox, named `.healthchecksnooze`.
+As long as that file is present, health checks will be disabled, enabling users to gather core
+dumps or other performance measurements without worrying about Aurora's health check killing
+their process.
+
+WARNING: Remember to remove this when you are done, otherwise your instance will have permanently
+disabled health checks.
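One way to avoid forgetting the snooze file is to create and remove it in a single scope. A small sketch, assuming you pass the sandbox directory (often the process's working directory); the helper name is hypothetical.

```python
import contextlib
import os


@contextlib.contextmanager
def health_checks_snoozed(sandbox_dir):
    """Create .healthchecksnooze in the sandbox and always remove it afterwards."""
    marker = os.path.join(sandbox_dir, '.healthchecksnooze')
    open(marker, 'a').close()  # same effect as `touch`
    try:
        yield marker
    finally:
        with contextlib.suppress(FileNotFoundError):
            os.remove(marker)  # re-enable health checks even if the body failed
```

Wrapping a core dump or profiling run in `with health_checks_snoozed(os.getcwd()):` re-enables health checks as soon as the block exits, including on errors.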

Added: aurora/site/source/documentation/0.16.0/features/sla-metrics.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/features/sla-metrics.md?rev=1762695&view=auto
==============================================================================
--- aurora/site/source/documentation/0.16.0/features/sla-metrics.md (added)
+++ aurora/site/source/documentation/0.16.0/features/sla-metrics.md Wed Sep 28 18:23:53 2016
@@ -0,0 +1,215 @@
+Aurora SLA Measurement
+======================
+
+- [Overview](#overview)
+- [Metric Details](#metric-details)
+  - [Platform Uptime](#platform-uptime)
+  - [Job Uptime](#job-uptime)
+  - [Median Time To Assigned (MTTA)](#median-time-to-assigned-\(mtta\))
+  - [Median Time To Starting (MTTS)](#median-time-to-starting-\(mtts\))
+  - [Median Time To Running (MTTR)](#median-time-to-running-\(mttr\))
+- [Limitations](#limitations)
+
+## Overview
+
+The primary goal of the feature is collection and monitoring of Aurora job SLA (Service Level
+Agreement) metrics that define a contractual relationship between the Aurora/Mesos platform
+and hosted services.
+
+The Aurora SLA feature is by default only enabled for service (non-cron)
+production jobs (`"production=True"` in your `.aurora` config). It can be enabled for
+non-production services by an operator via the scheduler command line flag `-sla_non_prod_metrics`.
+
+Counters that track SLA measurements are computed periodically within the scheduler.
+The individual instance metrics are refreshed every minute (configurable via
+`sla_stat_refresh_interval`). The instance counters are subsequently aggregated by
+relevant grouping types before being exported to the scheduler's `/vars` endpoint (when using
+`vagrant`, that would be `http://192.168.33.7:8081/vars`).
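To pull these counters into your own tooling, the endpoint can be scraped and filtered by the `sla_` prefix. This sketch assumes `/vars` returns plain text with one `name value` pair per line; the function names are hypothetical.

```python
import urllib.request


def parse_vars(text, prefix='sla_'):
    """Parse 'name value' lines, keeping numeric metrics whose name has the prefix."""
    metrics = {}
    for line in text.splitlines():
        name, _, value = line.partition(' ')
        if not name.startswith(prefix) or not value:
            continue
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # skip non-numeric values
    return metrics


def fetch_sla_metrics(base_url='http://192.168.33.7:8081'):
    """Fetch and filter /vars from the scheduler (Vagrant address by default)."""
    with urllib.request.urlopen(base_url + '/vars', timeout=5) as resp:
        return parse_vars(resp.read().decode('utf-8'))
```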
+
+
+## Metric Details
+
+### Platform Uptime
+
+*Aggregate amount of time a job spends in a non-runnable state due to platform unavailability
+or scheduling delays. This metric tracks Aurora/Mesos uptime performance and reflects on any
+system-caused downtime events (tasks LOST or DRAINED). Any user-initiated task kills/restarts
+will not degrade this metric.*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_platform_uptime_percent`
+* Per cluster - `sla_cluster_platform_uptime_percent`
+
+**Units:** percent
+
+A fault in the task environment may cause Aurora and Mesos to have different views on the task state
+or to lose track of the task's existence. In such cases, the service task is marked as LOST and
+rescheduled by Aurora. For example, this may happen when the task stays in ASSIGNED or STARTING
+for too long or the Mesos agent becomes unhealthy (or disappears completely). The time between
+a task entering LOST and its replacement reaching RUNNING state is counted towards platform downtime.
+
+Another example of a platform downtime event is the administrator-requested task rescheduling. This
+happens during planned Mesos agent maintenance when all agent tasks are marked as DRAINED and
+rescheduled elsewhere.
+
+To accurately calculate Platform Uptime, we must separate platform incurred downtime from user
+actions that put a service instance in a non-operational state. It is simpler to isolate
+user-incurred downtime and treat all other downtime as platform incurred.
+
+Currently, a user can cause downtime for a healthy service (task) in only two ways: via the `killTasks`
+or `restartShards` RPCs. For both, the affected tasks leave an audit trail of state transitions
+relevant to uptime calculations. By applying a special "SLA meaning" to exposed task state
+transition records, we can build a deterministic downtime trace for every given service instance.
+
+A task going through a state transition carries one of three possible SLA meanings
+(see [SlaAlgorithm.java](https://github.com/apache/aurora/blob/rel/0.16.0/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java) for
+sla-to-task-state mapping):
+
+* Task is UP: starts a period where the task is considered to be up and running from the Aurora
+  platform standpoint.
+
+* Task is DOWN: starts a period where the task cannot reach the UP state for some
+  non-user-related reason. Counts towards instance downtime.
+
+* Task is REMOVED from SLA: starts a period where the task is not expected to be UP due to
+  user-initiated action or failure. We ignore this period for the uptime calculation purposes.
+
+This metric is recalculated over the last sampling period (last minute) to account for
+any UP/DOWN/REMOVED events. It ignores any UP/DOWN events not immediately adjacent to the
+sampling interval as well as adjacent REMOVED events.
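The effect of the three SLA meanings on the percentage can be sketched as follows. This is a simplified model for intuition, not the logic in `SlaAlgorithm.java`: it ignores the sampling-window adjacency rules above and simply lets each event's meaning last until the next event.

```python
def platform_uptime_percent(events, window_end):
    """Uptime % from time-ordered (timestamp, meaning) pairs.

    meaning is 'UP', 'DOWN' or 'REMOVED'; each lasts until the next event,
    the last one until window_end. REMOVED time is excluded entirely.
    Returns None if there is no UP or DOWN time to measure.
    """
    up = down = 0
    ends = [timestamp for timestamp, _ in events[1:]] + [window_end]
    for (start, meaning), end in zip(events, ends):
        if meaning == 'UP':
            up += end - start
        elif meaning == 'DOWN':
            down += end - start
    total = up + down
    return None if total == 0 else 100.0 * up / total
```

An instance that is UP for 40 s, DOWN for 10 s (for example LOST), UP for another 30 s and then REMOVED by a user kill scores 87.5%: the 20 s after the kill do not count against the platform.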
+
+### Job Uptime
+
+*Percentage of the job instances considered to be in RUNNING state for the specified duration
+relative to request time. This is a purely application-side metric that considers the aggregate
+uptime of all RUNNING instances. Any user- or platform-initiated restarts directly affect
+this metric.*
+
+**Collection scope:** We currently expose job uptime values at 5 pre-defined
+percentiles (50th, 75th, 90th, 95th and 99th):
+
+* `sla_<job_key>_job_uptime_50_00_sec`
+* `sla_<job_key>_job_uptime_75_00_sec`
+* `sla_<job_key>_job_uptime_90_00_sec`
+* `sla_<job_key>_job_uptime_95_00_sec`
+* `sla_<job_key>_job_uptime_99_00_sec`
+
+**Units:** seconds
+
+You can also get customized real-time stats from the Aurora client. See `aurora sla -h` for
+more details.
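One reading of the definition above is: the value for percentile *p* is the largest uptime *X* such that at least *p*% of instances have been RUNNING for *X* seconds or more. The sketch below implements that interpretation for intuition only; it is not taken from Aurora's source, whose percentile arithmetic may differ.

```python
import math


def job_uptime_at_percentile(uptimes_secs, percentile):
    """Largest X such that at least `percentile`% of instances have uptime >= X."""
    if not uptimes_secs:
        return None
    ranked = sorted(uptimes_secs, reverse=True)  # longest-running first
    index = math.ceil(percentile / 100 * len(ranked)) - 1
    return ranked[max(index, 0)]
```

For four instances up for 10, 20, 30 and 40 seconds, the 50th percentile value is 30 (two of four instances have run at least that long) and the 99th is 10.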
+
+### Median Time To Assigned (MTTA)
+
+*Median time a job spends waiting for its tasks to be assigned to a host. This is a combined
+metric that helps track the dependency of scheduling performance on the requested resources
+(user scope) as well as the internal scheduler bin-packing algorithm efficiency (platform scope).*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_mtta_ms`
+* Per cluster - `sla_cluster_mtta_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
+[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.16.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+  * By CPU:
+    * `sla_cpu_small_mtta_ms`
+    * `sla_cpu_medium_mtta_ms`
+    * `sla_cpu_large_mtta_ms`
+    * `sla_cpu_xlarge_mtta_ms`
+    * `sla_cpu_xxlarge_mtta_ms`
+  * By RAM:
+    * `sla_ram_small_mtta_ms`
+    * `sla_ram_medium_mtta_ms`
+    * `sla_ram_large_mtta_ms`
+    * `sla_ram_xlarge_mtta_ms`
+    * `sla_ram_xxlarge_mtta_ms`
+  * By DISK:
+    * `sla_disk_small_mtta_ms`
+    * `sla_disk_medium_mtta_ms`
+    * `sla_disk_large_mtta_ms`
+    * `sla_disk_xlarge_mtta_ms`
+    * `sla_disk_xxlarge_mtta_ms`
+
+**Units:** milliseconds
+
+MTTA only considers instances that have already reached ASSIGNED state and ignores those
+that are still PENDING. This ensures straggler instances (e.g. with unreasonable resource
+constraints) do not affect metric curves.
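The exclusion of PENDING stragglers can be sketched as a median over only the instances that have been assigned. The input shape, `(pending_ts_ms, assigned_ts_ms or None)` per instance, is a hypothetical simplification of the task event history; MTTS and MTTR, described below, follow the same pattern with STARTING and RUNNING timestamps.

```python
import statistics


def median_time_to_assigned_ms(instances):
    """Median PENDING -> ASSIGNED wait, ignoring instances still PENDING.

    instances: iterable of (pending_ts_ms, assigned_ts_ms) where assigned_ts_ms
    is None for instances that have not been assigned yet.
    Returns None when no instance has been assigned.
    """
    waits = [assigned - pending
             for pending, assigned in instances
             if assigned is not None]
    return statistics.median(waits) if waits else None
```

An unschedulable instance stuck in PENDING therefore leaves the metric untouched instead of dragging it upward indefinitely.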
+
+### Median Time To Starting (MTTS)
+
+*Median time a job waits for its tasks to reach STARTING state. This is a comprehensive metric
+reflecting on the overall time it takes for Aurora/Mesos to start initializing the sandbox
+for a task.*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_mtts_ms`
+* Per cluster - `sla_cluster_mtts_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
+[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.16.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+  * By CPU:
+    * `sla_cpu_small_mtts_ms`
+    * `sla_cpu_medium_mtts_ms`
+    * `sla_cpu_large_mtts_ms`
+    * `sla_cpu_xlarge_mtts_ms`
+    * `sla_cpu_xxlarge_mtts_ms`
+  * By RAM:
+    * `sla_ram_small_mtts_ms`
+    * `sla_ram_medium_mtts_ms`
+    * `sla_ram_large_mtts_ms`
+    * `sla_ram_xlarge_mtts_ms`
+    * `sla_ram_xxlarge_mtts_ms`
+  * By DISK:
+    * `sla_disk_small_mtts_ms`
+    * `sla_disk_medium_mtts_ms`
+    * `sla_disk_large_mtts_ms`
+    * `sla_disk_xlarge_mtts_ms`
+    * `sla_disk_xxlarge_mtts_ms`
+
+**Units:** milliseconds
+
+MTTS only considers instances in STARTING state. This ensures straggler instances (e.g. with
+unreasonable resource constraints) do not affect metric curves.
+
+### Median Time To Running (MTTR)
+
+*Median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric
+reflecting on the overall time it takes for Aurora/Mesos to start executing user content.*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_mttr_ms`
+* Per cluster - `sla_cluster_mttr_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
+[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.16.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+  * By CPU:
+    * `sla_cpu_small_mttr_ms`
+    * `sla_cpu_medium_mttr_ms`
+    * `sla_cpu_large_mttr_ms`
+    * `sla_cpu_xlarge_mttr_ms`
+    * `sla_cpu_xxlarge_mttr_ms`
+  * By RAM:
+    * `sla_ram_small_mttr_ms`
+    * `sla_ram_medium_mttr_ms`
+    * `sla_ram_large_mttr_ms`
+    * `sla_ram_xlarge_mttr_ms`
+    * `sla_ram_xxlarge_mttr_ms`
+  * By DISK:
+    * `sla_disk_small_mttr_ms`
+    * `sla_disk_medium_mttr_ms`
+    * `sla_disk_large_mttr_ms`
+    * `sla_disk_xlarge_mttr_ms`
+    * `sla_disk_xxlarge_mttr_ms`
+
+**Units:** milliseconds
+
+MTTR only considers instances in RUNNING state. This ensures straggler instances (e.g. with
+unreasonable resource constraints) do not affect metric curves.
+
+## Limitations
+
+* The availability of Aurora SLA metrics is bound by the scheduler availability.
+
+* All metrics are calculated at a pre-defined interval (currently set at 1 minute).
+  Scheduler restarts may result in missed collections.

Added: aurora/site/source/documentation/0.16.0/features/webhooks.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/features/webhooks.md?rev=1762695&view=auto
==============================================================================
--- aurora/site/source/documentation/0.16.0/features/webhooks.md (added)
+++ aurora/site/source/documentation/0.16.0/features/webhooks.md Wed Sep 28 18:23:53 2016
@@ -0,0 +1,80 @@
+Webhooks
+========
+
+Aurora has an optional feature that allows an operator to specify a file configuring an HTTP webhook
+to receive task state change events. It can be enabled with a scheduler flag, e.g.
+`-webhook_config=/path/to/webhook.json`. At this point, webhooks are still considered *experimental*.
+
+Below is a sample configuration:
+
+```json
+{
+  "headers": {
+    "Content-Type": "application/vnd.kafka.json.v1+json",
+    "Producer-Type": "reliable"
+  },
+  "targetURL": "http://localhost:5000/",
+  "timeoutMsec": 5
+}
+```
+
+And here is an example of an event payload that the webhook target receives:
+```json
+{
+    "task":
+    {
+        "cachedHashCode":0,
+        "assignedTask": {
+            "cachedHashCode":0,
+            "taskId":"vagrant-test-http_example-8-a6cf7ec5-d793-49c7-b10f-0e14ab80bfff",
+            "task": {
+                "cachedHashCode":-1819348376,
+                "job": {
+                    "cachedHashCode":803049425,
+                    "role":"vagrant",
+                    "environment":"test",
+                    "name":"http_example"
+                    },
+                "owner": {
+                    "cachedHashCode":226895216,
+                    "user":"vagrant"
+                    },
+                "isService":true,
+                "numCpus":0.1,
+                "ramMb":16,
+                "diskMb":8,
+                "priority":0,
+                "maxTaskFailures":1,
+                "production":false,
+                "resources":[
+                    {"cachedHashCode":729800451,"setField":"NUM_CPUS","value":0.1},
+                    {"cachedHashCode":552899914,"setField":"RAM_MB","value":16},
+                    {"cachedHashCode":-1547868317,"setField":"DISK_MB","value":8},
+                    {"cachedHashCode":1957328227,"setField":"NAMED_PORT","value":"http"},
+                    {"cachedHashCode":1954229436,"setField":"NAMED_PORT","value":"tcp"}
+                    ],
+                "constraints":[],
+                "requestedPorts":["http","tcp"],
+                "taskLinks":{"http":"http://%host%:%port:http%"},
+                "contactEmail":"vagrant@localhost",
+                "executorConfig": {
+                    "cachedHashCode":-1194797325,
+                    "name":"AuroraExecutor",
+                    "data": "{\"environment\": \"test\", \"health_check_config\": {\"initial_interval_secs\": 5.0, \"health_checker\": { \"http\": {\"expected_response_code\": 0, \"endpoint\": \"/health\", \"expected_response\": \"ok\"}}, \"max_consecutive_failures\": 0, \"timeout_secs\": 1.0, \"interval_secs\": 1.0}, \"name\": \"http_example\", \"service\": true, \"max_task_failures\": 1, \"cron_collision_policy\": \"KILL_EXISTING\", \"enable_hooks\": false, \"cluster\": \"devcluster\", \"task\": {\"processes\": [{\"daemon\": false, \"name\": \"echo_ports\", \"ephemeral\": false, \"max_failures\": 1, \"min_duration\": 5, \"cmdline\": \"echo \\\"tcp port: {{thermos.ports[tcp]}}; http port: {{thermos.ports[http]}}; alias: {{thermos.ports[alias]}}\\\"\", \"final\": false}, {\"daemon\": false, \"name\": \"stage_server\", \"ephemeral\": false, \"max_failures\": 1, \"min_duration\": 5, \"cmdline\": \"cp /vagrant/src/test/sh/org/apache/aurora/e2e/http_example.py .\", \"final\": false}, {\"daemon\": false, \"name\": \"run_server\", \"ephemeral\": false, \"max_failures\": 1, \"min_duration\": 5, \"cmdline\": \"python http_example.py {{thermos.ports[http]}}\", \"final\": false}], \"name\": \"http_example\", \"finalization_wait\": 30, \"max_failures\": 1, \"max_concurrency\": 0, \"resources\": {\"disk\": 8388608, \"ram\": 16777216, \"cpu\": 0.1}, \"constraints\": [{\"order\": [\"echo_ports\", \"stage_server\", \"run_server\"]}]}, \"production\": false, \"role\": \"vagrant\", \"contact\": \"vagrant@localhost\", \"announce\": {\"primary_port\": \"http\", \"portmap\": {\"alias\": \"http\"}}, \"lifecycle\": {\"http\": {\"graceful_shutdown_endpoint\": \"/quitquitquit\", \"port\": \"health\", \"shutdown_endpoint\": \"/abortabortabort\"}}, \"priority\": 0}"},
+                "metadata":[],
+                "container":{
+                    "cachedHashCode":-1955376216,
+                    "setField":"MESOS",
+                    "value":{"cachedHashCode":31}}
+            },
+            "assignedPorts":{},
+            "instanceId":8
+        },
+        "status":"PENDING",
+        "failureCount":0,
+        "taskEvents":[
+            {"cachedHashCode":0,"timestamp":1464992060258,"status":"PENDING","scheduler":"aurora"}]
+    },
+    "oldState":{}
+}
+```
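A receiver for these events can be small. The sketch below uses only the standard library and assumes the scheduler POSTs one JSON document per event with a `Content-Length` header; it extracts the fields visible in the sample payload. `oldState` is returned as-is (or `None` when empty, as in the sample) because its non-empty shape is not shown here. Respond quickly: the sample configuration sets a very short `timeoutMsec`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_event(event):
    """Return (task_id, new_status, old_state) from a webhook payload."""
    task = event.get('task', {})
    task_id = task.get('assignedTask', {}).get('taskId')
    old_state = event.get('oldState') or None  # {} when there is no prior state
    return task_id, task.get('status'), old_state


class WebhookHandler(BaseHTTPRequestHandler):
    received = []  # event summaries, kept on the class for simplicity

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        event = json.loads(self.rfile.read(length))
        WebhookHandler.received.append(summarize_event(event))
        self.send_response(200)  # acknowledge immediately
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging
```

Running `HTTPServer(('', 5000), WebhookHandler).serve_forever()` would match the `targetURL` in the sample configuration.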
+

Added: aurora/site/source/documentation/0.16.0/getting-started/overview.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/getting-started/overview.md?rev=1762695&view=auto
==============================================================================
--- aurora/site/source/documentation/0.16.0/getting-started/overview.md (added)
+++ aurora/site/source/documentation/0.16.0/getting-started/overview.md Wed Sep 28 18:23:53 2016
@@ -0,0 +1,112 @@
+Aurora System Overview
+======================
+
+Apache Aurora is a service scheduler that runs on top of Apache Mesos, enabling you to run
+long-running services, cron jobs, and ad-hoc jobs that take advantage of Apache Mesos' scalability,
+fault-tolerance, and resource isolation.
+
+
+Components
+----------
+
+It is important to have an understanding of the components that make up
+a functioning Aurora cluster.
+
+![Aurora Components](../images/components.png)
+
+* **Aurora scheduler**
+  The scheduler is your primary interface to the work you run in your cluster.  You will
+  instruct it to run jobs, and it will manage them in Mesos for you.  You will also frequently use
+  the scheduler's read-only web interface as a heads-up display for what's running in your cluster.
+
+* **Aurora client**
+  The client (`aurora` command) is a command line tool that exposes primitives that you can use to
+  interact with the scheduler. The client operates on
+
+  Aurora also provides an admin client (`aurora_admin` command) that contains commands built for
+  cluster administrators.  You can use this tool to do things like manage user quotas and manage
+  graceful maintenance on machines in the cluster.
+
+* **Aurora executor**
+  The executor (a.k.a. Thermos executor) is responsible for carrying out the workloads described in
+  the Aurora DSL (`.aurora` files).  The executor is what actually executes user processes.  It will
+  also perform health checking of tasks and register tasks in ZooKeeper for the purposes of dynamic
+  service discovery.
+
+* **Aurora observer**
+  The observer provides browser-based access to the status of individual tasks executing on worker
+  machines.  It gives insight into the processes executing, and facilitates browsing of task sandbox
+  directories.
+
+* **ZooKeeper**
+  [ZooKeeper](http://zookeeper.apache.org) is a distributed consensus system.  In an Aurora cluster
+  it is used for reliable election of the leading Aurora scheduler and Mesos master.  It is also
+  used as a vehicle for service discovery; see [Service Discovery](../../features/service-discovery/).
+
+* **Mesos master**
+  The master is responsible for tracking worker machines and performing accounting of their
+  resources.  The scheduler interfaces with the master to control the cluster.
+
+* **Mesos agent**
+  The agent receives work assigned by the scheduler and executes it.  It interfaces with Linux
+  isolation systems like cgroups, namespaces and Docker to manage the resource consumption of tasks.
+  When a user task is launched, the agent will launch the executor (in the context of a Linux cgroup
+  or Docker container depending upon the environment), which will in turn fork user processes.
+
+  In earlier versions of Mesos and Aurora, the Mesos agent was known as the Mesos slave.
+
+
+Jobs, Tasks and Processes
+--------------------------
+
+Aurora is a Mesos framework used to schedule *jobs* onto Mesos. Mesos
+cares about individual *tasks*, but typical jobs consist of dozens or
+hundreds of task replicas. Aurora provides a layer on top of Mesos with
+its `Job` abstraction. An Aurora `Job` consists of a task template and
+instructions for creating near-identical replicas of that task (modulo
+things like "instance id" or specific port numbers which may differ from
+machine to machine).
+
+The replicas of a task are also referred to as "instances" or "shards".
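
The replica relationship can be sketched with a toy Python model. This is not Aurora's configuration API: the `Process`, `TaskTemplate`, and `Job` classes and the `expand` method below are hypothetical, and Python's `str.format` stands in for the Pystachio templating that real configurations use. Every replica runs the same processes; only the instance id substituted into the command line differs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Process:
    """Hypothetical stand-in for one command line run inside a sandbox."""
    name: str
    cmdline: str  # may contain an {instance} placeholder


@dataclass
class TaskTemplate:
    """Hypothetical stand-in for the task template shared by all replicas."""
    processes: List[Process]


@dataclass
class Job:
    """Hypothetical stand-in for a job: one task template plus a replica count."""
    name: str
    task: TaskTemplate
    instances: int = 1

    def expand(self) -> List[Tuple[int, List[str]]]:
        """Return (instance_id, cmdlines) for each near-identical replica."""
        return [
            (instance_id,
             [p.cmdline.format(instance=instance_id) for p in self.task.processes])
            for instance_id in range(self.instances)
        ]


job = Job(
    name='web',
    task=TaskTemplate(processes=[
        Process('server', 'python server.py --shard={instance}'),
    ]),
    instances=3,
)

for instance_id, cmdlines in job.expand():
    print(instance_id, cmdlines)
```
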
+
+A task can merely be a single *process* corresponding to a single
+command line, such as `python2.7 my_script.py`. However, a task can also
+consist of many separate processes, which all run within a single
+sandbox. For example, a task might run multiple cooperating agents together,
+such as `logrotate`, `installer`, master, or agent processes. This is
+where Thermos comes in. While Aurora provides a `Job` abstraction on
+top of Mesos `Tasks`, Thermos provides a `Process` abstraction
+underneath Mesos `Task`s and serves as part of the Aurora framework's
+executor.
+
+You define `Job`s, `Task`s, and `Process`es in a configuration file.
+Configuration files are written in Python, and make use of the
+[Pystachio](https://github.com/wickman/pystachio) templating language,
+along with specific Aurora, Mesos, and Thermos commands and methods.
+The configuration files typically end with a `.aurora` extension.
+
+Summary:
+
+* Aurora manages jobs made of tasks.
+* Mesos manages tasks made of processes.
+* Thermos manages processes.
+* All of these are defined in `.aurora` configuration files.
+
+![Aurora hierarchy](../images/aurora_hierarchy.png)
+
+Each `Task` has a *sandbox* created when the `Task` starts and garbage
+collected when it finishes. All of a `Task`'s processes run in its
+sandbox, so processes can share state by using a shared current working
+directory.
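
A minimal illustration of sharing state through a common working directory, assuming nothing about how Thermos itself launches processes: two Python subprocesses stand in for two Processes of one Task, and a temporary directory stands in for the sandbox. Because both run with the same `cwd`, a relative path written by the first is visible to the second.

```python
import subprocess
import sys
import tempfile

# A temporary directory stands in for the task's sandbox.
sandbox = tempfile.mkdtemp(prefix='sandbox-')

# First "process": write state to a path relative to the shared working directory.
subprocess.run(
    [sys.executable, '-c',
     "with open('state.txt', 'w') as f: f.write('ready')"],
    cwd=sandbox,
    check=True,
)

# Second "process": read the same relative path and print its contents.
result = subprocess.run(
    [sys.executable, '-c', "print(open('state.txt').read())"],
    cwd=sandbox,
    check=True,
    stdout=subprocess.PIPE,
    universal_newlines=True,
)

print(result.stdout.strip())  # ready
```
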
+
+The sandbox garbage collection policy considers many factors, most
+importantly age and size. It makes a best-effort attempt to keep
+sandboxes around as long as possible post-task in order for service
+owners to inspect data and logs, should the `Task` have completed
+abnormally. However, you can't design your applications assuming sandboxes
+will be around forever; instead, build log saving or other
+checkpointing mechanisms directly into your application or into your
+`Job` description.
+

Added: aurora/site/source/documentation/0.16.0/getting-started/tutorial.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/getting-started/tutorial.md?rev=1762695&view=auto
==============================================================================
--- aurora/site/source/documentation/0.16.0/getting-started/tutorial.md (added)
+++ aurora/site/source/documentation/0.16.0/getting-started/tutorial.md Wed Sep 28 18:23:53 2016
@@ -0,0 +1,258 @@
+# Aurora Tutorial
+
+This tutorial shows how to use the Aurora scheduler to run (and "`printf-debug`")
+a hello world program on Mesos. This is the recommended document for new Aurora users
+to start getting up to speed on the system.
+
+- [Prerequisite](#prerequisite)
+- [The Script](#the-script)
+- [Aurora Configuration](#aurora-configuration)
+- [Creating the Job](#creating-the-job)
+- [Watching the Job Run](#watching-the-job-run)
+- [Cleanup](#cleanup)
+- [Next Steps](#next-steps)
+
+
+## Prerequisite
+
+This tutorial assumes you are running [Aurora locally using Vagrant](../vagrant/).
+However, in general the instructions are also applicable to any other
+[Aurora installation](../../operations/installation/).
+
+Unless otherwise stated, all commands are to be run from the root of the aurora
+repository clone.
+
+
+## The Script
+
+Our "hello world" application is a simple Python script that loops
+forever, displaying the time every few seconds. Copy the code below and
+put it in a file named `hello_world.py` in the root of your Aurora repository clone
+(Note: this directory is the same as `/vagrant` inside the Vagrant VMs).
+
+The script has an intentional bug, which we will explain later on.
+
+<!-- NOTE: If you are changing this file, be sure to also update examples/vagrant/test_tutorial.sh.
+-->
+```python
+import time
+
+def main():
+  SLEEP_DELAY = 10
+  # Python experts - ignore this blatant bug.
+  for i in xrang(100):
+    print("Hello world! The time is now: %s. Sleeping for %d secs" % (
+      time.asctime(), SLEEP_DELAY))
+    time.sleep(SLEEP_DELAY)
+
+if __name__ == "__main__":
+  main()
+```
+
+## Aurora Configuration
+
+Once we have our script/program, we need to create a *configuration
+file* that tells Aurora how to manage and launch our Job. Save the below
+code in the file `hello_world.aurora`.
+
+<!-- NOTE: If you are changing this file, be sure to also update examples/vagrant/test_tutorial.sh.
+-->
+```python
+pkg_path = '/vagrant/hello_world.py'
+
+# we use a trick here to make the configuration change with
+# the contents of the file, for simplicity.  in a normal setting, packages would be
+# versioned, and the version number would be changed in the configuration.
+import hashlib
+with open(pkg_path, 'rb') as f:
+  pkg_checksum = hashlib.md5(f.read()).hexdigest()
+
+# copy hello_world.py into the local sandbox
+install = Process(
+  name = 'fetch_package',
+  cmdline = 'cp %s . && echo %s && chmod +x hello_world.py' % (pkg_path, pkg_checksum))
+
+# run the script
+hello_world = Process(
+  name = 'hello_world',
+  cmdline = 'python -u hello_world.py')
+
+# describe the task
+hello_world_task = SequentialTask(
+  processes = [install, hello_world],
+  resources = Resources(cpu = 1, ram = 1*MB, disk=8*MB))
+
+jobs = [
+  Service(cluster = 'devcluster',
+          environment = 'devel',
+          role = 'www-data',
+          name = 'hello_world',
+          task = hello_world_task)
+]
+```
+
+There is a lot going on in that configuration file:
+
+1. From a "big picture" viewpoint, it first defines two
+Processes. Then it defines a Task that runs the two Processes in the
+order specified in the Task definition, as well as specifying what
+computational and memory resources are available for them.  Finally,
+it defines a Job that will schedule the Task on available and suitable
+machines. This Job is the sole member of a list of Jobs; you can
+specify more than one Job in a config file.
+
+2. At the Process level, it specifies how to get your code into the
+local sandbox in which it will run. It then specifies how the code is
+actually run once the second Process starts.
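
The checksum trick at the top of the configuration can be tried in plain Python. The sketch below uses a hypothetical `install_cmdline` helper (not part of Aurora) that rebuilds the `fetch_package` command line the same way `hello_world.aurora` does, for two versions of a script. Editing the file changes the checksum, and therefore the command line, which is what makes the configuration change with the contents of the file.

```python
import hashlib
import os
import tempfile


def install_cmdline(pkg_path):
    """Build the fetch_package cmdline the same way hello_world.aurora does."""
    with open(pkg_path, 'rb') as f:
        pkg_checksum = hashlib.md5(f.read()).hexdigest()
    return 'cp %s . && echo %s && chmod +x hello_world.py' % (pkg_path, pkg_checksum)


# Write the buggy script, then the fixed one, to the same temporary path.
pkg_path = os.path.join(tempfile.mkdtemp(), 'hello_world.py')

with open(pkg_path, 'w') as f:
    f.write('for i in xrang(100): pass\n')
before = install_cmdline(pkg_path)

with open(pkg_path, 'w') as f:
    f.write('for i in xrange(100): pass\n')
after = install_cmdline(pkg_path)

# The echoed checksum differs, so the two task configurations differ too.
print(before != after)  # True
```
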
+
+For more about Aurora configuration files, see the [Configuration
+Tutorial](../../reference/configuration-tutorial/) and the [Configuration
+Reference](../../reference/configuration/) (preferably after finishing this
+tutorial).
+
+
+## Creating the Job
+
+We're ready to launch our job! To do so, we use the Aurora Client to
+issue a Job creation request to the Aurora scheduler.
+
+Many Aurora Client commands take a *job key* argument, which uniquely
+identifies a Job. A job key consists of four parts, each separated by a
+"/". The four parts are  `<cluster>/<role>/<environment>/<jobname>`
+in that order:
+
+* Cluster refers to the name of a particular Aurora installation.
+* Role names are user accounts existing on the agent machines. If you
+don't know what accounts are available, contact your sysadmin.
+* Environment names are namespaces; you can count on `test`, `devel`,
+`staging` and `prod` existing.
+* Jobname is the custom name of your job.
+
+When comparing two job keys, if any of the four parts is different from
+its counterpart in the other key, then the two job keys identify two separate
+jobs. If all four values are identical, the job keys identify the same job.
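
To make the format and the comparison rule concrete, here is a small Python sketch. `JobKey` is a hypothetical helper, not the Aurora client's own job-key type; it only splits the `<cluster>/<role>/<environment>/<jobname>` string and relies on tuple equality, which compares all four parts.

```python
from typing import NamedTuple


class JobKey(NamedTuple):
    """Hypothetical helper: the four parts of an Aurora job key."""
    cluster: str
    role: str
    environment: str
    name: str

    @classmethod
    def parse(cls, text: str) -> 'JobKey':
        """Split '<cluster>/<role>/<environment>/<jobname>' into its four parts."""
        parts = text.split('/')
        if len(parts) != 4 or not all(parts):
            raise ValueError(
                'expected <cluster>/<role>/<environment>/<jobname>, got %r' % text)
        return cls(*parts)

    def __str__(self) -> str:
        return '/'.join(self)


a = JobKey.parse('devcluster/www-data/devel/hello_world')
b = JobKey.parse('devcluster/www-data/prod/hello_world')

# Tuple equality compares all four parts, matching the rule described above.
print(a == JobKey.parse(str(a)))  # True
print(a == b)                     # False: the environments differ
```
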
+
+The `clusters.json` [client configuration](../../reference/client-cluster-configuration/)
+for the Aurora scheduler defines the available cluster names.
+For Vagrant, from the top-level of your Aurora repository clone, do:
+
+    $ vagrant ssh
+
+Followed by:
+
+    vagrant@aurora:~$ cat /etc/aurora/clusters.json
+
+You'll see something like the following. The `name` value shown here corresponds to a job key's cluster value.
+
+```javascript
+[{
+  "name": "devcluster",
+  "zk": "192.168.33.7",
+  "scheduler_zk_path": "/aurora/scheduler",
+  "auth_mechanism": "UNAUTHENTICATED",
+  "slave_run_directory": "latest",
+  "slave_root": "/var/lib/mesos"
+}]
+```
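
As a sketch of how the cluster part of a job key relates to this file, the Python below reads a `clusters.json` document (the sample above, inlined as a string) and checks the first component of a job key against the defined names. `cluster_names` and `check_job_key_cluster` are hypothetical helpers, not part of the Aurora client.

```python
import json

CLUSTERS_JSON = '''
[{
  "name": "devcluster",
  "zk": "192.168.33.7",
  "scheduler_zk_path": "/aurora/scheduler",
  "auth_mechanism": "UNAUTHENTICATED",
  "slave_run_directory": "latest",
  "slave_root": "/var/lib/mesos"
}]
'''


def cluster_names(clusters_json):
    """Return the cluster names defined in a clusters.json document."""
    return [cluster['name'] for cluster in json.loads(clusters_json)]


def check_job_key_cluster(job_key, clusters_json):
    """Return the job key's cluster, or raise ValueError if it is not defined."""
    cluster = job_key.split('/', 1)[0]
    names = cluster_names(clusters_json)
    if cluster not in names:
        raise ValueError('unknown cluster %r; known clusters: %s'
                         % (cluster, ', '.join(names)))
    return cluster


print(cluster_names(CLUSTERS_JSON))  # ['devcluster']
print(check_job_key_cluster('devcluster/www-data/devel/hello_world', CLUSTERS_JSON))
```

Inside the VM, passing `open('/etc/aurora/clusters.json').read()` instead of the inlined string checks against the real file.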
+
+The Aurora Client command that actually runs our Job is `aurora job create`. It creates a Job as
+specified by its job key and configuration file arguments and runs it.
+
+    aurora job create <cluster>/<role>/<environment>/<jobname> <config_file>
+
+Or for our example:
+
+    aurora job create devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora
+
+After entering our virtual machine using `vagrant ssh`, this returns:
+
+    vagrant@aurora:~$ aurora job create devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora
+     INFO] Creating job hello_world
+     INFO] Checking status of devcluster/www-data/devel/hello_world
+    Job create succeeded: job url=http://aurora.local:8081/scheduler/www-data/devel/hello_world
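
The job URL printed above follows the pattern `<scheduler>/scheduler/<role>/<environment>/<jobname>`. The helper below is a hypothetical sketch inferred only from this sample output; it builds the same URL from a job key. The cluster part is not in the path because it selects which scheduler to talk to in the first place.

```python
def job_page_url(scheduler_url, job_key):
    """Build the scheduler UI URL for a job, following the sample output's pattern."""
    # The cluster only selects the scheduler, so it is not part of the path.
    _cluster, role, environment, name = job_key.split('/')
    return '%s/scheduler/%s/%s/%s' % (scheduler_url.rstrip('/'), role, environment, name)


print(job_page_url('http://aurora.local:8081', 'devcluster/www-data/devel/hello_world'))
# http://aurora.local:8081/scheduler/www-data/devel/hello_world
```
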
+
+
+## Watching the Job Run
+
+Now that our job is running, let's see what it's doing. Access the
+scheduler web interface at `http://$scheduler_hostname:$scheduler_port/scheduler`,
+or, when using Vagrant, at `http://192.168.33.7:8081/scheduler`.
+First we see what Jobs are scheduled:
+
+![Scheduled Jobs](../images/ScheduledJobs.png)
+
+Click on your user name, which in this case was `www-data`, and we see the Jobs associated
+with that role:
+
+![Role Jobs](../images/RoleJobs.png)
+
+If you click on your `hello_world` Job, you'll see:
+
+![hello_world Job](../images/HelloWorldJob.png)
+
+Oops, looks like our first job didn't quite work! The task is temporarily throttled for
+having failed on every attempt of the Aurora scheduler to run it. We have to figure out
+what is going wrong.
+
+On the Completed tasks tab, we see all past attempts of the Aurora scheduler to run our job.
+
+![Completed tasks tab](../images/CompletedTasks.png)
+
+We can navigate to the Task page of a failed run by clicking on the host link.
+
+![Task page](../images/TaskBreakdown.png)
+
+Once there, we see that the `hello_world` process failed. The Task page
+captures the standard error and standard output streams and makes them available.
+Clicking through to `stderr` on the failed `hello_world` process, we see what happened.
+
+![stderr page](../images/stderr.png)
+
+It looks like we made a typo in our Python script. We wanted `xrange`,
+not `xrang`. Edit `hello_world.py` to use the correct function and save
+it in place. Because `hello_world.aurora` embeds a checksum of the script,
+the configuration changes along with it and now describes the newest version.
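
For reference, one corrected `main` is sketched below (keep the original `if __name__ == "__main__":` block at the bottom of the file). It uses `range`, which works on both Python 2 and Python 3, whereas `xrange` exists only in Python 2. The `iterations` and `sleep_delay` parameters are an illustrative addition so the loop can be shortened when trying it locally; their defaults reproduce the original behavior.

```python
import time


def main(iterations=100, sleep_delay=10):
  # range() works on Python 2 and 3; xrange() exists only in Python 2.
  for _ in range(iterations):
    print("Hello world! The time is now: %s. Sleeping for %d secs" % (
      time.asctime(), sleep_delay))
    time.sleep(sleep_delay)
```
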
+
+In order to try again, we can now instruct the scheduler to update our job:
+
+    vagrant@aurora:~$ aurora update start devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora
+     INFO] Starting update for: hello_world
+    Job update has started. View your update progress at http://aurora.local:8081/scheduler/www-data/devel/hello_world/update/8ef38017-e60f-400d-a2f2-b5a8b724e95b
+
+This time, the task comes up.
+
+![Running Job](../images/RunningJob.png)
+
+By again clicking on the host, we inspect the Task page, and see that the
+`hello_world` process is running.
+
+![Running Task page](../images/runningtask.png)
+
+We then inspect the output by clicking on `stdout` and see our process'
+output:
+
+![stdout page](../images/stdout.png)
+
+## Cleanup
+
+Now that we're done, we kill the job using the Aurora client:
+
+    vagrant@aurora:~$ aurora job killall devcluster/www-data/devel/hello_world
+     INFO] Killing tasks for job: devcluster/www-data/devel/hello_world
+     INFO] Instances to be killed: [0]
+    Successfully killed instances [0]
+    Job killall succeeded
+
+The job page now shows the `hello_world` tasks as completed.
+
+![Killed Task page](../images/killedtask.png)
+
+## Next Steps
+
+Now that you've finished this Tutorial, you should read or do the following:
+
+- [The Aurora Configuration Tutorial](../../reference/configuration-tutorial/), which provides more examples
+  and best practices for writing Aurora configurations. You should also look at
+  the [Aurora Configuration Reference](../../reference/configuration/).
+- Explore the Aurora Client - use `aurora -h`, and read the
+  [Aurora Client Commands](../../reference/client-commands/) document.

Added: aurora/site/source/documentation/0.16.0/getting-started/vagrant.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/getting-started/vagrant.md?rev=1762695&view=auto
==============================================================================
--- aurora/site/source/documentation/0.16.0/getting-started/vagrant.md (added)
+++ aurora/site/source/documentation/0.16.0/getting-started/vagrant.md Wed Sep 28 18:23:53 2016
@@ -0,0 +1,154 @@
+A Local Cluster with Vagrant
+============================
+
+This document shows you how to configure a complete cluster using a virtual machine. This setup
+replicates a real cluster in your development machine as closely as possible. After you complete
+the steps outlined here, you will be ready to create and run your first Aurora job.
+
+The following sections describe these steps in detail:
+
+1. [Overview](#overview)
+1. [Install VirtualBox and Vagrant](#install-virtualbox-and-vagrant)
+1. [Clone the Aurora repository](#clone-the-aurora-repository)
+1. [Start the local cluster](#start-the-local-cluster)
+1. [Log onto the VM](#log-onto-the-vm)
+1. [Run your first job](#run-your-first-job)
+1. [Rebuild components](#rebuild-components)
+1. [Shut down or delete your local cluster](#shut-down-or-delete-your-local-cluster)
+1. [Troubleshooting](#troubleshooting)
+
+
+Overview
+--------
+
+The Aurora distribution includes a set of scripts that enable you to create a local cluster in
+your development machine. These scripts use [Vagrant](https://www.vagrantup.com/) and
+[VirtualBox](https://www.virtualbox.org/) to run and configure a virtual machine. Once the
+virtual machine is running, the scripts install and initialize Aurora and any required components
+to create the local cluster.
+
+
+Install VirtualBox and Vagrant
+------------------------------
+
+First, download and install [VirtualBox](https://www.virtualbox.org/) on your development machine.
+
+Then download and install [Vagrant](https://www.vagrantup.com/). To verify that the installation
+was successful, open a terminal window and type the `vagrant` command. You should see a list of
+common commands for this tool.
+
+
+Clone the Aurora repository
+---------------------------
+
+To obtain the Aurora source distribution, clone its Git repository using the following command:
+
+     git clone git://git.apache.org/aurora.git
+
+
+Start the local cluster
+-----------------------
+
+Now change into the `aurora/` directory, which contains the Aurora source code and
+other scripts and tools:
+
+     cd aurora/
+
+To start the local cluster, type the following command:
+
+     vagrant up
+
+This command uses the configuration scripts in the Aurora distribution to:
+
+* Download a Linux system image.
+* Start a virtual machine (VM) and configure it.
+* Install the required build tools on the VM.
+* Install Aurora's requirements (like [Mesos](http://mesos.apache.org/) and
+[ZooKeeper](http://zookeeper.apache.org/)) on the VM.
+* Build and install Aurora from source on the VM.
+* Start Aurora's services on the VM.
+
+This process takes several minutes to complete.
+
+You may notice a warning that guest additions in the VM don't match your version of VirtualBox.
+This should generally be harmless, but you may wish to install a vagrant plugin to take care of
+mismatches like this for you:
+
+     vagrant plugin install vagrant-vbguest
+
+With this plugin installed, whenever you `vagrant up` the plugin will upgrade the guest additions
+for you when a version mismatch is detected. You can read more about the plugin
+[here](https://github.com/dotless-de/vagrant-vbguest).
+
+To verify that Aurora is running on the cluster, visit the following URLs:
+
+* Scheduler - http://192.168.33.7:8081
+* Observer - http://192.168.33.7:1338
+* Mesos Master - http://192.168.33.7:5050
+* Mesos Agent - http://192.168.33.7:5051
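
If you prefer to check these endpoints from a script rather than a browser, here is a minimal Python 3 sketch using only the standard library. `ENDPOINTS` restates the list above and `probe` is a hypothetical helper: any HTTP status (even an error status) means the service answered, while `None` means no connection could be made.

```python
import urllib.error
import urllib.request

ENDPOINTS = {
    'Scheduler': 'http://192.168.33.7:8081',
    'Observer': 'http://192.168.33.7:1338',
    'Mesos Master': 'http://192.168.33.7:5050',
    'Mesos Agent': 'http://192.168.33.7:5051',
}


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it could not be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as e:
        # The server answered, just not with a 2xx/3xx status.
        return e.code
    except OSError:
        # URLError, refused connections and timeouts are all OSError subclasses.
        return None
```

From your development machine, `for name, url in ENDPOINTS.items(): print(name, url, probe(url))` prints one line per component.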
+
+
+Log onto the VM
+---------------
+
+To SSH into the VM, run the following command in your development machine:
+
+     vagrant ssh
+
+To verify that Aurora is installed in the VM, type the `aurora` command. You should see a list
+of arguments and possible commands.
+
+The `/vagrant` directory on the VM is mapped to the `aurora/` local directory
+from which you started the cluster. You can edit files inside this directory in your development
+machine and access them from the VM under `/vagrant`.
+
+A pre-installed `clusters.json` file refers to your local cluster as `devcluster`, which you
+will use in client commands.
+
+
+Run your first job
+------------------
+
+Now that your cluster is up and running, you are ready to define and run your first job in Aurora.
+For more information, see the [Aurora Tutorial](../tutorial/).
+
+
+Rebuild components
+------------------
+
+If you are changing Aurora code and would like to rebuild a component, you can use the `aurorabuild`
+command on the VM to build and restart a component.  This is considerably faster than destroying
+and rebuilding your VM.
+
+`aurorabuild` accepts a list of components to build and update. To get a list of supported
+components, invoke the `aurorabuild` command with no arguments. For example, to rebuild the
+client from your development machine:
+
+     vagrant ssh -c 'aurorabuild client'
+
+
+Shut down or delete your local cluster
+--------------------------------------
+
+To shut down your local cluster, run the `vagrant halt` command in your development machine. To
+start it again, run the `vagrant up` command.
+
+Once you are finished with your local cluster, or if you would otherwise like to start from scratch,
+you can use the command `vagrant destroy` to turn off and delete the virtual file system.
+
+
+Troubleshooting
+---------------
+
+Most Vagrant-related problems can be fixed by the following steps:
+
+* Destroying the vagrant environment with `vagrant destroy`
+* Killing any orphaned VMs (see AURORA-499) with the VirtualBox UI or the `VBoxManage` command line tool
+* Cleaning the repository of build artifacts and other intermediate output with `git clean -fdx`
+* Bringing up the vagrant environment with `vagrant up`
+
+If that still doesn't solve your problem, make sure to inspect the log files:
+
+* Scheduler: `/var/log/upstart/aurora-scheduler.log`
+* Observer: `/var/log/upstart/aurora-thermos-observer.log`
+* Mesos Master: `/var/log/mesos/mesos-master.INFO` (also see `.WARNING` and `.ERROR`)
+* Mesos Agent: `/var/log/mesos/mesos-slave.INFO` (also see `.WARNING` and `.ERROR`)
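
To look at the end of every log in one pass, here is a small sketch, assuming Python 3 inside the VM (reading these paths may require `sudo`). `LOG_FILES` and `tail` are hypothetical helpers; the `.WARNING` and `.ERROR` variants can be added to the dictionary the same way.

```python
from collections import deque

LOG_FILES = {
    'Scheduler': '/var/log/upstart/aurora-scheduler.log',
    'Observer': '/var/log/upstart/aurora-thermos-observer.log',
    'Mesos Master': '/var/log/mesos/mesos-master.INFO',
    'Mesos Agent': '/var/log/mesos/mesos-slave.INFO',
}


def tail(path, lines=20):
    """Return the last `lines` lines of a text file, or [] if it cannot be read."""
    try:
        with open(path, errors='replace') as f:
            # A bounded deque keeps only the most recent lines while streaming the file.
            return list(deque(f, maxlen=lines))
    except OSError:
        return []
```

For example, `for name, path in LOG_FILES.items(): print('==', name, '=='); print(''.join(tail(path)))` prints the last 20 lines of each log that can be read.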

Added: aurora/site/source/documentation/0.16.0/images/CPUavailability.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/CPUavailability.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/CPUavailability.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/CompletedTasks.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/CompletedTasks.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/CompletedTasks.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/HelloWorldJob.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/HelloWorldJob.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/HelloWorldJob.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/RoleJobs.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/RoleJobs.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/RoleJobs.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/RunningJob.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/RunningJob.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/RunningJob.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/ScheduledJobs.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/ScheduledJobs.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/ScheduledJobs.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/TaskBreakdown.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/TaskBreakdown.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/TaskBreakdown.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/aurora_hierarchy.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/aurora_hierarchy.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/aurora_hierarchy.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/aurora_logo.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/aurora_logo.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/aurora_logo.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/components.odg
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/components.odg?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/components.odg
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/components.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/components.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/components.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/debug-client-test.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/debug-client-test.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/debug-client-test.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/debugging-client-test.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/debugging-client-test.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/debugging-client-test.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/killedtask.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/killedtask.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/killedtask.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/lifeofatask.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/lifeofatask.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/lifeofatask.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/02_19_2015_aurora_adopters_panel_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/02_19_2015_aurora_adopters_panel_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/02_19_2015_aurora_adopters_panel_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/02_19_2015_aurora_at_tellapart_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/02_19_2015_aurora_at_tellapart_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/02_19_2015_aurora_at_tellapart_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/02_19_2015_aurora_at_twitter_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/02_19_2015_aurora_at_twitter_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/02_19_2015_aurora_at_twitter_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/02_28_2015_apache_aurora_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/02_28_2015_apache_aurora_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/02_28_2015_apache_aurora_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/03_07_2015_aurora_mesos_in_practice_at_twitter_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/03_07_2015_aurora_mesos_in_practice_at_twitter_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/03_07_2015_aurora_mesos_in_practice_at_twitter_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/03_25_2014_introduction_to_aurora_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/03_25_2014_introduction_to_aurora_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/03_25_2014_introduction_to_aurora_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/04_30_2015_monolith_to_microservices_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/04_30_2015_monolith_to_microservices_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/04_30_2015_monolith_to_microservices_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/08_21_2014_past_present_future_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/08_21_2014_past_present_future_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/08_21_2014_past_present_future_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/09_20_2015_shipping_code_with_aurora_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/09_20_2015_shipping_code_with_aurora_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/09_20_2015_shipping_code_with_aurora_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/09_20_2015_twitter_production_scale_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/09_20_2015_twitter_production_scale_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/09_20_2015_twitter_production_scale_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/10_08_2015_mesos_aurora_on_a_small_scale_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/10_08_2015_mesos_aurora_on_a_small_scale_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/10_08_2015_mesos_aurora_on_a_small_scale_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/presentations/10_08_2015_sla_aware_maintenance_for_operators_thumb.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/presentations/10_08_2015_sla_aware_maintenance_for_operators_thumb.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/presentations/10_08_2015_sla_aware_maintenance_for_operators_thumb.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/runningtask.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/runningtask.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/runningtask.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/stderr.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/stderr.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/stderr.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/stdout.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/stdout.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/stdout.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/images/storage_hierarchy.png
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/images/storage_hierarchy.png?rev=1762695&view=auto
==============================================================================
Binary file - no diff available.

Propchange: aurora/site/source/documentation/0.16.0/images/storage_hierarchy.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: aurora/site/source/documentation/0.16.0/index.html.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/index.html.md?rev=1762695&view=auto
==============================================================================
--- aurora/site/source/documentation/0.16.0/index.html.md (added)
+++ aurora/site/source/documentation/0.16.0/index.html.md Wed Sep 28 18:23:53 2016
@@ -0,0 +1,75 @@
+## Introduction
+
+Apache Aurora is a service scheduler that runs on top of Apache Mesos, enabling you to run
+long-running services, cron jobs, and ad-hoc jobs that take advantage of Apache Mesos' scalability,
+fault-tolerance, and resource isolation.
+
+We encourage you to ask questions on the [Aurora user list](http://aurora.apache.org/community/) or
+the `#aurora` IRC channel on `irc.freenode.net`.
+
+
+## Getting Started
+Information for everyone new to Apache Aurora.
+
+ * [Aurora System Overview](getting-started/overview/)
+ * [Hello World Tutorial](getting-started/tutorial/)
+ * [Local cluster with Vagrant](getting-started/vagrant/)
+
+## Features
+Description of important Aurora features.
+
+ * [Containers](features/containers/)
+ * [Cron Jobs](features/cron-jobs/)
+ * [Custom Executors](features/custom-executors/)
+ * [Job Updates](features/job-updates/)
+ * [Multitenancy](features/multitenancy/)
+ * [Resource Isolation](features/resource-isolation/)
+ * [Scheduling Constraints](features/constraints/)
+ * [Services](features/services/)
+ * [Service Discovery](features/service-discovery/)
+ * [SLA Metrics](features/sla-metrics/)
+ * [Webhooks](features/webhooks/)
+
+## Operators
+For those that wish to manage and fine-tune an Aurora cluster.
+
+ * [Installation](operations/installation/)
+ * [Configuration](operations/configuration/)
+ * [Monitoring](operations/monitoring/)
+ * [Security](operations/security/)
+ * [Storage](operations/storage/)
+ * [Backup](operations/backup-restore/)
+
+## Reference
+The complete reference of commands, configuration options, and scheduler internals.
+
+ * [Task lifecycle](reference/task-lifecycle/)
+ * Configuration (`.aurora` files)
+    - [Configuration Reference](reference/configuration/)
+    - [Configuration Tutorial](reference/configuration-tutorial/)
+    - [Configuration Best Practices](reference/configuration-best-practices/)
+    - [Configuration Templating](reference/configuration-templating/)
+ * Aurora Client
+    - [Client Commands](reference/client-commands/)
+    - [Client Hooks](reference/client-hooks/)
+    - [Client Cluster Configuration](reference/client-cluster-configuration/)
+ * [Scheduler Configuration](reference/scheduler-configuration/)
+
+## Additional Resources
+ * [Tools integrating with Aurora](additional-resources/tools/)
+ * [Presentation videos and slides](additional-resources/presentations/)
+
+## Developers
+All the information you need to start modifying Aurora and contributing back to the project.
+
+ * [Contributing to the project](contributing/)
+ * [Committer's Guide](development/committers-guide/)
+ * [Design Documents](development/design-documents/)
+ * Developing the Aurora components:
+     - [Client](development/client/)
+     - [Scheduler](development/scheduler/)
+     - [Scheduler UI](development/ui/)
+     - [Thermos](development/thermos/)
+     - [Thrift structures](development/thrift/)
+
+

Added: aurora/site/source/documentation/0.16.0/operations/backup-restore.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.16.0/operations/backup-restore.md?rev=1762695&view=auto
==============================================================================
--- aurora/site/source/documentation/0.16.0/operations/backup-restore.md (added)
+++ aurora/site/source/documentation/0.16.0/operations/backup-restore.md Wed Sep 28 18:23:53 2016
@@ -0,0 +1,91 @@
+# Recovering from a Scheduler Backup
+
+**Be sure to read the entire page before attempting to restore from a backup, as it may have
+unintended consequences.**
+
+# Summary
+
+The restoration procedure replaces the existing (possibly corrupted) Mesos replicated log with an
+earlier, backed-up version and requires all schedulers to be taken down temporarily while
+restoring. Once completed, the scheduler state resets to what it was when the backup was created.
+This means any jobs/tasks created or updated after the backup are unknown to the scheduler and will
+be killed shortly after the cluster restarts. All other tasks continue operating as normal.
+
+It is usually a bad idea to restore a backup that is not extremely recent (i.e. one older than a few
+hours). The scheduler expects the cluster to look exactly as the backup describes, so any tasks
+that have been rescheduled since the backup was taken will be killed.
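
Before restoring, it can help to pick the newest backup and check how old it is. The following is a minimal sketch, not part of Aurora. It assumes backup files follow the `scheduler-backup-<yyyy-MM-dd-HH-mm>` naming used by the staging command in this guide. It also assumes the timestamp and `now` use the same time zone. The function name `newest_backup` and the 3-hour `max_age` cut-off are hypothetical; the cut-off stands in for "a few hours".

```python
from datetime import datetime, timedelta

BACKUP_PREFIX = "scheduler-backup-"
BACKUP_STAMP = "%Y-%m-%d-%H-%M"  # the yyyy-MM-dd-HH-mm part of the file name


def newest_backup(names, now, max_age=timedelta(hours=3)):
    """Return (name, age, too_old) for the most recent backup in `names`.

    `names` are file names from the scheduler's -backup_dir; anything that does
    not match scheduler-backup-<yyyy-MM-dd-HH-mm> is ignored. `max_age` is a
    hypothetical cut-off standing in for "a few hours".
    """
    stamped = []
    for name in names:
        if not name.startswith(BACKUP_PREFIX):
            continue
        try:
            taken = datetime.strptime(name[len(BACKUP_PREFIX):], BACKUP_STAMP)
        except ValueError:
            continue  # right prefix, but the timestamp does not parse
        stamped.append((taken, name))
    if not stamped:
        raise FileNotFoundError("no scheduler backups found")
    taken, name = max(stamped)  # latest timestamp wins
    age = now - taken
    return name, age, age > max_age
```

A typical call would be `newest_backup(os.listdir(backup_dir), datetime.now())`, where `backup_dir` is the scheduler's `-backup_dir`. If `too_old` is true, weigh the warning above before going further.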
+
+The instructions below have been verified in the [Vagrant environment](../../getting-started/vagrant/) and,
+with minor syntax/path changes, should be applicable to any Aurora cluster.
+
+# Preparation
+
+Follow these steps to prepare the cluster for restoring from a backup:
+
+* Stop all scheduler instances
+
+* Consider blocking external traffic on the port defined by `-http_port` for all schedulers to
+prevent users from interacting with the scheduler during the restoration process. This helps
+troubleshooting by reducing scheduler log noise, and prevents users from making changes that would
+be erased after the backup snapshot is restored.
+
+* Configure `aurora_admin` access to run all commands listed in the
+  [Restore from backup](#restore-from-backup) section locally on the leading scheduler:
+  * Make sure the [clusters.json](../../reference/client-cluster-configuration/) file is configured to
+    access the scheduler directly: set `scheduler_uri` and remove `zk`. Since the leader can be
+    re-elected during the restore steps, consider doing this on all scheduler replicas.
+  * Depending on your security setup, you will need to either turn off scheduler
+    authentication by removing the scheduler `-http_authentication_mechanism` flag, or make sure
+    direct scheduler access is properly authorized. For example, with Kerberos you will need to add
+    an `/etc/hosts` entry that maps your local IP to the scheduler URL configured in the keytabs:
+
+        <local_ip> <scheduler_domain_in_keytabs>
+
+* The next steps put the scheduler into a partially disabled state in which it can still accept
+storage recovery requests but cannot schedule tasks or change task states. This is accomplished by
+updating the following scheduler configuration options:
+  * Set `-mesos_master_address` to a non-existent ZooKeeper address. This prevents the scheduler from
+    registering with Mesos. E.g.: `-mesos_master_address=zk://localhost:1111/mesos/master`
+  * Set `-max_registration_delay` to a sufficiently long interval to prevent a registration timeout
+    and, as a result, scheduler suicide. E.g.: `-max_registration_delay=360mins`
+  * Make sure the `-reconciliation_initial_delay` option is set high enough (e.g.: `365days`) to
+    prevent accidental task GC. This is important because the scheduler will attempt to reconcile the
+    cluster state and will kill all tasks when restarted with an empty Mesos replicated log.
+
+* Restart all schedulers
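
The `clusters.json` change from the preparation steps can be scripted. The following is a minimal sketch, not an Aurora tool. It assumes `clusters.json` is a JSON list of cluster objects keyed by `name`, as in the client cluster configuration reference. The helper name `direct_access_clusters` and the example URI `http://localhost:8081` are hypothetical.

```python
import json


def direct_access_clusters(clusters_json, cluster_name, scheduler_uri):
    """Return clusters.json text with `cluster_name` pointed straight at one scheduler.

    Sets `scheduler_uri` and drops `zk` so the client talks to the given
    scheduler instead of resolving the leader through ZooKeeper.
    Other clusters and other keys are left untouched.
    """
    clusters = json.loads(clusters_json)
    for cluster in clusters:
        if cluster.get("name") == cluster_name:
            cluster["scheduler_uri"] = scheduler_uri
            cluster.pop("zk", None)  # no-op if `zk` was already absent
            break
    else:
        raise KeyError(f"cluster {cluster_name!r} not found")
    return json.dumps(clusters, indent=2)
```

Keep a copy of the original file so it can be put back during the final cleanup step.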
+
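If the scheduler command line is generated by a wrapper script, the three overrides can be applied and later undone mechanically. The following is a minimal sketch. It assumes flags are passed as a list of `-name=value` strings. It uses the example values given above. All names here are hypothetical.

```python
# Example values from the preparation steps; adjust to taste.
RECOVERY_OVERRIDES = {
    "-mesos_master_address": "zk://localhost:1111/mesos/master",
    "-max_registration_delay": "360mins",
    "-reconciliation_initial_delay": "365days",
}


def apply_recovery_overrides(argv):
    """Return (new_argv, saved).

    `saved` maps each overridden flag to its previous value, or None if the
    flag was not set, so the change can be undone during cleanup.
    """
    saved = {flag: None for flag in RECOVERY_OVERRIDES}
    new_argv = []
    for arg in argv:
        name, sep, value = arg.partition("=")  # split on the first '=' only
        if sep and name in RECOVERY_OVERRIDES:
            saved[name] = value
            continue  # drop the old setting; the override is appended below
        new_argv.append(arg)
    new_argv.extend(f"{flag}={value}" for flag, value in RECOVERY_OVERRIDES.items())
    return new_argv, saved


def undo_recovery_overrides(argv, saved):
    """Remove the overrides and restore whatever was set before."""
    restored = [a for a in argv if a.partition("=")[0] not in RECOVERY_OVERRIDES]
    restored.extend(f"{flag}={value}" for flag, value in saved.items() if value is not None)
    return restored
```

Keeping the previous values makes the final "undo" step a lookup rather than guesswork.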
+# Cleanup and re-initialize Mesos replicated log
+
+Get rid of the corrupted files and re-initialize the Mesos replicated log:
+
+* Stop schedulers
+* Delete all files under `-native_log_file_path` on all schedulers
+* Initialize the Mesos replica's log file: `sudo mesos-log initialize --path=<-native_log_file_path>`
+* Start schedulers
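
The delete-and-initialize steps can be wrapped so each scheduler host does exactly the same thing. The following is a minimal sketch. It assumes it is run on each scheduler host while the schedulers are stopped. It defaults to a dry run. It only returns the `mesos-log initialize` command from the steps above rather than executing it. The function name `reset_native_log` is hypothetical.

```python
import shutil
from pathlib import Path


def reset_native_log(native_log_file_path, dry_run=True):
    """Clear everything under -native_log_file_path.

    Returns (entries, init_cmd): the entries that were (or, in a dry run,
    would be) removed, and the argv for re-initializing the replica's log.
    """
    root = Path(native_log_file_path)
    if root == Path("/"):
        raise ValueError("refusing to clear the filesystem root")
    entries = sorted(root.iterdir()) if root.exists() else []
    if not dry_run:
        for entry in entries:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    init_cmd = ["sudo", "mesos-log", "initialize", f"--path={root}"]
    return entries, init_cmd
```

Inspect the dry-run output first. Then call with `dry_run=False`, run the returned command (for example with `subprocess.run(init_cmd, check=True)`), and start the schedulers again.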
+
+# Restore from backup
+
+At this point the scheduler is ready to rehydrate from the backup:
+
+* Identify the leading scheduler by:
+  * examining the `scheduler_lifecycle_LEADER_AWAITING_REGISTRATION` metric at the scheduler
+    `/vars` endpoint. The leader reports 1; all other replicas report 0.
+  * examining the scheduler logs
+  * or examining the ZooKeeper registration under the path defined by `-zk_endpoints`
+    and `-serverset_path`
+
+* Locate the desired backup file, copy it to the leading scheduler's `-backup_dir` folder, and stage
+recovery by running the following command on the leader:
+`aurora_admin scheduler_stage_recovery --bypass-leader-redirect <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm>`
+
+* At this point, the recovery snapshot is staged and available for manual verification/modification
+via the `aurora_admin scheduler_print_recovery_tasks --bypass-leader-redirect` and
+`aurora_admin scheduler_delete_recovery_tasks --bypass-leader-redirect` commands.
+See `aurora_admin help <command>` for usage details.
+
+* Commit recovery. This instructs the scheduler to overwrite the existing Mesos replicated log with
+the provided backup snapshot and initiate a mandatory failover:
+`aurora_admin scheduler_commit_recovery --bypass-leader-redirect <cluster>`
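
The `/vars` check for identifying the leader can be automated across replicas. The following is a minimal sketch. It assumes `/vars` returns plain text with one `name value` pair per line. The function names and any scheduler URLs you pass in are hypothetical.

```python
import urllib.request

LEADER_METRIC = "scheduler_lifecycle_LEADER_AWAITING_REGISTRATION"


def parse_vars(text):
    """Parse /vars output, assumed to be one `name value` pair per line."""
    metrics = {}
    for line in text.splitlines():
        name, _, value = line.strip().partition(" ")
        if name:
            metrics[name] = value.strip()
    return metrics


def reports_leader(vars_text):
    """True if this replica's /vars shows the leader metric set to 1."""
    return parse_vars(vars_text).get(LEADER_METRIC) == "1"


def find_leader(scheduler_urls, timeout=5.0):
    """Return the first scheduler URL whose /vars reports leadership, else None."""
    for url in scheduler_urls:
        with urllib.request.urlopen(f"{url.rstrip('/')}/vars", timeout=timeout) as resp:
            if reports_leader(resp.read().decode("utf-8")):
                return url
    return None
```

If none of the replicas reports 1, fall back to the scheduler logs or the ZooKeeper registration.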
+
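The stage and commit commands above can be chained with a manual checkpoint in between, so that nothing is committed before the staged snapshot has been inspected. The following is a minimal sketch. The argv lists mirror the commands shown in this section. The helper names are hypothetical. `run` and `confirm` are injectable so the flow can be dry-run.

```python
import subprocess


def recovery_commands(cluster, backup_id):
    """argv lists for the stage and commit commands shown above."""
    stage = ["aurora_admin", "scheduler_stage_recovery", "--bypass-leader-redirect", cluster, backup_id]
    commit = ["aurora_admin", "scheduler_commit_recovery", "--bypass-leader-redirect", cluster]
    return stage, commit


def restore(cluster, backup_id, confirm=input, run=subprocess.run):
    """Stage the backup, wait for an explicit 'y', then commit. Returns True if committed."""
    stage, commit = recovery_commands(cluster, backup_id)
    run(stage, check=True)
    # Checkpoint: inspect the staged snapshot (scheduler_print_recovery_tasks)
    # before the replicated log is overwritten.
    answer = confirm(f"Staged {backup_id}. Commit recovery on {cluster}? [y/N] ")
    if answer.strip().lower() != "y":
        return False
    run(commit, check=True)
    return True
```

Run this on the leading scheduler, with `aurora_admin` configured for direct access as described in Preparation.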
+# Cleanup
+Undo any modifications made during the [Preparation](#preparation) sequence.


