aurora-commits mailing list archives

From jfarr...@apache.org
Subject [02/18] incubator-aurora-website git commit: Initial move of website over to git
Date Wed, 17 Dec 2014 18:30:27 GMT
http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/deploying-aurora-scheduler.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/deploying-aurora-scheduler.md b/source/documentation/latest/deploying-aurora-scheduler.md
new file mode 100644
index 0000000..5b6052e
--- /dev/null
+++ b/source/documentation/latest/deploying-aurora-scheduler.md
@@ -0,0 +1,259 @@
+# Deploying the Aurora Scheduler
+
+When setting up your cluster, you will install the scheduler on a small number (usually 3 or 5) of
+machines.  This guide helps you get the scheduler set up and troubleshoot some common hurdles.
+
+- [Installing Aurora](#installing-aurora)
+  - [Creating the Distribution .zip File (Optional)](#creating-the-distribution-zip-file-optional)
+  - [Installing Aurora](#installing-aurora-1)
+- [Configuring Aurora](#configuring-aurora)
+  - [A Note on Configuration](#a-note-on-configuration)
+  - [Replicated Log Configuration](#replicated-log-configuration)
+  - [Initializing the Replicated Log](#initializing-the-replicated-log)
+  - [Storage Performance Considerations](#storage-performance-considerations)
+  - [Network considerations](#network-considerations)
+- [Running Aurora](#running-aurora)
+  - [Maintaining an Aurora Installation](#maintaining-an-aurora-installation)
+  - [Monitoring](#monitoring)
+  - [Running stateful services](#running-stateful-services)
+    - [Dedicated attribute](#dedicated-attribute)
+      - [Syntax](#syntax)
+      - [Example](#example)
+- [Common problems](#common-problems)
+  - [Replicated log not initialized](#replicated-log-not-initialized)
+    - [Symptoms](#symptoms)
+    - [Solution](#solution)
+  - [Scheduler not registered](#scheduler-not-registered)
+    - [Symptoms](#symptoms-1)
+    - [Solution](#solution-1)
+  - [Tasks are stuck in PENDING forever](#tasks-are-stuck-in-pending-forever)
+    - [Symptoms](#symptoms-2)
+    - [Solution](#solution-2)
+
+## Installing Aurora
+The Aurora scheduler is a standalone Java server. As part of the build process it creates a bundle
+of all its dependencies, with the notable exceptions of the JVM and libmesos. Each target server
+should have a JVM (Java 7 or higher) and libmesos (0.20.1) installed.
+
+### Creating the Distribution .zip File (Optional)
+To create a distribution for installation you will need build tools installed. On Ubuntu this can be
+done with `sudo apt-get install build-essential default-jdk`.
+
+    git clone http://git-wip-us.apache.org/repos/asf/incubator-aurora.git
+    cd incubator-aurora
+    ./gradlew distZip
+
+Copy the generated `dist/distributions/aurora-scheduler-*.zip` to each node that will run a scheduler.
+
+### Installing Aurora
+Extract the aurora-scheduler zip file. The example configurations assume it is extracted to
+`/usr/local/aurora-scheduler`.
+
+    sudo unzip dist/distributions/aurora-scheduler-*.zip -d /usr/local
+    sudo ln -nfs "$(ls -dt /usr/local/aurora-scheduler-* | head -1)" /usr/local/aurora-scheduler
+
+## Configuring Aurora
+
+### A Note on Configuration
+Like Mesos, Aurora uses command-line flags for runtime configuration. As such, the Aurora
+"configuration file" is typically a `scheduler.sh` shell script of the following form:
+
+    #!/bin/bash
+    AURORA_HOME=/usr/local/aurora-scheduler
+
+    # Flags controlling the JVM.
+    JAVA_OPTS=(
+      -Xmx2g
+      -Xms2g
+      # GC tuning, etc.
+    )
+
+    # Flags controlling the scheduler.
+    AURORA_FLAGS=(
+      -http_port=8081
+      # Log configuration, etc.
+    )
+
+    # Environment variables controlling libmesos
+    export JAVA_HOME=...
+    export GLOG_v=1
+    export LIBPROCESS_PORT=8083
+
+    JAVA_OPTS="${JAVA_OPTS[*]}" exec "$AURORA_HOME/bin/aurora-scheduler" "${AURORA_FLAGS[@]}"
+
+That way Aurora's current flags are visible in `ps` and in the `/vars` admin endpoint.
+
+Examples are available under `examples/scheduler/`. For a list of available Aurora flags and their
+documentation run
+
+    /usr/local/aurora-scheduler/bin/aurora-scheduler -help
+
+### Replicated Log Configuration
+All Aurora state is persisted to a replicated log. This includes all jobs Aurora is running,
+where in the cluster they are being run, and the configuration for running them, as well as
+other information such as metadata needed to reconnect to the Mesos master, resource quotas,
+and any locks in place.
+
+Aurora schedulers use ZooKeeper to discover log replicas and elect a leader. Only one scheduler is
+leader at a given time - the other schedulers follow log writes and prepare to take over as leader
+but do not communicate with the Mesos master. Either 3 or 5 schedulers are recommended in a
+production deployment depending on failure tolerance and they must have persistent storage.
+
+In a cluster with `N` schedulers, the flag `-native_log_quorum_size` should be set to
+`floor(N/2) + 1`. So in a cluster with 1 scheduler it should be set to `1`, in a cluster with 3 it
+should be set to `2`, and in a cluster of 5 it should be set to `3`.
+
+  Number of schedulers (N) | ```-native_log_quorum_size``` setting (```floor(N/2) + 1```)
+  ------------------------ | -------------------------------------------------------------
+  1                        | 1
+  3                        | 2
+  5                        | 3
+  7                        | 4
+
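+For example, with three schedulers the flag could be set in `scheduler.sh` (a sketch following
+the startup-script form above):
+
+    AURORA_FLAGS=(
+      # 3 schedulers: floor(3/2) + 1 = 2
+      -native_log_quorum_size=2
+      # ...
+    )
+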
+*Incorrectly setting this flag will cause data corruption to occur!*
+
+See [this document](storage-config.md#scheduler-storage-configuration-flags) for more replicated
+log and storage configuration options.
+
+### Initializing the Replicated Log
+Before you start Aurora you will also need to initialize the log on the first master.
+
+    mesos-log initialize --path="$AURORA_HOME/scheduler/db"
+
+Failing to do this will result in the following message when you try to start the scheduler:
+
+    Replica in EMPTY status received a broadcasted recover request
+
+### Storage Performance Considerations
+
+See [this document](/documentation/latest/scheduler-storage/) for scheduler storage performance considerations.
+
+### Network considerations
+The Aurora scheduler listens on two ports - an HTTP port used for client RPCs and a web UI,
+and a libprocess (HTTP+Protobuf) port used to communicate with the Mesos master and for the log
+replication protocol. These can be left unconfigured (the scheduler publishes all selected ports
+to ZooKeeper) or explicitly set in the startup script as follows:
+
+    # ...
+    AURORA_FLAGS=(
+      # ...
+      -http_port=8081
+      # ...
+    )
+    # ...
+    export LIBPROCESS_PORT=8083
+    # ...
+
+## Running Aurora
+Configure a supervisor like [Monit](http://mmonit.com/monit/) or
+[supervisord](http://supervisord.org/) to run the created `scheduler.sh` file and restart it
+whenever it fails. Aurora expects to be restarted by an external process when it fails. Aurora
+supports an active health checking protocol on its admin HTTP interface - if a `GET /health` times
+out or returns anything other than `200 OK` the scheduler process is unhealthy and should be
+restarted.
+
+For example, monit can be configured with
+
+    if failed port 8081 send "GET /health HTTP/1.0\r\n" expect "OK\n" with timeout 2 seconds for 10 cycles then restart
+
+assuming you set `-http_port=8081`.
+
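+As a sketch, an equivalent supervisord entry might look like the following (assuming the startup
+script is installed at `/usr/local/aurora-scheduler/scheduler.sh`; note that supervisord only
+restarts the process when it exits, so the HTTP health check above would still need an external
+checker such as the monit rule):
+
+    [program:aurora-scheduler]
+    command=/usr/local/aurora-scheduler/scheduler.sh
+    autorestart=true
+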
+### Maintaining an Aurora Installation
+
+### Monitoring
+Please see our dedicated [monitoring guide](/documentation/latest/monitoring/) for in-depth discussion on monitoring.
+
+### Running stateful services
+Aurora is best suited to run stateless applications, but it can also accommodate stateful services
+like databases, or services that otherwise need to always run on the same machines.
+
+#### Dedicated attribute
+The Mesos slave has the `--attributes` command line argument which can be used to mark a slave with
+static attributes (not to be confused with `--resources`, which are dynamic and accounted).
+
+Aurora makes these attributes available for matching with scheduling
+[constraints](configuration-reference.md#specifying-scheduling-constraints).  Most of these
+constraints are arbitrary and available for custom use.  There is one exception, though: the
+`dedicated` attribute.  Aurora treats this attribute specially: it will only schedule matching
+jobs on these machines, and will only allow matching jobs to run on them.
+
+##### Syntax
+The dedicated attribute has semantic meaning. The format is `$role(/.*)?`. When a job is created,
+the scheduler requires that the `$role` component matches the `role` field in the job
+configuration, and will reject the job creation otherwise.  The remainder of the attribute is
+free-form. We've developed the idiom of formatting this attribute as `$role/$job`, but do not
+enforce this.
+
+##### Example
+Consider the following slave command line:
+
+    mesos-slave --attributes="host:$HOST;rack:$RACK;dedicated:db_team/redis" ...
+
+And this job configuration:
+
+    Service(
+      name = 'redis',
+      role = 'db_team',
+      constraints = {
+        'dedicated': 'db_team/redis'
+      }
+      ...
+    )
+
+The job configuration indicates that it should only be scheduled on slaves with the attribute
+`dedicated:db_team/redis`.  Additionally, Aurora will prevent any tasks that do _not_ have that
+constraint from running on those slaves.
+
+
+## Common problems
+So you've started your first cluster and are running into some issues? We've collected some common
+stumbling blocks and solutions here to help get you moving.
+
+### Replicated log not initialized
+
+#### Symptoms
+- Scheduler RPCs and web interface claim `Storage is not READY`
+- Scheduler log repeatedly prints messages like
+
+  ```
+  I1016 16:12:27.234133 26081 replica.cpp:638] Replica in EMPTY status
+  received a broadcasted recover request
+  I1016 16:12:27.234256 26084 recover.cpp:188] Received a recover response
+  from a replica in EMPTY status
+  ```
+
+#### Solution
+When you create a new cluster, you need to inform a quorum of schedulers that they are safe to
+consider their database to be empty by [initializing](#initializing-the-replicated-log) the
+replicated log. This is done to prevent the scheduler from modifying the cluster state in the event
+of multiple simultaneous disk failures or, more likely, misconfiguration of the replicated log path.
+
+### Scheduler not registered
+
+#### Symptoms
+Scheduler log contains
+
+    Framework has not been registered within the tolerated delay.
+
+#### Solution
+Double-check that the scheduler is configured correctly to reach the master. If you are registering
+the master in ZooKeeper, make sure the command line argument to the master:
+
+    --zk=zk://$ZK_HOST:2181/mesos/master
+
+is the same as the one on the scheduler:
+
+    -mesos_master_address=zk://$ZK_HOST:2181/mesos/master
+
+### Tasks are stuck in `PENDING` forever
+
+#### Symptoms
+The scheduler is registered, and [receiving offers](docs/monitoring.md#scheduler_resource_offers),
+but tasks are perpetually shown as `PENDING - Constraint not satisfied: host`.
+
+#### Solution
+Check that your slaves are configured with `host` and `rack` attributes.  Aurora requires that
+slaves are tagged with these two common failure domains to ensure that it can safely place tasks
+such that jobs are resilient to failure.
+
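+For example, mirroring the slave command line shown earlier (a sketch; substitute your own host
+and rack values):
+
+    mesos-slave --attributes="host:$HOST;rack:$RACK" ...
+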
+See our [vagrant example](examples/vagrant/upstart/mesos-slave.conf) for details.

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/developing-aurora-client.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/developing-aurora-client.md b/source/documentation/latest/developing-aurora-client.md
new file mode 100644
index 0000000..923866c
--- /dev/null
+++ b/source/documentation/latest/developing-aurora-client.md
@@ -0,0 +1,122 @@
+Getting Started
+===============
+
+Aurora consists of four main pieces: the scheduler (which finds resources in the cluster that can be used to run a job), the executor (which uses the resources assigned by the scheduler to run a job), the command-line client, and the web UI. For information about working on the scheduler or the web UI, see the file "developing-aurora-scheduler.md" in this directory.
+
+If you want to work on the command-line client, this is the place for you!
+
+The client is written in Python, and unlike the server side of things, we build the client using the Pants build tool instead of Gradle. Pants is a tool that was built by Twitter for handling builds of large collaborative systems. You can see a detailed explanation of
+Pants [here](http://pantsbuild.github.io/python-readme.html).
+
+To build the client executable, run the following in a command-shell:
+
+    $ ./pants src/main/python/apache/aurora/client/cli:aurora2
+
+This will produce a python executable _pex_ file in `dist/aurora2.pex`. Pex files
+are fully self-contained executables: just copy the pex file into your path, and you'll be able to run it. For example, for a typical installation:
+
+    $ cp dist/aurora2.pex /usr/local/bin/aurora
+
+To run all of the client tests:
+
+    $ ./pants src/test/python/apache/aurora/client/:all
+
+
+Client Configuration
+====================
+
+The client uses a configuration file that specifies available clusters. More information about the
+contents of this file can be found in the
+[Client Cluster Configuration](/documentation/latest/client-cluster-configuration/) documentation. Information about
+how the client locates this file can be found in the
+[Client Commands](client-commands.md#cluster-configuration) documentation.
+
+Client Versions
+===============
+
+There are currently two versions of the Aurora client, imaginatively known as v1 and v2. All new development is done entirely in v2, but we continue to support and fix bugs in v1 until we get to the point where v2 is feature-complete and tested, and Aurora users have had some time to adapt and switch their processes to use v2.
+
+Both versions are built on the same underlying API code.
+
+Client v1 was implemented using twitter.common.app. The command-line processing code for v1 can be found in `src/main/python/apache/aurora/client/commands` and
+`src/main/python/apache/aurora/client/bin`.
+
+Client v2 was implemented using its own noun/verb framework. The client v2 code can be found in `src/main/python/apache/aurora/client/cli`, and the noun/verb framework can be
+found in the `__init__.py` file in that directory.
+
+
+Building and Testing the Client
+===============================
+
+Building and testing the client code are both done using Pants. The relevant targets to know about are:
+
+   * Build a client v2 executable: `./pants src/main/python/apache/aurora/client/cli:aurora2`
+   * Test client v2 code: `./pants src/test/python/apache/aurora/client/cli:all`
+   * Build a client v1 executable: `./pants src/main/python/apache/aurora/client/bin:aurora_client`
+   * Test client v1 code: `./pants src/main/python/apache/aurora/client/commands:all`
+   * Test all client code: `./pants src/main/python/apache/aurora/client:all`
+
+
+Overview of the Client Architecture
+===================================
+
+The client is built on a stacked architecture:
+
+  1. At the lowest level, we have a thrift RPC API interface
+    to the Aurora scheduler. The interface is declared in thrift, in the file
+    `src/main/thrift/org/apache/aurora/gen/api.thrift`.
+
+  2. On top of the primitive API, we have a client API. The client API
+    takes the primitive operations provided by the scheduler, and uses them
+    to implement client-side behaviors. For example, updating a job on the
+    scheduler is done by a sequence of operations.  The sequence is implemented
+    by the client API `update` method, which does the following using the thrift API:
+     * fetching the state of task instances in the mesos cluster, and figuring out which need
+       to be updated;
+     * for each task to be updated:
+         - killing the old version;
+         - starting the new version;
+         - monitoring the new version to ensure that the update succeeded.
+  3. On top of the API, we have the command-line client itself. The core client, at this level,
+    consists of the interface to the command-line which the user will use to interact with Aurora.
+    The client v2 code is found in `src/main/python/apache/aurora/client/cli`. In the `cli` directory,
+    the rough structure is as follows:
+       * `__init__.py` contains the noun/verb command-line processing framework used by client v2.
+       * `jobs.py` contains the implementation of the core `job` noun, and all of its operations.
+       * `bridge.py` contains the implementation of a component that allows us to ship a
+         combined client that runs both v1 and v2 client commands during the transition period.
+       * `client.py` contains the code that binds the client v2 nouns and verbs into an executable.
+
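+To make the noun/verb idea concrete, the following is a hypothetical, highly simplified sketch of
+the dispatch pattern (the real framework in `__init__.py` is considerably richer):
+
+    def dispatch(nouns, argv):
+        """Route 'aurora <noun> <verb> ...' to the registered verb handler."""
+        noun, verb, rest = argv[0], argv[1], argv[2:]
+        return nouns[noun][verb](rest)
+
+    # Hypothetical registry; real nouns and verbs are classes, not lambdas.
+    nouns = {
+        'job': {
+            'create': lambda args: 'creating %s' % args[0],
+            'kill': lambda args: 'killing %s' % args[0],
+        },
+    }
+
+    print(dispatch(nouns, ['job', 'create', 'cluster1/role/env/app']))
+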
+Running/Debugging the Client
+============================
+
+For manually testing client changes against a cluster, we use vagrant. To start a virtual cluster,
+you need to install a working vagrant environment, and then run `vagrant up` from the root of
+the aurora workspace. This will create a vagrant host named "devcluster", with a mesos master,
+a set of mesos slaves, and an aurora scheduler.
+
+To use the devcluster, you need to bring it up by running `vagrant up`, and then connect to the vagrant host using `vagrant ssh`. This will open a bash session on the virtual machine hosting the devcluster. In the home directory, there are two key paths to know about:
+
+   * `~/aurora`: this is a copy of the git workspace in which you launched the vagrant cluster.
+     To test client changes, you'll use this copy.
+   * `/vagrant`: this is a mounted filesystem that's a direct image of your git workspace.
+     This isn't a copy - it is your git workspace. Editing files on your host machine will
+     be immediately visible here, because they are the same files.
+
+Whenever the scheduler is modified, to update your vagrant environment to use the new scheduler,
+you'll need to re-initialize your vagrant images. To do this, you need to run two commands:
+
+   * `vagrant destroy`: this will delete the old devcluster image.
+   * `vagrant up`: this creates a fresh devcluster image based on the current state of your workspace.
+
+You should try to minimize rebuilding vagrant images; it's not horribly slow, but it does take a while.
+
+To test client changes:
+
+   * Make a change in your local workspace, and commit it.
+   * `vagrant ssh` into the devcluster.
+   * `cd aurora`
+   * Pull your changes into the vagrant copy: `git pull /vagrant *branchname*`.
+   * Build the modified client using pants.
+   * Run your command using `aurora2`. (You don't need to do any install; the aurora2 command
+     is a symbolic link to the executable generated by pants.)
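+
+Putting those steps together, a session might look like this (the branch name `mybranch` is
+hypothetical):
+
+    $ vagrant ssh
+    vagrant@devcluster:~$ cd aurora
+    vagrant@devcluster:~/aurora$ git pull /vagrant mybranch  # hypothetical branch name
+    vagrant@devcluster:~/aurora$ ./pants src/main/python/apache/aurora/client/cli:aurora2
+    vagrant@devcluster:~/aurora$ aurora2 ...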

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/developing-aurora-scheduler.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/developing-aurora-scheduler.md b/source/documentation/latest/developing-aurora-scheduler.md
new file mode 100644
index 0000000..7f6cc2e
--- /dev/null
+++ b/source/documentation/latest/developing-aurora-scheduler.md
@@ -0,0 +1,108 @@
+Java code in the aurora repo is built with [Gradle](http://gradle.org).
+
+Getting Started
+===============
+You will need Java 7 installed and on your `PATH` or unzipped somewhere with `JAVA_HOME` set. Then
+
+    ./gradlew tasks
+
+will bootstrap the build system and show available tasks. This can take a while the first time you
+run it but subsequent runs will be much faster due to cached artifacts.
+
+Running the Tests
+-----------------
+Aurora has a comprehensive unit test suite. To run the tests use
+
+    ./gradlew build
+
+Gradle will only re-run tests when their dependencies have changed. To force a re-run of all
+tests use
+
+    ./gradlew clean build
+
+Running the build with code quality checks
+------------------------------------------
+To speed up development iteration, the plain gradle commands will not run static analysis tools.
+However, you should run static analysis before posting a review diff, and **always** run it before
+pushing a commit to origin/master:
+
+    ./gradlew build -Pq
+
+Creating a bundle for deployment
+--------------------------------
+Gradle can create a zip file containing Aurora, all of its dependencies, and a launch script with
+
+    ./gradlew distZip
+
+or a tar file containing the same files with
+
+    ./gradlew distTar
+
+The output file will be written to `dist/distributions/aurora-scheduler.zip` or
+`dist/distributions/aurora-scheduler.tar`.
+
+Developing Aurora Java code
+===========================
+
+Setting up an IDE
+-----------------
+Gradle can generate project files for your IDE. To generate an IntelliJ IDEA project run
+
+    ./gradlew idea
+
+and import the generated `aurora.ipr` file.
+
+Adding or Upgrading a Dependency
+--------------------------------
+New dependencies can be added from Maven central by adding a `compile` dependency to `build.gradle`.
+For example, to add a dependency on `com.example`'s `example-lib` 1.0 add this block:
+
+    compile 'com.example:example-lib:1.0'
+
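+That is, inside the `dependencies` block of `build.gradle` (a sketch; `com.example:example-lib:1.0`
+is the placeholder coordinate from above):
+
+    dependencies {
+      // replace the placeholder coordinate with the real group:artifact:version
+      compile 'com.example:example-lib:1.0'
+    }
+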
+NOTE: Anyone thinking about adding a new dependency should first familiarize themselves with the
+Apache Foundation's third-party licensing
+[policy](http://www.apache.org/legal/resolved.html#category-x).
+
+Developing Aurora UI
+======================
+
+Installing bower (optional)
+----------------------------
+Third party JS libraries used in Aurora (located at 3rdparty/javascript/bower_components) are
+managed by bower, a JS dependency manager. Bower is only required if you plan to add, remove or
+update JS libraries. Bower can be installed using the following command:
+
+    npm install -g bower
+
+Bower depends on node.js and npm. The easiest way to install node on a Mac is via brew:
+
+    brew install node
+
+For more node.js installation options refer to https://github.com/joyent/node/wiki/Installation.
+
+More info on installing and using bower can be found at: http://bower.io/. Once installed, you can
+use the following commands to view and modify the bower repo at
+`3rdparty/javascript/bower_components`:
+
+    bower list
+    bower install <library name>
+    bower remove <library name>
+    bower update <library name>
+    bower help
+
+Developing the Aurora Build System
+==================================
+
+Bootstrapping Gradle
+--------------------
+The following files were autogenerated by `gradle wrapper` using gradle 1.8's
+[Wrapper](http://www.gradle.org/docs/1.8/dsl/org.gradle.api.tasks.wrapper.Wrapper.html) plugin and
+should not be modified directly:
+
+    ./gradlew
+    ./gradlew.bat
+    ./gradle/wrapper/gradle-wrapper.jar
+    ./gradle/wrapper/gradle-wrapper.properties
+
+To upgrade Gradle unpack the new version somewhere, run `/path/to/new/gradle wrapper` in the
+repository root and commit the changed files.

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/hooks.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/hooks.md b/source/documentation/latest/hooks.md
new file mode 100644
index 0000000..533c81d
--- /dev/null
+++ b/source/documentation/latest/hooks.md
@@ -0,0 +1,244 @@
+# Hooks for Aurora Client API
+
+- [Introduction](#introduction)
+- [Hook Types](#hook-types)
+- [Execution Order](#execution-order)
+- [Hookable Methods](#hookable-methods)
+- [Activating and Using Hooks](#activating-and-using-hooks)
+- [.aurora Config File Settings](#aurora-config-file-settings)
+- [Command Line](#command-line)
+- [Hooks Protocol](#hooks-protocol)
+  - [pre_ Methods](#pre_-methods)
+  - [err_ Methods](#err_-methods)
+  - [post_ Methods](#post_-methods)
+- [Generic Hooks](#generic-hooks)
+- [Hooks Process Checklist](#hooks-process-checklist)
+
+## Introduction
+
+You can execute hook methods around Aurora API Client methods when they are called by the Aurora Command Line commands.
+
+Explaining how hooks work is a bit tricky because of some indirection about what they apply to. Basically, a hook is code that executes when a particular Aurora Client API method runs, letting you extend the method's actions. The hook executes on the client side, specifically on the machine executing Aurora commands.
+
+The catch is that hooks are associated with Aurora Client API methods, which users don't directly call. Instead, users call Aurora Command Line commands, which call Client API methods during their execution. Since which hooks run depends on which Client API methods get called, you will need to know which Command Line commands call which API methods. Later on, there is a table showing the various associations.
+
+**Terminology Note**: From now on, "method(s)" refer to Client API methods, and "command(s)" refer to Command Line commands.
+
+## Hook Types
+
+Hooks have three basic types, differing by when they run with respect to their associated method.
+
+`pre_<method_name>`: When its associated method is called, the `pre_` hook executes first, then the called method. If the `pre_` hook fails, the method never runs. Code that expected the method to succeed may then fail, which can terminate the Aurora client.
+
+Note that a `pre_` hook can error-trap internally so it does not
+return `False`. Designers/contributors of new `pre_` hooks should
+consider whether or not to error-trap them. You can error trap at the
+highest level very generally and always pass the `pre_` hook by
+returning `True`. For example:
+
+```python
+def pre_create(...):
+  do_something()  # if do_something fails with an exception, the create_job is not attempted!
+  return True
+
+# However...
+def pre_create(...):
+  try:
+    do_something()  # may cause exception
+  except Exception:  # generic error trap will catch it
+    pass  # and ignore the exception
+  return True  # create_job will run in any case!
+```
+
+`post_<method_name>`: A `post_` hook executes after its associated method successfully finishes running. If it fails, the already executed method is unaffected. A `post_` hook's error is trapped, and any later operations are unaffected.
+
+`err_<method_name>`: Executes only when its associated method returns a status other than OK or throws an exception. If an `err_` hook fails, the already executed method is unaffected. An `err_` hook's error is trapped, and any later operations are unaffected.
+
+## Execution Order
+
+A command with `pre_`, `post_`, and `err_` hooks defined and activated for its called method executes in the following order when the method successfully executes:
+
+1. Command called
+2. Command code executes
+3. Method called
+4. `pre_` method hook runs
+5. Method runs and successfully finishes
+6. `post_` method hook runs
+7. Command code executes
+8. Command execution ends
+
+The following is what happens when, for the same command and hooks, the method associated with the command suffers an error and does not successfully finish executing:
+
+1. Command called
+2. Command code executes
+3. Method called
+4. `pre_` method hook runs
+5. Method runs and fails
+6. `err_` method hook runs
+7. Command code executes (if the `err_` hook does not end the command execution)
+8. Command execution ends
+
+Note that the `post_` and `err_` hooks for the same method can never both run for a single execution of that method.
+
+## Hookable Methods
+
+You can associate `pre_`, `post_`, and `err_` hooks with the following methods. Since you do not directly interact with the methods, but rather the Aurora Command Line commands that call them, for each method we also list the command(s) that can call the method. Note that a different method or methods may be called by a command depending on how the command's other code executes. Similarly, multiple commands can call the same method. We also list the methods' argument signatures, which are used by their associated hooks. <a name="Chart"></a>
+
+  Aurora Client API Method | Client API Method Argument Signature | Aurora Command Line Command
+  -------------------------| ------------------------------------- | ---------------------------
+  ```cancel_update``` | ```self```, ```job_key``` | ```cancel_update```
+  ```create_job``` | ```self```, ```config``` | ```create```, ```runtask```
+  ```restart``` | ```self```, ```job_key```, ```shards```, ```update_config```, ```health_check_interval_seconds``` | ```restart```
+  ```update_job``` | ```self```, ```config```, ```health_check_interval_seconds=3```, ```shards=None``` | ```update```
+  ```kill_job``` | ```self```, ```job_key```, ```shards=None``` |  ```kill```
+
+Some specific examples:
+
+* `pre_create_job` executes when a `create_job` method is called, and before the `create_job` method itself executes.
+
+* `post_cancel_update` executes after a `cancel_update` method has successfully finished running.
+
+* `err_kill_job` executes when the `kill_job` method is called, but doesn't successfully finish running.
+
+## Activating and Using Hooks
+
+By default, hooks are inactive. If you do not want to use hooks, you do not need to make any changes to your code. If you do want to use hooks, you will need to alter your `.aurora` config file to activate them both for the configuration as a whole as well as for individual `Job`s. And, of course, you will need to define in your config file what happens when a particular hook executes.
+
+## .aurora Config File Settings
+
+You can define a top-level `hooks` variable in any `.aurora` config file. `hooks` is a list of all objects that define hooks used by `Job`s defined in that config file. If you do not want to define any hooks for a configuration, `hooks` is optional.
+
+    hooks = [Object_with_defined_hooks1, Object_with_defined_hooks2]
+
+Be careful when assembling a config file using `include` on multiple smaller config files. If there are multiple files that assign a value to `hooks`, only the last assignment made will stick. For example, if `x.aurora` has `hooks = [a, b, c]` and `y.aurora` has `hooks = [d, e, f]` and `z.aurora` has, in this order, `include x.aurora` and `include y.aurora`, the `hooks` value will be `[d, e, f]`.
+
+Also, for any `Job` that you want to use hooks with, its `Job` definition in the `.aurora` config file must set an `enable_hooks` flag to `True` (it defaults to `False`). By default, hooks are disabled and you must enable them for `Job`s of your choice.
+
+To summarize, to use hooks for a particular job, you must both activate hooks for your config file as a whole, and for that job. Activating hooks only for individual jobs won't work, nor will only activating hooks for your config file as a whole. You must also specify the hooks' defining object in the `hooks` variable.
+
+Recall that `.aurora` config files are written in Pystachio. So the following turns on hooks for production jobs at cluster1 and cluster2, but leaves them off for similar jobs with a defined user role. Of course, you also need to list the objects that define the hooks in your config file's `hooks` variable.
+
+```python
+jobs = [
+    Job(enable_hooks = True, cluster = c, env = 'prod') for c in ('cluster1', 'cluster2')
+]
+# Hooks are left disabled for these role-scoped jobs:
+jobs.extend(
+    Job(cluster = c, env = 'prod', role = getpass.getuser()) for c in ('cluster1', 'cluster2'))
+```
+
+## Command Line
+
+All Aurora Command Line commands now accept an `.aurora` config file as an optional parameter (some, of course, accept it as a required parameter). Whenever a command has a `.aurora` file parameter, any hooks specified and activated in the `.aurora` file can be used. For example:
+
+    aurora restart cluster1/role/env/app myapp.aurora
+
+The command activates any hooks specified and activated in `myapp.aurora`. For the `restart` command, that is the only thing the `myapp.aurora` parameter does. So, if the command was the following, since there is no `.aurora` config file to specify any hooks, no hooks on the `restart` command can run.
+
+    aurora restart cluster1/role/env/app
+
+## Hooks Protocol
+
+Any object defined in the `.aurora` config file can define hook methods. You should define your hook methods within a class, and then use the class name as a value in the `hooks` list in your config file.
+
+Note that you can define other methods in the class that its hook methods can call; not all of a hook's logic has to live in the hook method itself.
+
+The following example defines a class containing a `pre_kill_job` hook definition that calls another method defined in the class.
+
+```python
+# Defines a method pre_kill_job
+class KillConfirmer(object):
+  def confirm(self, msg):
+    return raw_input(msg).lower() == 'yes'
+
+  def pre_kill_job(self, job_key, shards=None):
+    shards = ('shards %s' % shards) if shards is not None else 'all shards'
+    return self.confirm('Are you sure you want to kill %s (%s)? (yes/no): '
+                        % (job_key, shards))
+```
+
+### pre_ Methods
+
+`pre_` methods have the signature:
+
+    pre_<API method name>(self, <associated method's signature>)
+
+`pre_` methods have the same signature as their associated method, with the addition of `self` as the first parameter. See the [chart](#Chart) above for the mapping of parameters to methods. When writing `pre_` methods, you can use the `*` and `**` syntax to designate that all unspecified parameters are passed in a list to the `*`ed variable and all named parameters with values are passed as name/value pairs to the `**`ed variable.
+
+If this method returns `False`, the API command call aborts.
+
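+A minimal sketch using the catch-all form (hypothetical; it logs and always allows the method to
+proceed):
+
+```python
+def pre_create_job(self, *args, **kw):
+  print('create_job called with %d positional args' % len(args))
+  return True  # returning False here would abort the create_job call
+```
+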
+### err_ Methods
+
+`err_` methods have the signature:
+
+    err_<API method name>(self, exc, <associated method's signature>)
+
+`err_` methods have the same signature as their associated method, with the addition of a first parameter `self` and a second parameter `exc`. `exc` is either a result with responseCode other than `ResponseCode.OK` or an `Exception`. See the [chart](#Chart) above for the mapping of parameters to methods. When writing `err_` methods, you can use the `*` and `**` syntax to designate that all unspecified parameters are passed in a list to the `*`ed variable and all named parameters with values are passed as name/value pairs to the `**`ed variable.
+
+`err_` method return codes are ignored.
+
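+For example, a hypothetical `err_kill_job` using the signature from the chart:
+
+```python
+def err_kill_job(self, exc, job_key, shards=None):
+  # exc is the failed result or the Exception raised by kill_job.
+  print('kill_job of %s failed: %s' % (job_key, exc))
+```
+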
+### post_ Methods
+
+`post_` methods have the signature:
+
+    post_<API method name>(self, result, <associated method signature>)
+
+`post_` method parameters are `self`, then `result`, followed by the same parameter signature as their associated method. `result` is the result of the associated method call. See the [chart](#chart) above for the mapping of parameters to methods. When writing `post_` methods, you can use the `*` and `**` syntax to designate that all unspecified arguments are passed in a list to the `*`ed parameter and all unspecified named arguments with values are passed as name/value pairs to the `**`ed parameter.
+
+`post_` method return codes are ignored.
+
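+For example, a hypothetical `post_create_job` using the signature from the chart:
+
+```python
+def post_create_job(self, result, config):
+  # result is the return value of the successful create_job call.
+  print('create_job finished with result: %s' % result)
+```
+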
+## Generic Hooks
+
+There are five Aurora API Methods which any of the three hook types can attach to. Thus, there are 15 possible hook/method combinations for a single `.aurora` config file. Say that you define `pre_` and `post_` hooks for the `restart` method. That leaves 13 undefined hook/method combinations: `err_restart`, plus the 3 `pre_`, `post_`, and `err_` hooks for each of the other 4 hookable methods. You can define what happens when any of these otherwise undefined 13 hooks execute via a generic hook, whose signature is:
+
+```python
+generic_hook(self, hook_config, event, method_name, result_or_err, *args, **kw)
+```
+
+where:
+
+* `hook_config` is a named tuple of `config` (the Pystachio `config` object) and `job_key`.
+
+* `event` is one of `pre`, `err`, or `post`, indicating which type of hook the generic hook is standing in for. For example, assume no specific hooks were defined for the `restart` API command. If `generic_hook` is defined and activated, and `restart` is called, `generic_hook` will effectively run as `pre_restart`, `post_restart`, and `err_restart`. You can use a selection statement on this value so that `generic_hook` will act differently based on whether it is standing in for a `pre_`, `post_`, or `err_` hook.
+
+* `method_name` is the Client API method name whose execution is causing this execution of the `generic_hook`.
+
+* `*args`, `**kw` are the API method arguments and keyword arguments respectively.
+* `result_or_err` is a tri-state parameter taking one of these three values:
+  1. `None` for `pre_` hooks
+  2. `result` for `post_` hooks
+  3. `exc` for `err_` hooks
+
+Example:
+
+```python
+# Overrides the standard do-nothing generic_hook by adding a log writing operation.
+from twitter.common import log
+
+class Logger(object):
+  '''Adds to the log every time a hookable API method is called.'''
+  def generic_hook(self, hook_config, event, method_name, result_or_err, *args, **kw):
+    log.info('%s: %s_%s of %s'
+             % (self.__class__.__name__, event, method_name, hook_config.job_key))
+```
+
+## Hooks Process Checklist
+
+1. In your `.aurora` config file, add a `hooks` variable. Note that you may want to define a `.aurora` file only for hook definitions and then include this file in multiple other config files that you want to use the same hooks.
+
+```python
+hooks = []
+```
+
+2. In the `hooks` variable, list all objects that define hooks used by `Job`s defined in this config:
+
+```python
+hooks = [Object_hook_definer1, Object_hook_definer2]
+```
+
+3. For each job that uses hooks in this config file, add `enable_hooks = True` to the `Job` definition. Note that this is necessary even if you only want to use the generic hook.
+
+4. Write your `pre_`, `post_`, and `err_` hook definitions as part of an object definition in your `.aurora` config file.
+
+5. If desired, write your `generic_hook` definition as part of an object definition in your `.aurora` config file. Remember, the object must be listed as a member of `hooks`.
+
+6. If your Aurora command line command does not otherwise take an `.aurora` config file argument, add the appropriate `.aurora` file as an argument in order to define and activate the configuration's hooks.

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/CPUavailability.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/CPUavailability.png b/source/documentation/latest/images/CPUavailability.png
new file mode 100644
index 0000000..eb4f5c1
Binary files /dev/null and b/source/documentation/latest/images/CPUavailability.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/HelloWorldJob.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/HelloWorldJob.png b/source/documentation/latest/images/HelloWorldJob.png
new file mode 100644
index 0000000..7a89575
Binary files /dev/null and b/source/documentation/latest/images/HelloWorldJob.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/RoleJobs.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/RoleJobs.png b/source/documentation/latest/images/RoleJobs.png
new file mode 100644
index 0000000..d41ee0b
Binary files /dev/null and b/source/documentation/latest/images/RoleJobs.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/ScheduledJobs.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/ScheduledJobs.png b/source/documentation/latest/images/ScheduledJobs.png
new file mode 100644
index 0000000..21bfcae
Binary files /dev/null and b/source/documentation/latest/images/ScheduledJobs.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/TaskBreakdown.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/TaskBreakdown.png b/source/documentation/latest/images/TaskBreakdown.png
new file mode 100644
index 0000000..125a4e9
Binary files /dev/null and b/source/documentation/latest/images/TaskBreakdown.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/aurora_hierarchy.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/aurora_hierarchy.png b/source/documentation/latest/images/aurora_hierarchy.png
new file mode 100644
index 0000000..03d596d
Binary files /dev/null and b/source/documentation/latest/images/aurora_hierarchy.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/killedtask.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/killedtask.png b/source/documentation/latest/images/killedtask.png
new file mode 100644
index 0000000..b173698
Binary files /dev/null and b/source/documentation/latest/images/killedtask.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/lifeofatask.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/lifeofatask.png b/source/documentation/latest/images/lifeofatask.png
new file mode 100644
index 0000000..e94d160
Binary files /dev/null and b/source/documentation/latest/images/lifeofatask.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/runningtask.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/runningtask.png b/source/documentation/latest/images/runningtask.png
new file mode 100644
index 0000000..7f7553c
Binary files /dev/null and b/source/documentation/latest/images/runningtask.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/stderr.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/stderr.png b/source/documentation/latest/images/stderr.png
new file mode 100644
index 0000000..b83ccfa
Binary files /dev/null and b/source/documentation/latest/images/stderr.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/stdout.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/stdout.png b/source/documentation/latest/images/stdout.png
new file mode 100644
index 0000000..fb4e0b7
Binary files /dev/null and b/source/documentation/latest/images/stdout.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/images/storage_hierarchy.png
----------------------------------------------------------------------
diff --git a/source/documentation/latest/images/storage_hierarchy.png b/source/documentation/latest/images/storage_hierarchy.png
new file mode 100644
index 0000000..621d2d0
Binary files /dev/null and b/source/documentation/latest/images/storage_hierarchy.png differ

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/monitoring.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/monitoring.md b/source/documentation/latest/monitoring.md
new file mode 100644
index 0000000..8aee669
--- /dev/null
+++ b/source/documentation/latest/monitoring.md
@@ -0,0 +1,206 @@
+# Monitoring your Aurora cluster
+
+Before you start running important services in your Aurora cluster, it's important to set up
+monitoring and alerting of Aurora itself.  Most of your monitoring can be against the scheduler,
+since it will give you a global view of what's going on.
+
+## Reading stats
+The scheduler exposes a *lot* of instrumentation data via its HTTP interface. You can get a quick
+peek at the first few of these in our vagrant image:
+
+    $ vagrant ssh -c 'curl -s localhost:8081/vars | head'
+    async_tasks_completed 1004
+    attribute_store_fetch_all_events 15
+    attribute_store_fetch_all_events_per_sec 0.0
+    attribute_store_fetch_all_nanos_per_event 0.0
+    attribute_store_fetch_all_nanos_total 3048285
+    attribute_store_fetch_all_nanos_total_per_sec 0.0
+    attribute_store_fetch_one_events 3391
+    attribute_store_fetch_one_events_per_sec 0.0
+    attribute_store_fetch_one_nanos_per_event 0.0
+    attribute_store_fetch_one_nanos_total 454690753
+
+These values are served as `Content-Type: text/plain`, with each line containing a space-separated metric
+name and value. Values may be integers, doubles, or strings (note: strings are static, others
+may be dynamic).
+
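+Because the format is so simple, scraping it takes only a few lines. A minimal sketch (assuming
+Python 2, matching the era of this codebase; `read_vars` is a hypothetical helper name):
+
+    import urllib2
+
+    def read_vars(host='localhost', port=8081):
+        """Fetch /vars and return a dict of metric name -> raw string value."""
+        body = urllib2.urlopen('http://%s:%d/vars' % (host, port)).read()
+        metrics = {}
+        for line in body.splitlines():
+            name, _, value = line.partition(' ')
+            metrics[name] = value
+        return metrics
+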
+If your monitoring infrastructure prefers JSON, the scheduler exports that as well:
+
+    $ vagrant ssh -c 'curl -s localhost:8081/vars.json | python -mjson.tool | head'
+    {
+        "async_tasks_completed": 1009,
+        "attribute_store_fetch_all_events": 15,
+        "attribute_store_fetch_all_events_per_sec": 0.0,
+        "attribute_store_fetch_all_nanos_per_event": 0.0,
+        "attribute_store_fetch_all_nanos_total": 3048285,
+        "attribute_store_fetch_all_nanos_total_per_sec": 0.0,
+        "attribute_store_fetch_one_events": 3409,
+        "attribute_store_fetch_one_events_per_sec": 0.0,
+        "attribute_store_fetch_one_nanos_per_event": 0.0,
+
+This will be the same data as above, served with `Content-Type: application/json`.
+
+## Viewing live stat samples on the scheduler
+The scheduler uses the Twitter commons stats library, which keeps an internal time-series database
+of exported variables - nearly everything in `/vars` is available for instant graphing.  This is
+useful for debugging, but is not a replacement for an external monitoring system.
+
+You can view these graphs on a scheduler at `/graphview`.  It supports some composition and
+aggregation of values, which can be invaluable when triaging a problem.  For example, if you have
+the scheduler running in vagrant, check out these links:
+* [simple graph](http://192.168.33.7:8081/graphview?query=jvm_uptime_secs)
+* [complex composition](http://192.168.33.7:8081/graphview?query=rate\(scheduler_log_native_append_nanos_total\)%2Frate\(scheduler_log_native_append_events\)%2F1e6)
+
+### Counters and gauges
+Among numeric stats, there are two fundamental types of stats exported: _counters_ and _gauges_.
+Counters are guaranteed to be monotonically-increasing for the lifetime of a process, while gauges
+may decrease in value.  Aurora uses counters to represent things like the number of times an event
+has occurred, and gauges to capture things like the current length of a queue.  Counters are a
+natural fit for accurate composition into [rate ratios](http://en.wikipedia.org/wiki/Rate_ratio)
+(useful for sample-resistant latency calculation), while gauges are not.
+
+# Alerting
+
+## Quickstart
+If you are looking for just bare-minimum alerting to get something in place quickly, set up alerting
+on `framework_registered` and `task_store_LOST`. These will give you a decent picture of overall
+health.
+
+## A note on thresholds
+One of the most difficult things in monitoring is choosing alert thresholds. With many of these
+stats, there is no value we can offer as a threshold that will be guaranteed to work for you. It
+will depend on the size of your cluster, number of jobs, churn of tasks in the cluster, etc. We
+recommend you start with a strict value after viewing a small amount of collected data, and then
+adjust thresholds as you see fit. Feel free to ask us if you would like to validate that your alerts
+and thresholds make sense.
+
+#### `jvm_uptime_secs`
+Type: integer counter
+
+#### Description
+The number of seconds the JVM process has been running. Comes from
+[RuntimeMXBean#getUptime()](http://docs.oracle.com/javase/7/docs/api/java/lang/management/RuntimeMXBean.html#getUptime\(\))
+
+#### Alerting
+Detecting resets (decreasing values) on this stat will tell you that the scheduler is failing to
+stay alive.
+
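+A sketch of such a reset check in a polling monitor (hypothetical helper, assuming you sample
+`jvm_uptime_secs` periodically):
+
+    def uptime_reset(previous, current):
+        # A decrease means the JVM restarted since the last poll.
+        return previous is not None and current < previous
+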
+#### Triage
+Look at the scheduler logs to identify the reason the scheduler is exiting.
+
+#### `system_load_avg`
+Type: double gauge
+
+#### Description
+The current load average of the system for the last minute. Comes from
+[OperatingSystemMXBean#getSystemLoadAverage()](http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html?is-external=true#getSystemLoadAverage\(\)).
+
+#### Alerting
+A high sustained value suggests that the scheduler machine may be over-utilized.
+
+#### Triage
+Use standard unix tools like `top` and `ps` to track down the offending process(es).
+
+#### `process_cpu_cores_utilized`
+Type: double gauge
+
+#### Description
+The current number of CPU cores in use by the JVM process. This should not exceed the number of
+logical CPU cores on the machine. Derived from
+[OperatingSystemMXBean#getProcessCpuTime()](http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html)
+
+#### Alerting
+A high sustained value indicates that the scheduler is overworked. Due to current internal design
+limitations, if this value is sustained at `1`, there is a good chance the scheduler is under water.
+
+#### Triage
+There are two main inputs that tend to drive this figure: task scheduling attempts and status
+updates from Mesos.  You may see activity in the scheduler logs to give an indication of where
+time is being spent.  Beyond that, it really takes good familiarity with the code to effectively
+triage this.  We suggest engaging with an Aurora developer.
+
+#### `task_store_LOST`
+Type: integer gauge
+
+#### Description
+The number of tasks stored in the scheduler that are in the `LOST` state, and have been rescheduled.
+
+#### Alerting
+If this value is increasing at a high rate, it is a sign of trouble.
+
+#### Triage
+There are many sources of `LOST` tasks in Mesos: the scheduler, master, slave, and executor can all
+trigger this.  The first step is to look in the scheduler logs for `LOST` to identify where the
+state changes are originating.
+
+#### `scheduler_resource_offers`
+Type: integer counter
+
+#### Description
+The number of resource offers that the scheduler has received.
+
+#### Alerting
+For a healthy scheduler, this value must be increasing over time.
+
+#### Triage
+Assuming the scheduler is up and otherwise healthy, you will want to check if the master thinks it
+is sending offers. You should also look at the master's web interface to see if it has a large
+number of outstanding offers that are waiting to be returned.
+
+#### `framework_registered`
+Type: binary integer counter
+
+#### Description
+Will be `1` for the leading scheduler that is registered with the Mesos master, and `0` for
+passive schedulers.
+
+#### Alerting
+A sustained period without a `1` (or where `sum() != 1`) warrants investigation.
+
+#### Triage
+If there is no leading scheduler, look in the scheduler and master logs for why.  If there are
+multiple schedulers claiming leadership, this suggests a split brain and warrants filing a critical
+bug.
+
+#### `rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)`
+Type: rate ratio of integer counters
+
+#### Description
+This composes two counters to compute a windowed figure for the latency of replicated log writes.
+
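+In other words, over a sampling window (a sketch of the arithmetic; this is what the `/1e6`
+graphview query above computes):
+
+    avg_append_latency_ms = (nanos_total_now - nanos_total_prev)
+                          / (events_now - events_prev)
+                          / 1e6
+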
+#### Alerting
+A hike in this value suggests disk bandwidth contention.
+
+#### Triage
+Look in scheduler logs for any reported oddness with saving to the replicated log. Also use
+standard tools like `vmstat` and `iotop` to identify whether the disk has become slow or
+over-utilized. We suggest using a dedicated disk for the replicated log to mitigate this.
+
+#### `timed_out_tasks`
+Type: integer counter
+
+#### Description
+Tracks the number of times the scheduler has given up while waiting
+(for `-transient_task_state_timeout`) to hear back about a task in a transient state
+(e.g. `ASSIGNED`, `KILLING`), moving it to `LOST` before rescheduling.
+
+#### Alerting
+This value is currently known to increase occasionally when the scheduler fails over
+([AURORA-740](https://issues.apache.org/jira/browse/AURORA-740)). However, any large spike in this
+value warrants investigation.
+
+#### Triage
+The scheduler will log when it times out a task. You should trace the task ID of the timed out
+task into the master, slave, and/or executors to determine where the message was dropped.
+
+#### `http_500_responses_events`
+Type: integer counter
+
+#### Description
+The total number of HTTP 500 status responses sent by the scheduler. Includes API and asset serving.
+
+#### Alerting
+An increase warrants investigation.
+
+#### Triage
+Look in scheduler logs to identify why the scheduler returned a 500; there should be a stack trace.

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/resource-isolation.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/resource-isolation.md b/source/documentation/latest/resource-isolation.md
new file mode 100644
index 0000000..7e8d88d
--- /dev/null
+++ b/source/documentation/latest/resource-isolation.md
@@ -0,0 +1,147 @@
+Resource Isolation and Sizing
+=============================
+
+**NOTE**: Resource Isolation and Sizing is very much a work in progress.
+Both user-facing aspects and how it works under the hood are subject to
+change.
+
+- [Introduction](#introduction)
+- [CPU Isolation](#cpu-isolation)
+- [CPU Sizing](#cpu-sizing)
+- [Memory Isolation](#memory-isolation)
+- [Memory Sizing](#memory-sizing)
+- [Disk Space](#disk-space)
+- [Disk Space Sizing](#disk-space-sizing)
+- [Other Resources](#other-resources)
+
+## Introduction
+
+Aurora is a multi-tenant system; a single software instance runs on a
+server, serving multiple clients/tenants. To share resources among
+tenants, it implements isolation of:
+
+* CPU
+* memory
+* disk space
+
+CPU is a soft limit, and handled differently from memory and disk space.
+Too low a CPU value results in throttling your application and
+slowing it down. Memory and disk space are both hard limits; when your
+application goes over these values, it's killed.
+
+Let's look at each resource type in more detail:
+
+## CPU Isolation
+
+Mesos uses a quota-based CPU scheduler (the *Completely Fair Scheduler*)
+to provide consistent and predictable performance.  This is effectively
+a guarantee of resources -- you receive at least what you requested, but
+also no more than you've requested.
+
+The scheduler gives applications a CPU quota for every 100 ms interval.
+When an application uses its quota for an interval, it is throttled for
+the rest of the 100 ms. Usage resets for each interval and unused
+quota does not carry over.
+
+For example, an application specifying 4.0 CPU has access to 400 ms of
+CPU time every 100 ms. This CPU quota can be used in different ways,
+depending on the application and available resources. Consider the
+scenarios shown in this diagram.
+
+![CPU Availability](images/CPUavailability.png)
+
+* *Scenario A*: the application can use up to 4 cores continuously for
+every 100 ms interval. It is never throttled and starts processing
+new requests immediately.
+
+* *Scenario B*: the application uses up to 8 cores (depending on
+availability) but is throttled after 50 ms. The CPU quota resets at the
+start of each new 100 ms interval.
+
+* *Scenario C*: like Scenario A, but a garbage collection event in the
+second interval consumes all the CPU quota. The application is throttled
+for the remaining 75 ms of that interval and cannot service requests
+until the next interval. In this example, the garbage collection
+finished in one interval but, depending on how much garbage needs
+collecting, it may take more than one interval and further delay service
+of requests.
+
+*Technical Note*: Mesos treats logical cores (hyperthreads, also known
+as SMT cores) as the unit of CPU.
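+
+To make the quota arithmetic concrete, here is a sketch of how a 4.0-CPU
+request maps onto the Linux CFS bandwidth parameters (assuming
+cgroup-based isolation is in use; the cgroup path below is illustrative):
+
+    # Illustrative only: CFS bandwidth control enforces the quota.
+    # A request of 4.0 CPUs maps to 400 ms of CPU time per 100 ms period.
+    cat /sys/fs/cgroup/cpu/mesos/<container-id>/cpu.cfs_period_us   # 100000 (100 ms)
+    cat /sys/fs/cgroup/cpu/mesos/<container-id>/cpu.cfs_quota_us    # 400000 (400 ms)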
+
+## CPU Sizing
+
+To correctly size Aurora-run Mesos tasks, specify a per-shard CPU value
+that lets the task run at its desired performance when at peak load
+distributed across all shards. Include reserve capacity of at least 50%,
+possibly more, depending on how critical your service is (or how
+confident you are in your original estimate :-)), ideally by increasing
+the number of shards, which also improves resiliency. When running your
+application, observe its CPU stats over time. If it is consistently at
+or near your quota during peak load, consider increasing either the
+per-shard CPU or the number of shards.
+
+## Memory Isolation
+
+Mesos uses dedicated memory allocation. Your application always has
+access to the amount of memory specified in your configuration. The
+application's memory use is defined as the sum of the resident set size
+(RSS) of all processes in a shard. Each shard is considered
+independently.
+
+In other words, say you specified a memory size of 10GB. Each shard
+would receive 10GB of memory. If an individual shard's memory demands
+exceed 10GB, that shard is killed, but the other shards continue
+working.
+
+*Technical note*: Total memory size is not enforced at allocation time,
+so your application can request more than its allocation without getting
+an ENOMEM. However, it will be killed shortly after.
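+
+As a rough illustration of the accounting (not the enforcement mechanism
+itself), the RSS sum for a shard's process tree can be approximated from
+`/proc`; the process-group id below is a stand-in:
+
+    # Sum resident set size (VmRSS, in kB) over all processes in the
+    # shard's process group; isolation compares this total to the limit.
+    for pid in $(pgrep -g <pgid>); do
+      awk '/^VmRSS/ { print $2 }' /proc/$pid/status
+    done | awk '{ total += $1 } END { print total, "kB" }'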
+
+## Memory Sizing
+
+Size for your application's peak requirement. Observe the per-instance
+memory statistics over time, as memory requirements can vary over
+different periods. Remember that if your application exceeds its memory
+value, it will be killed, so you should also add a safety margin of
+around 10-20%. If you have the ability to do so, you may also want to
+put alerts on the per-instance memory.
+
+## Disk Space
+
+Disk space used by your application is defined as the sum of the files'
+disk space in your application's directory, including the `stdout` and
+`stderr` logged from your application. Each shard is considered
+independently. You should use off-node storage for your application's
+data whenever possible.
+
+In other words, say you specified a disk space size of 100MB. Each shard
+would receive 100MB of disk space. If an individual shard's disk space
+demands exceed 100MB, that shard is killed, but the other shards
+continue working.
+
+After your application finishes running, its allocated disk space is
+reclaimed. Thus, your job's final action should move any disk content
+that you want to keep, such as logs, to your home file system or other
+less transitory storage. Disk reclamation takes place an undefined
+period after the application finishes; until then, the disk contents are
+still available, but you shouldn't count on them remaining so.
+
+*Technical note*: Disk space is not enforced at write time, so your
+application can write above its quota without getting an ENOSPC, but it
+will be killed shortly after. This is subject to change.
+
+## Disk Space Sizing
+
+Size for your application's peak requirement. Rotate and discard log
+files as needed to stay within your quota. When running a Java process,
+add the maximum size of the Java heap to your disk space requirement to
+account for an out-of-memory error dumping the heap into the
+application's sandbox space.
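+
+As a sketch of that arithmetic (all figures below are assumed for
+illustration):
+
+    # Illustrative sizing for a Java shard (assumed numbers):
+    #   application artifacts and working files:      300 MB
+    #   capped, rotated stdout/stderr logs:           500 MB
+    #   -Xmx2g heap, dumped into the sandbox on OOM: 2048 MB
+    # Request roughly the sum, ~3 GB of disk, per shard.
+    java -Xmx2g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=. -jar service.jar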
+
+## Other Resources
+
+Other resources, such as network bandwidth, do not have any performance
+guarantees. For some resources, such as memory bandwidth, there are no
+practical sharing methods, so some combinations of applications
+colocated on the same host may cause contention.

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/scheduler-storage.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/scheduler-storage.md b/source/documentation/latest/scheduler-storage.md
new file mode 100644
index 0000000..1cd02f8
--- /dev/null
+++ b/source/documentation/latest/scheduler-storage.md
@@ -0,0 +1,47 @@
+# Snapshot Performance
+
+Periodically the scheduler writes a full snapshot of its state to the replicated log. To do this
+it needs to hold a global storage write lock while it writes out this data. In large clusters
+this has been observed to take up to 40 seconds. Long pauses can cause issues in the system,
+including delays in scheduling new tasks.
+
+The scheduler has two optimizations to reduce the size of snapshots and thus improve snapshot
+performance: compression and deduplication. Most users will want to enable both compression
+and deduplication.
+
+## Compression
+
+To reduce the size of the snapshot, the DEFLATE algorithm can be applied to the serialized
+bytes of the snapshot as they are written to the stream. This reduces the total number of
+bytes that need to be written to the replicated log, at the cost of CPU, and generally
+reduces the amount of time a snapshot takes.
+
+### Enabling Compression
+
+Snapshot compression is enabled via the `-deflate_snapshots` flag. This is the default since
+Aurora 0.5.0. All released versions of Aurora can read both compressed and uncompressed snapshots,
+so there are no backwards compatibility concerns associated with changing this flag.
+
+### Disabling compression
+
+Disable compression by passing `-deflate_snapshots=false`.
+
+## Deduplication
+
+In Aurora 0.6.0 a new snapshot format was introduced. Rather than write one configuration
+blob per Mesos task, this format stores each configuration blob once and gives each Mesos
+task a pointer to its blob. This format is not backwards compatible with earlier versions
+of Aurora.
+
+### Enabling Deduplication
+
+After upgrading Aurora to 0.6.0, enable deduplication with the `-deduplicate_snapshots` flag.
+After the first snapshot, the cluster will use the deduplicated format when writing to the
+replicated log. Snapshots are created periodically by the scheduler (according to
+the `-dlog_snapshot_interval` flag). An administrator can also force a snapshot operation with
+`aurora_admin snapshot`.
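+
+For example, a sketch of the relevant scheduler command-line flags enabling both
+optimizations (the snapshot interval value is an assumption, not a recommendation):
+
+    # Snapshot tuning: compress and deduplicate snapshots.
+    -deflate_snapshots=true
+    -deduplicate_snapshots=true
+    -dlog_snapshot_interval=1hrs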
+
+### Disabling Deduplication
+
+To disable deduplication, for example to roll back to an earlier version of Aurora,
+restart all of the cluster's schedulers with `-deduplicate_snapshots=false` and either
+wait for a snapshot or force one using `aurora_admin snapshot`.

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/sla.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/sla.md b/source/documentation/latest/sla.md
new file mode 100644
index 0000000..14e9108
--- /dev/null
+++ b/source/documentation/latest/sla.md
@@ -0,0 +1,176 @@
+Aurora SLA Measurement
+--------------
+
+- [Overview](#overview)
+- [Metric Details](#metric-details)
+  - [Platform Uptime](#platform-uptime)
+  - [Job Uptime](#job-uptime)
+  - [Median Time To Assigned (MTTA)](#median-time-to-assigned-\(mtta\))
+  - [Median Time To Running (MTTR)](#median-time-to-running-\(mttr\))
+- [Limitations](#limitations)
+
+## Overview
+
+The primary goal of the feature is the collection and monitoring of Aurora job SLA
+(Service Level Agreement) metrics that define a contractual relationship between the
+Aurora/Mesos platform and hosted services.
+
+The Aurora SLA feature currently supports stat collection only for service (non-cron)
+production jobs (`"production = True"` in your `.aurora` config).
+
+Counters that track SLA measurements are computed periodically within the scheduler.
+The individual instance metrics are refreshed every minute (configurable via
+`sla_stat_refresh_interval`). The instance counters are subsequently aggregated by
+relevant grouping types before being exported to the scheduler's `/vars` endpoint (when
+using `vagrant`, that would be `http://192.168.33.7:8081/vars`).
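+
+For example, to eyeball the exported SLA counters on a vagrant scheduler (the job key
+below is a stand-in):
+
+    # List all exported SLA counters, then pull one job's platform uptime.
+    curl -s http://192.168.33.7:8081/vars | grep '^sla_'
+    curl -s http://192.168.33.7:8081/vars | grep 'sla_<job_key>_platform_uptime_percent'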
+
+## Metric Details
+
+### Platform Uptime
+
+*Aggregate amount of time a job spends in a non-runnable state due to platform unavailability
+or scheduling delays. This metric tracks Aurora/Mesos uptime performance and reflects any
+system-caused downtime events (tasks LOST or DRAINED). Any user-initiated task kills/restarts
+will not degrade this metric.*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_platform_uptime_percent`
+* Per cluster - `sla_cluster_platform_uptime_percent`
+
+**Units:** percent
+
+A fault in the task environment may cause Aurora and Mesos to have different views of the
+task state, or to lose track of the task's existence. In such cases, the service task is
+marked as LOST and rescheduled by Aurora. For example, this may happen when the task stays
+in ASSIGNED or STARTING for too long, or when the Mesos slave becomes unhealthy (or
+disappears completely). The time between the task entering LOST and its replacement
+reaching the RUNNING state counts towards platform downtime.
+
+Another example of a platform downtime event is administrator-requested task rescheduling.
+This happens during planned Mesos slave maintenance, when all of the slave's tasks are
+marked as DRAINED and rescheduled elsewhere.
+
+To accurately calculate Platform Uptime, we must separate platform-incurred downtime from
+user actions that put a service instance in a non-operational state. It is simpler to
+isolate user-incurred downtime and treat all other downtime as platform-incurred.
+
+Currently, a user can cause a healthy service (task) downtime in only two ways: via `killTasks`
+or `restartShards` RPCs. For both, their affected tasks leave an audit state transition trail
+relevant to uptime calculations. By applying a special "SLA meaning" to exposed task state
+transition records, we can build a deterministic downtime trace for every given service instance.
+
+A task going through a state transition carries one of three possible SLA meanings
+(see [SlaAlgorithm.java](../src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java) for
+sla-to-task-state mapping):
+
+* Task is UP: starts a period where the task is considered to be up and running from the Aurora
+  platform standpoint.
+
+* Task is DOWN: starts a period where the task cannot reach the UP state for some
+  non-user-related reason. Counts towards instance downtime.
+
+* Task is REMOVED from SLA: starts a period where the task is not expected to be UP due to
+  user initiated action or failure. We ignore this period for the uptime calculation purposes.
+
+This metric is recalculated over the last sampling period (last minute) to account for
+any UP/DOWN/REMOVED events. It ignores any UP/DOWN events not immediately adjacent to the
+sampling interval as well as adjacent REMOVED events.
+
+### Job Uptime
+
+*Percentage of the job instances considered to be in RUNNING state for the specified
+duration relative to request time. This is a purely application-side metric that considers
+the aggregate uptime of all RUNNING instances. Any user- or platform-initiated restarts
+directly affect this metric.*
+
+**Collection scope:** We currently expose job uptime values at 5 pre-defined
+percentiles (50th, 75th, 90th, 95th, and 99th):
+
+* `sla_<job_key>_job_uptime_50_00_sec`
+* `sla_<job_key>_job_uptime_75_00_sec`
+* `sla_<job_key>_job_uptime_90_00_sec`
+* `sla_<job_key>_job_uptime_95_00_sec`
+* `sla_<job_key>_job_uptime_99_00_sec`
+
+**Units:** seconds
+
+You can also get customized real-time stats from the Aurora client. See `aurora sla -h`
+for more details.
+
+### Median Time To Assigned (MTTA)
+
+*Median time a job spends waiting for its tasks to be assigned to a host. This is a combined
+metric that helps track the dependency of scheduling performance on the requested resources
+(user scope) as well as the internal scheduler bin-packing algorithm efficiency (platform scope).*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_mtta_ms`
+* Per cluster - `sla_cluster_mtta_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
+[ResourceAggregates.java](../src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java)
+  * By CPU:
+    * `sla_cpu_small_mtta_ms`
+    * `sla_cpu_medium_mtta_ms`
+    * `sla_cpu_large_mtta_ms`
+    * `sla_cpu_xlarge_mtta_ms`
+    * `sla_cpu_xxlarge_mtta_ms`
+  * By RAM:
+    * `sla_ram_small_mtta_ms`
+    * `sla_ram_medium_mtta_ms`
+    * `sla_ram_large_mtta_ms`
+    * `sla_ram_xlarge_mtta_ms`
+    * `sla_ram_xxlarge_mtta_ms`
+  * By DISK:
+    * `sla_disk_small_mtta_ms`
+    * `sla_disk_medium_mtta_ms`
+    * `sla_disk_large_mtta_ms`
+    * `sla_disk_xlarge_mtta_ms`
+    * `sla_disk_xxlarge_mtta_ms`
+
+**Units:** milliseconds
+
+MTTA only considers instances that have already reached ASSIGNED state and ignores those
+that are still PENDING. This ensures straggler instances (e.g. with unreasonable resource
+constraints) do not affect metric curves.
+
+### Median Time To Running (MTTR)
+
+*Median time a job waits for its tasks to reach the RUNNING state. This is a comprehensive
+metric reflecting the overall time it takes Aurora/Mesos to start executing user content.*
+
+**Collection scope:**
+
+* Per job - `sla_<job_key>_mttr_ms`
+* Per cluster - `sla_cluster_mttr_ms`
+* Per instance size (small, medium, large, x-large, xx-large). Sizes are defined in:
+[ResourceAggregates.java](../src/main/java/org/apache/aurora/scheduler/base/ResourceAggregates.java)
+  * By CPU:
+    * `sla_cpu_small_mttr_ms`
+    * `sla_cpu_medium_mttr_ms`
+    * `sla_cpu_large_mttr_ms`
+    * `sla_cpu_xlarge_mttr_ms`
+    * `sla_cpu_xxlarge_mttr_ms`
+  * By RAM:
+    * `sla_ram_small_mttr_ms`
+    * `sla_ram_medium_mttr_ms`
+    * `sla_ram_large_mttr_ms`
+    * `sla_ram_xlarge_mttr_ms`
+    * `sla_ram_xxlarge_mttr_ms`
+  * By DISK:
+    * `sla_disk_small_mttr_ms`
+    * `sla_disk_medium_mttr_ms`
+    * `sla_disk_large_mttr_ms`
+    * `sla_disk_xlarge_mttr_ms`
+    * `sla_disk_xxlarge_mttr_ms`
+
+**Units:** milliseconds
+
+MTTR only considers instances in RUNNING state. This ensures straggler instances (e.g. with
+unreasonable resource constraints) do not affect metric curves.
+
+## Limitations
+
+* The availability of Aurora SLA metrics is bounded by the scheduler's availability.
+
+* All metrics are calculated at a pre-defined interval (currently set at 1 minute).
+  Scheduler restarts may result in missed collections.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/storage-config.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/storage-config.md b/source/documentation/latest/storage-config.md
new file mode 100644
index 0000000..023d730
--- /dev/null
+++ b/source/documentation/latest/storage-config.md
@@ -0,0 +1,142 @@
+# Storage Configuration And Maintenance
+
+- [Overview](#overview)
+- [Scheduler storage configuration flags](#scheduler-storage-configuration-flags)
+  - [Mesos replicated log configuration flags](#mesos-replicated-log-configuration-flags)
+    - [-native_log_quorum_size](#-native_log_quorum_size)
+    - [-native_log_file_path](#-native_log_file_path)
+    - [-native_log_zk_group_path](#-native_log_zk_group_path)
+  - [Backup configuration flags](#backup-configuration-flags)
+    - [-backup_interval](#-backup_interval)
+    - [-backup_dir](#-backup_dir)
+    - [-max_saved_backups](#-max_saved_backups)
+- [Recovering from a scheduler backup](#recovering-from-a-scheduler-backup)
+  - [Summary](#summary)
+  - [Preparation](#preparation)
+  - [Cleanup and re-initialize Mesos replicated log](#cleanup-and-re-initialize-mesos-replicated-log)
+  - [Restore from backup](#restore-from-backup)
+  - [Cleanup](#cleanup)
+
+## Overview
+
+This document summarizes Aurora storage configuration and maintenance details and is
+intended for use by anyone deploying and/or maintaining Aurora.
+
+For a high level overview of the Aurora storage architecture refer to [this document](/documentation/latest/storage/).
+
+## Scheduler storage configuration flags
+
+Below is a summary of scheduler storage configuration flags that either don't have default values
+or require attention before deploying in a production environment.
+
+### Mesos replicated log configuration flags
+
+#### -native_log_quorum_size
+Defines the Mesos replicated log quorum size. See
+[the replicated log configuration document](deploying-aurora-scheduler.md#replicated-log-configuration)
+on how to choose the right value.
+
+#### -native_log_file_path
+Location of the Mesos replicated log files. Consider allocating a dedicated disk (preferably SSD)
+for Mesos replicated log files to ensure optimal storage performance.
+
+#### -native_log_zk_group_path
+ZooKeeper path used for Mesos replicated log quorum discovery.
+
+See [code](../src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java) for
+other available Mesos replicated log configuration options and default values.
+
+### Backup configuration flags
+
+Configuration options for the Aurora scheduler backup manager.
+
+#### -backup_interval
+The interval on which the scheduler writes local storage backups.  The default is every hour.
+
+#### -backup_dir
+Directory to write backups to.
+
+#### -max_saved_backups
+Maximum number of backups to retain before deleting the oldest backup(s).
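+
+A sketch of these flags as passed to the scheduler (the values are illustrative, not
+recommendations):
+
+    # Hourly backups, kept for two days, written to a local directory.
+    -backup_interval=1hrs
+    -backup_dir=/var/lib/aurora/backups
+    -max_saved_backups=48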
+
+## Recovering from a scheduler backup
+
+- [Summary](#summary)
+- [Preparation](#preparation)
+- [Cleanup and re-initialize Mesos replicated log](#cleanup-and-re-initialize-mesos-replicated-log)
+- [Restore from backup](#restore-from-backup)
+- [Cleanup](#cleanup)
+
+**Be sure to read the entire page before attempting to restore from a backup, as it may have
+unintended consequences.**
+
+### Summary
+
+The restoration procedure replaces the existing (possibly corrupted) Mesos replicated log
+with an earlier, backed-up version, and requires all schedulers to be taken down
+temporarily while restoring. Once completed, the scheduler state resets to what it was
+when the backup was created. This means any jobs/tasks created or updated after the backup
+are unknown to the scheduler and will be killed shortly after the cluster restarts. All
+other tasks continue operating as normal.
+
+Usually, it is a bad idea to restore a backup that is not extremely recent (i.e. older
+than a few hours). This is because the scheduler will expect the cluster to look exactly
+as it did when the backup was taken, so any tasks that have been rescheduled since then
+will be killed.
+
+### Preparation
+
+Follow these steps to prepare the cluster for restoring from a backup:
+
+* Stop all scheduler instances
+
+* Consider blocking external traffic on the port defined by `-http_port` for all
+schedulers to prevent users from interacting with the scheduler during the restoration
+process. This will help with troubleshooting by reducing scheduler log noise, and will
+prevent users from making changes that would be erased after the backup snapshot is
+restored
+
+* The next steps are required to put the scheduler into a partially disabled state where
+it can still accept storage recovery requests but cannot schedule tasks or change task
+states. This may be accomplished by updating the following scheduler configuration
+options (see the sketch after this list):
+  * Set `-mesos_master_address` to a non-existent zk address. This will prevent the
+    scheduler from registering with Mesos. E.g.: `-mesos_master_address=zk://localhost:2181`
+  * Set `-max_registration_delay` to a sufficiently long interval to prevent a
+    registration timeout and, as a result, scheduler suicide. E.g.: `-max_registration_delay=360min`
+  * Make sure the `-gc_executor_path` option is not set, to prevent accidental task GC.
+    This is important, as the scheduler will attempt to reconcile the cluster state and
+    will kill all tasks when restarted with an empty Mesos replicated log.
+
+* Restart all schedulers
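+
+Taken together, a sketch of the temporary flag overrides for the recovery window (values
+as suggested above; revert them during the Cleanup step):
+
+    # Temporary recovery-mode settings:
+    -mesos_master_address=zk://localhost:2181   # unreachable on purpose: no Mesos registration
+    -max_registration_delay=360min              # avoid registration-timeout suicide
+    # ...and make sure -gc_executor_path is NOT set.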
+
+### Cleanup and re-initialize Mesos replicated log
+
+Get rid of the corrupted files and re-initialize the Mesos replicated log:
+
+* Stop schedulers
+* Delete all files under `-native_log_file_path` on all schedulers
+* Initialize Mesos replica's log file: `mesos-log initialize <-native_log_file_path>`
+* Restart schedulers
+
+### Restore from backup
+
+At this point the scheduler is ready to rehydrate from the backup:
+
+* Identify the leading scheduler by:
+  * running `aurora_admin get_scheduler <cluster>` - if the scheduler is responsive
+  * examining scheduler logs
+  * or examining the ZooKeeper registration under the path defined by `-zk_endpoints`
+    and `-serverset_path`
+
+* Locate the desired backup file, copy it to the leading scheduler, and stage the
+recovery by running the following command on the leader:
+`aurora_admin scheduler_stage_recovery <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm>`
+
+* At this point, the recovery snapshot is staged and available for manual verification/modification
+via `aurora_admin scheduler_print_recovery_tasks` and `scheduler_delete_recovery_tasks` commands.
+See `aurora_admin help <command>` for usage details.
+
+* Commit the recovery. This instructs the scheduler to overwrite the existing Mesos
+replicated log with the provided backup snapshot and initiate a mandatory failover:
+`aurora_admin scheduler_commit_recovery <cluster>`
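+
+Putting the restore steps together on the leading scheduler (the cluster name and backup
+timestamp are stand-ins; check `aurora_admin help <command>` for exact usage):
+
+    aurora_admin scheduler_stage_recovery <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm>
+    aurora_admin scheduler_print_recovery_tasks <cluster>    # verify the staged tasks
+    aurora_admin scheduler_commit_recovery <cluster>         # overwrite the log, force failover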
+
+### Cleanup
+Undo any modifications made during the [Preparation](#preparation) sequence.
+

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/storage.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/storage.md b/source/documentation/latest/storage.md
new file mode 100644
index 0000000..6ffed54
--- /dev/null
+++ b/source/documentation/latest/storage.md
@@ -0,0 +1,88 @@
+# Aurora Scheduler Storage
+
+- [Overview](#overview)
+- [Reads, writes, modifications](#reads-writes-modifications)
+  - [Read lifecycle](#read-lifecycle)
+  - [Write lifecycle](#write-lifecycle)
+- [Atomicity, consistency and isolation](#atomicity-consistency-and-isolation)
+- [Population on restart](#population-on-restart)
+
+## Overview
+
+The Aurora scheduler maintains data that needs to be persisted to survive failovers and
+restarts. For example:
+
+* Task configurations and scheduled task instances
+* Job update configurations and update progress
+* Production resource quotas
+* Mesos resource offer host attributes
+
+Aurora solves its persistence needs by leveraging the Mesos implementation of a Paxos
+replicated log [[1]](https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf)
+[[2]](http://en.wikipedia.org/wiki/State_machine_replication) with key-value
+[LevelDB](https://github.com/google/leveldb) storage as the persistence medium.
+
+Conceptually, it can be represented by the following major components:
+
+* Volatile storage: an in-memory cache of all available data, implemented via the
+in-memory [H2 Database](http://www.h2database.com/html/main.html) and accessed via
+[MyBatis](http://mybatis.github.io/mybatis-3/).
+* Log manager: the interface between Aurora storage and the Mesos replicated log. The
+default schema format is [thrift](https://github.com/apache/thrift). Data is stored in
+serialized binary form.
+* Snapshot manager: all data is periodically persisted to the Mesos replicated log in a
+single snapshot. This helps establish periodic recovery checkpoints and speeds up
+volatile storage recovery on restart.
+* Backup manager: as a precaution, snapshots are periodically written out to backup
+files. This addresses the [disaster recovery problem](storage-config.md#recovering-from-a-scheduler-backup)
+in case of a complete loss or corruption of the Mesos log files.
+
+![Storage hierarchy](images/storage_hierarchy.png)
+
+## Reads, writes, modifications
+
+All services in Aurora access data via a set of predefined store interfaces (aka stores),
+logically grouped by the type of data they serve. Every interface defines a specific set
+of operations allowed on the data, thus abstracting away the storage access and the
+actual persistence implementation. The latter is especially important given the general
+immutability of persisted data. With the Mesos replicated log as the underlying
+persistence solution, data can be read and written easily but not modified. All
+modifications are simulated by saving new versions of modified objects. This feature,
+together with general performance considerations, justifies the existence of the volatile
+in-memory store.
+
+### Read lifecycle
+
+There are two types of reads available in Aurora: consistent and weakly-consistent. The
+difference is explained [below](#atomicity-consistency-and-isolation).
+
+All reads are served from volatile storage, making reads generally cheap operations from
+a performance standpoint. The majority of the volatile stores are backed by the in-memory
+H2 database. This allows for rich schema definitions, queries, and relationships that
+key-value storage cannot match.
+
+### Write lifecycle
+
+Writes are more involved operations since, in addition to updating the volatile store,
+data has to be appended to the replicated log. Data is not available for reads until
+fully acknowledged by both the replicated log and volatile storage.
+
+## Atomicity, consistency and isolation
+
+Aurora uses [write-ahead logging](http://en.wikipedia.org/wiki/Write-ahead_logging) to ensure
+consistency between replicated and volatile storage. In Aurora, data is first written into the
+replicated log and only then updated in the volatile store.
+
+Aurora storage uses read-write locks to serialize data mutations and provide a consistent
+view of the available data. The `Storage` interface exposes 3 major types of operations:
+
+* `consistentRead` - access is locked using the reader's lock and provides a consistent
+view on read
+* `weaklyConsistentRead` - access is lock-less; this delivers the best contention
+performance but may result in stale reads
+* `write` - access is fully serialized using the writer's lock; operation success requires
+both the volatile and replicated writes to succeed
+
+The consistency of the volatile store is enforced via H2 transactional isolation.
+
+## Population on restart
+
+Any time a scheduler restarts, it restores its volatile state from the most recent
+position recorded in the replicated log by restoring the snapshot and replaying
+individual log entries on top of it, fully recovering the state up to the last write.
+

http://git-wip-us.apache.org/repos/asf/incubator-aurora-website/blob/c43a3a2d/source/documentation/latest/test-resource-generation.md
----------------------------------------------------------------------
diff --git a/source/documentation/latest/test-resource-generation.md b/source/documentation/latest/test-resource-generation.md
new file mode 100644
index 0000000..1969385
--- /dev/null
+++ b/source/documentation/latest/test-resource-generation.md
@@ -0,0 +1,25 @@
+# Generating test resources
+
+## Background
+The Aurora source repository and distributions contain several
+[binary files](../src/test/resources/org/apache/thermos/root/checkpoints) used to
+qualify the backwards compatibility of thermos with checkpoint data. Since
+thermos persists state to disk, to be read by other components (the GC executor
+and the thermos observer), it is important that we have tests that prevent
+regressions affecting the ability to parse previously-written data.
+
+## Generating test files
+The files included represent persisted checkpoints that exercise different
+features of thermos. The existing files should not be modified unless
+we are accepting backwards incompatibility, such as with a major release.
+
+It is not practical to write source code to generate these files on the fly,
+as source would be vulnerable to drift (e.g. due to refactoring) in ways
+that would undermine the goal of ensuring backwards compatibility.
+
+The most common reason to add a new checkpoint file is to provide coverage for a
+new thermos feature that alters the data format. This is accomplished by writing
+and running a [job configuration](/documentation/latest/configuration-reference/)
+that exercises the feature, and copying the checkpoint file from the sandbox
+directory; by default this is `/var/run/thermos/checkpoints/<aurora task id>`.
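+
+For instance, after running such a job, the copy step might look like this (the task id
+is a placeholder, and the destination mirrors the resources directory linked above):
+
+    # Copy the freshly written checkpoint into the test resources tree.
+    cp /var/run/thermos/checkpoints/<aurora task id> \
+       src/test/resources/org/apache/thermos/root/checkpoints/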

