aurora-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From san...@apache.org
Subject svn commit: r1799392 [12/14] - in /aurora/site: publish/blog/aurora-0-18-0-released/ publish/documentation/0.18.0/ publish/documentation/0.18.0/additional-resources/ publish/documentation/0.18.0/additional-resources/presentations/ publish/documentation...
Date Wed, 21 Jun 2017 06:36:25 GMT
Added: aurora/site/source/documentation/0.18.0/operations/installation.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/operations/installation.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/operations/installation.md (added)
+++ aurora/site/source/documentation/0.18.0/operations/installation.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,256 @@
+# Installing Aurora
+
+Source and binary distributions can be found on our
+[downloads](https://aurora.apache.org/downloads/) page.  Installing from binary packages is
+recommended for most.
+
+- [Installing the scheduler](#installing-the-scheduler)
+- [Installing worker components](#installing-worker-components)
+- [Installing the client](#installing-the-client)
+- [Installing Mesos](#installing-mesos)
+- [Troubleshooting](#troubleshooting)
+
+If our binay packages don't suite you, our package build toolchain makes it easy to build your
+own packages. See the [instructions](https://github.com/apache/aurora-packaging) to learn how.
+
+
+## Machine profiles
+
+Given that many of these components communicate over the network, there are numerous ways you could
+assemble them to create an Aurora cluster.  The simplest way is to think in terms of three machine
+profiles:
+
+### Coordinator
+**Components**: ZooKeeper, Aurora scheduler, Mesos master
+
+A small number of machines (typically 3 or 5) responsible for cluster orchestration.  In most cases
+it is fine to co-locate these components in anything but very large clusters (> 1000 machines).
+Beyond that point, operators will likely want to manage these services on separate machines.
+In particular, you will want to use separate ZooKeeper ensembles for leader election and
+service discovery. Otherwise a service discovery error or outage can take down the entire cluster.
+
+In practice, 5 coordinators have been shown to reliably manage clusters with tens of thousands of
+machines.
+
+### Worker
+**Components**: Aurora executor, Aurora observer, Mesos agent
+
+The bulk of the cluster, where services will actually run.
+
+### Client
+**Components**: Aurora client, Aurora admin client
+
+Any machines that users submit jobs from.
+
+
+## Installing the scheduler
+### Ubuntu Trusty
+
+1. Install Mesos
+   Skip down to [install mesos](#mesos-on-ubuntu-trusty), then run:
+
+        sudo start mesos-master
+
+2. Install ZooKeeper
+
+        sudo apt-get install -y zookeeperd
+
+3. Install the Aurora scheduler
+
+        sudo add-apt-repository -y ppa:openjdk-r/ppa
+        sudo apt-get update
+        sudo apt-get install -y openjdk-8-jre-headless wget
+
+        sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
+
+        wget -c https://apache.bintray.com/aurora/ubuntu-trusty/aurora-scheduler_0.17.0_amd64.deb
+        sudo dpkg -i aurora-scheduler_0.17.0_amd64.deb
+
+### CentOS 7
+
+1. Install Mesos
+   Skip down to [install mesos](#mesos-on-centos-7), then run:
+
+        sudo systemctl start mesos-master
+
+2. Install ZooKeeper
+
+        sudo rpm -Uvh https://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
+        sudo yum install -y java-1.8.0-openjdk-headless zookeeper-server
+
+        sudo service zookeeper-server init
+        sudo systemctl start zookeeper-server
+
+3. Install the Aurora scheduler
+
+        sudo yum install -y wget
+
+        wget -c https://apache.bintray.com/aurora/centos-7/aurora-scheduler-0.17.0-1.el7.centos.aurora.x86_64.rpm
+        sudo yum install -y aurora-scheduler-0.17.0-1.el7.centos.aurora.x86_64.rpm
+
+### Finalizing
+By default, the scheduler will start in an uninitialized mode.  This is because external
+coordination is necessary to be certain operator error does not result in a quorum of schedulers
+starting up and believing their databases are empty when in fact they should be re-joining a
+cluster.
+
+Because of this, a fresh install of the scheduler will need intervention to start up.  First,
+stop the scheduler service.
+Ubuntu: `sudo stop aurora-scheduler`
+CentOS: `sudo systemctl stop aurora`
+
+Now initialize the database:
+
+    sudo -u aurora mkdir -p /var/lib/aurora/scheduler/db
+    sudo -u aurora mesos-log initialize --path=/var/lib/aurora/scheduler/db
+
+Now you can start the scheduler back up.
+Ubuntu: `sudo start aurora-scheduler`
+CentOS: `sudo systemctl start aurora`
+
+
+## Installing worker components
+### Ubuntu Trusty
+
+1. Install Mesos
+   Skip down to [install mesos](#mesos-on-ubuntu-trusty), then run:
+
+        start mesos-slave
+
+2. Install Aurora executor and observer
+
+        sudo apt-get install -y python2.7 wget
+
+        # NOTE: This appears to be a missing dependency of the mesos deb package and is needed
+        # for the python mesos native bindings.
+        sudo apt-get -y install libcurl4-nss-dev
+
+        wget -c https://apache.bintray.com/aurora/ubuntu-trusty/aurora-executor_0.17.0_amd64.deb
+        sudo dpkg -i aurora-executor_0.17.0_amd64.deb
+
+### CentOS 7
+
+1. Install Mesos
+   Skip down to [install mesos](#mesos-on-centos-7), then run:
+
+        sudo systemctl start mesos-slave
+
+2. Install Aurora executor and observer
+
+        sudo yum install -y python2 wget
+
+        wget -c https://apache.bintray.com/aurora/centos-7/aurora-executor-0.17.0-1.el7.centos.aurora.x86_64.rpm
+        sudo yum install -y aurora-executor-0.17.0-1.el7.centos.aurora.x86_64.rpm
+
+### Worker Configuration
+The executor typically does not require configuration.  Command line arguments can
+be passed to the executor using a command line argument on the scheduler.
+
+The observer needs to be configured to look at the correct mesos directory in order to find task
+sandboxes. You should 1st find the Mesos working directory by looking for the Mesos agent
+`--work_dir` flag. You should see something like:
+
+        ps -eocmd | grep "mesos-slave" | grep -v grep | tr ' ' '\n' | grep "\--work_dir"
+        --work_dir=/var/lib/mesos
+
+If the flag is not set, you can view the default value like so:
+
+        mesos-slave --help
+        Usage: mesos-slave [options]
+
+          ...
+          --work_dir=VALUE      Directory path to place framework work directories
+                                (default: /tmp/mesos)
+          ...
+
+The value you find for `--work_dir`, `/var/lib/mesos` in this example, should match the Aurora
+observer value for `--mesos-root`.  You can look for that setting in a similar way on a worker
+node by grepping for `thermos_observer` and `--mesos-root`.  If the flag is not set, you can view
+the default value like so:
+
+        thermos_observer -h
+        Options:
+          ...
+          --mesos-root=MESOS_ROOT
+                                The mesos root directory to search for Thermos
+                                executor sandboxes [default: /var/lib/mesos]
+          ...
+
+In this case the default is `/var/lib/mesos` and we have a match. If there is no match, you can
+either adjust the mesos-master start script(s) and restart the master(s) or else adjust the
+Aurora observer start scripts and restart the observers.  To adjust the Aurora observer:
+
+#### Ubuntu Trusty
+
+    sudo sh -c 'echo "MESOS_ROOT=/tmp/mesos" >> /etc/default/thermos'
+
+#### CentOS 7
+
+Make an edit to add the `--mesos-root` flag resulting in something like:
+
+    grep -A5 OBSERVER_ARGS /etc/sysconfig/thermos
+    OBSERVER_ARGS=(
+      --port=1338
+      --mesos-root=/tmp/mesos
+      --log_to_disk=NONE
+      --log_to_stderr=google:INFO
+    )
+
+
+## Installing the client
+### Ubuntu Trusty
+
+    sudo apt-get install -y python2.7 wget
+
+    wget -c https://apache.bintray.com/aurora/ubuntu-trusty/aurora-tools_0.17.0_amd64.deb
+    sudo dpkg -i aurora-tools_0.17.0_amd64.deb
+
+### CentOS 7
+
+    sudo yum install -y python2 wget
+
+    wget -c https://apache.bintray.com/aurora/centos-7/aurora-tools-0.17.0-1.el7.centos.aurora.x86_64.rpm
+    sudo yum install -y aurora-tools-0.17.0-1.el7.centos.aurora.x86_64.rpm
+
+### Mac OS X
+
+    brew upgrade
+    brew install aurora-cli
+
+### Client Configuration
+Client configuration lives in a json file that describes the clusters available and how to reach
+them.  By default this file is at `/etc/aurora/clusters.json`.
+
+Jobs may be submitted to the scheduler using the client, and are described with
+[job configurations](../../reference/configuration/) expressed in `.aurora` files.  Typically you will
+maintain a single job configuration file to describe one or more deployment environments (e.g.
+dev, test, prod) for a production job.
+
+
+## Installing Mesos
+Mesos uses a single package for the Mesos master and agent.  As a result, the package dependencies
+are identical for both.
+
+### Mesos on Ubuntu Trusty
+
+    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
+    DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
+    CODENAME=$(lsb_release -cs)
+
+    echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
+      sudo tee /etc/apt/sources.list.d/mesosphere.list
+    sudo apt-get -y update
+
+    # Use `apt-cache showpkg mesos | grep [version]` to find the exact version.
+    sudo apt-get -y install mesos=1.1.0-2.0.107.ubuntu1404_amd64.deb
+
+### Mesos on CentOS 7
+
+    sudo rpm -Uvh https://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-1.noarch.rpm
+    sudo yum -y install mesos-1.1.0
+
+
+## Troubleshooting
+
+So you've started your first cluster and are running into some issues? We've collected some common
+stumbling blocks and solutions in our [Troubleshooting guide](../troubleshooting/) to help get you moving.

Added: aurora/site/source/documentation/0.18.0/operations/monitoring.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/operations/monitoring.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/operations/monitoring.md (added)
+++ aurora/site/source/documentation/0.18.0/operations/monitoring.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,181 @@
+# Monitoring your Aurora cluster
+
+Before you start running important services in your Aurora cluster, it's important to set up
+monitoring and alerting of Aurora itself.  Most of your monitoring can be against the scheduler,
+since it will give you a global view of what's going on.
+
+## Reading stats
+The scheduler exposes a *lot* of instrumentation data via its HTTP interface. You can get a quick
+peek at the first few of these in our vagrant image:
+
+    $ vagrant ssh -c 'curl -s localhost:8081/vars | head'
+    async_tasks_completed 1004
+    attribute_store_fetch_all_events 15
+    attribute_store_fetch_all_events_per_sec 0.0
+    attribute_store_fetch_all_nanos_per_event 0.0
+    attribute_store_fetch_all_nanos_total 3048285
+    attribute_store_fetch_all_nanos_total_per_sec 0.0
+    attribute_store_fetch_one_events 3391
+    attribute_store_fetch_one_events_per_sec 0.0
+    attribute_store_fetch_one_nanos_per_event 0.0
+    attribute_store_fetch_one_nanos_total 454690753
+
+These values are served as `Content-Type: text/plain`, with each line containing a space-separated metric
+name and value. Values may be integers, doubles, or strings (note: strings are static, others
+may be dynamic).
+
+If your monitoring infrastructure prefers JSON, the scheduler exports that as well:
+
+    $ vagrant ssh -c 'curl -s localhost:8081/vars.json | python -mjson.tool | head'
+    {
+        "async_tasks_completed": 1009,
+        "attribute_store_fetch_all_events": 15,
+        "attribute_store_fetch_all_events_per_sec": 0.0,
+        "attribute_store_fetch_all_nanos_per_event": 0.0,
+        "attribute_store_fetch_all_nanos_total": 3048285,
+        "attribute_store_fetch_all_nanos_total_per_sec": 0.0,
+        "attribute_store_fetch_one_events": 3409,
+        "attribute_store_fetch_one_events_per_sec": 0.0,
+        "attribute_store_fetch_one_nanos_per_event": 0.0,
+
+This will be the same data as above, served with `Content-Type: application/json`.
+
+## Viewing live stat samples on the scheduler
+The scheduler uses the Twitter commons stats library, which keeps an internal time-series database
+of exported variables - nearly everything in `/vars` is available for instant graphing.  This is
+useful for debugging, but is not a replacement for an external monitoring system.
+
+You can view these graphs on a scheduler at `/graphview`.  It supports some composition and
+aggregation of values, which can be invaluable when triaging a problem.  For example, if you have
+the scheduler running in vagrant, check out these links:
+[simple graph](http://192.168.33.7:8081/graphview?query=jvm_uptime_secs)
+[complex composition](http://192.168.33.7:8081/graphview?query=rate\(scheduler_log_native_append_nanos_total\)%2Frate\(scheduler_log_native_append_events\)%2F1e6)
+
+### Counters and gauges
+Among numeric stats, there are two fundamental types of stats exported: _counters_ and _gauges_.
+Counters are guaranteed to be monotonically-increasing for the lifetime of a process, while gauges
+may decrease in value.  Aurora uses counters to represent things like the number of times an event
+has occurred, and gauges to capture things like the current length of a queue.  Counters are a
+natural fit for accurate composition into [rate ratios](http://en.wikipedia.org/wiki/Rate_ratio)
+(useful for sample-resistant latency calculation), while gauges are not.
+
+# Alerting
+
+## Quickstart
+If you are looking for just bare-minimum alerting to get something in place quickly, set up alerting
+on `framework_registered` and `task_store_LOST`. These will give you a decent picture of overall
+health.
+
+## A note on thresholds
+One of the most difficult things in monitoring is choosing alert thresholds. With many of these
+stats, there is no value we can offer as a threshold that will be guaranteed to work for you. It
+will depend on the size of your cluster, number of jobs, churn of tasks in the cluster, etc. We
+recommend you start with a strict value after viewing a small amount of collected data, and then
+adjust thresholds as you see fit. Feel free to ask us if you would like to validate that your alerts
+and thresholds make sense.
+
+## Important stats
+
+### `jvm_uptime_secs`
+Type: integer counter
+
+The number of seconds the JVM process has been running. Comes from
+[RuntimeMXBean#getUptime()](http://docs.oracle.com/javase/7/docs/api/java/lang/management/RuntimeMXBean.html#getUptime\(\))
+
+Detecting resets (decreasing values) on this stat will tell you that the scheduler is failing to
+stay alive.
+
+Look at the scheduler logs to identify the reason the scheduler is exiting.
+
+### `system_load_avg`
+Type: double gauge
+
+The current load average of the system for the last minute. Comes from
+[OperatingSystemMXBean#getSystemLoadAverage()](http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html?is-external=true#getSystemLoadAverage\(\)).
+
+A high sustained value suggests that the scheduler machine may be over-utilized.
+
+Use standard unix tools like `top` and `ps` to track down the offending process(es).
+
+### `process_cpu_cores_utilized`
+Type: double gauge
+
+The current number of CPU cores in use by the JVM process. This should not exceed the number of
+logical CPU cores on the machine. Derived from
+[OperatingSystemMXBean#getProcessCpuTime()](http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html)
+
+A high sustained value indicates that the scheduler is overworked. Due to current internal design
+limitations, if this value is sustained at `1`, there is a good chance the scheduler is under water.
+
+There are two main inputs that tend to drive this figure: task scheduling attempts and status
+updates from Mesos.  You may see activity in the scheduler logs to give an indication of where
+time is being spent.  Beyond that, it really takes good familiarity with the code to effectively
+triage this.  We suggest engaging with an Aurora developer.
+
+### `task_store_LOST`
+Type: integer gauge
+
+The number of tasks stored in the scheduler that are in the `LOST` state, and have been rescheduled.
+
+If this value is increasing at a high rate, it is a sign of trouble.
+
+There are many sources of `LOST` tasks in Mesos: the scheduler, master, agent, and executor can all
+trigger this.  The first step is to look in the scheduler logs for `LOST` to identify where the
+state changes are originating.
+
+### `scheduler_resource_offers`
+Type: integer counter
+
+The number of resource offers that the scheduler has received.
+
+For a healthy scheduler, this value must be increasing over time.
+
+Assuming the scheduler is up and otherwise healthy, you will want to check if the master thinks it
+is sending offers. You should also look at the master's web interface to see if it has a large
+number of outstanding offers that it is waiting to be returned.
+
+### `framework_registered`
+Type: binary integer counter
+
+Will be `1` for the leading scheduler that is registered with the Mesos master, `0` for passive
+schedulers,
+
+A sustained period without a `1` (or where `sum() != 1`) warrants investigation.
+
+If there is no leading scheduler, look in the scheduler and master logs for why.  If there are
+multiple schedulers claiming leadership, this suggests a split brain and warrants filing a critical
+bug.
+
+### `rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)`
+Type: rate ratio of integer counters
+
+This composes two counters to compute a windowed figure for the latency of replicated log writes.
+
+A hike in this value suggests disk bandwidth contention.
+
+Look in scheduler logs for any reported oddness with saving to the replicated log. Also use
+standard tools like `vmstat` and `iotop` to identify whether the disk has become slow or
+over-utilized. We suggest using a dedicated disk for the replicated log to mitigate this.
+
+### `timed_out_tasks`
+Type: integer counter
+
+Tracks the number of times the scheduler has given up while waiting
+(for `-transient_task_state_timeout`) to hear back about a task that is in a transient state
+(e.g. `ASSIGNED`, `KILLING`), and has moved to `LOST` before rescheduling.
+
+This value is currently known to increase occasionally when the scheduler fails over
+([AURORA-740](https://issues.apache.org/jira/browse/AURORA-740)). However, any large spike in this
+value warrants investigation.
+
+The scheduler will log when it times out a task. You should trace the task ID of the timed out
+task into the master, agent, and/or executors to determine where the message was dropped.
+
+### `http_500_responses_events`
+Type: integer counter
+
+The total number of HTTP 500 status responses sent by the scheduler. Includes API and asset serving.
+
+An increase warrants investigation.
+
+Look in scheduler logs to identify why the scheduler returned a 500, there should be a stack trace.

Added: aurora/site/source/documentation/0.18.0/operations/security.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/operations/security.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/operations/security.md (added)
+++ aurora/site/source/documentation/0.18.0/operations/security.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,362 @@
+Securing your Aurora Cluster
+============================
+
+Aurora integrates with [Apache Shiro](http://shiro.apache.org/) to provide security
+controls for its API. In addition to providing some useful features out of the box, Shiro
+also allows Aurora cluster administrators to adapt the security system to their organization’s
+existing infrastructure. The announcer in the Aurora thermos executor also supports security
+controls for talking to ZooKeeper.
+
+
+- [Enabling Security](#enabling-security)
+- [Authentication](#authentication)
+	- [HTTP Basic Authentication](#http-basic-authentication)
+		- [Server Configuration](#server-configuration)
+		- [Client Configuration](#client-configuration)
+	- [HTTP SPNEGO Authentication (Kerberos)](#http-spnego-authentication-kerberos)
+		- [Server Configuration](#server-configuration-1)
+		- [Client Configuration](#client-configuration-1)
+- [Authorization](#authorization)
+	- [Using an INI file to define security controls](#using-an-ini-file-to-define-security-controls)
+		- [Caveats](#caveats)
+- [Implementing a Custom Realm](#implementing-a-custom-realm)
+	- [Packaging a realm module](#packaging-a-realm-module)
+- [Announcer Authentication](#announcer-authentication)
+    - [ZooKeeper authentication configuration](#zookeeper-authentication-configuration)
+    - [Executor settings](#executor-settings)
+- [Scheduler HTTPS](#scheduler-https)
+- [Known Issues](#known-issues)
+
+# Enabling Security
+
+There are two major components of security:
+[authentication and authorization](http://en.wikipedia.org/wiki/Authentication#Authorization).  A
+cluster administrator may choose the approach used for each, and may also implement custom
+mechanisms for either.  Later sections describe the options available. To enable authentication
+ for the announcer, see [Announcer Authentication](#announcer-authentication)
+
+
+# Authentication
+
+The scheduler must be configured with instructions for how to process authentication
+credentials at a minimum.  There are currently two built-in authentication schemes -
+[HTTP Basic Authentication](http://en.wikipedia.org/wiki/Basic_access_authentication), and
+[SPNEGO](http://en.wikipedia.org/wiki/SPNEGO) (Kerberos).
+
+## HTTP Basic Authentication
+
+Basic Authentication is a very quick way to add *some* security.  It is supported
+by all major browsers and HTTP client libraries with minimal work.  However,
+before relying on Basic Authentication you should be aware of the [security
+considerations](http://tools.ietf.org/html/rfc2617#section-4).
+
+### Server Configuration
+
+At a minimum you need to set 4 command-line flags on the scheduler:
+
+```
+-http_authentication_mechanism=BASIC
+-shiro_realm_modules=INI_AUTHNZ
+-shiro_ini_path=path/to/security.ini
+```
+
+And create a security.ini file like so:
+
+```
+[users]
+sally = apple, admin
+
+[roles]
+admin = *
+```
+
+The details of the security.ini file are explained below. Note that this file contains plaintext,
+unhashed passwords.
+
+### Client Configuration
+
+To configure the client for HTTP Basic authentication, add an entry to ~/.netrc with your credentials
+
+```
+% cat ~/.netrc
+# ...
+
+machine aurora.example.com
+login sally
+password apple
+
+# ...
+```
+
+No changes are required to `clusters.json`.
+
+## HTTP SPNEGO Authentication (Kerberos)
+
+### Server Configuration
+At a minimum you need to set 6 command-line flags on the scheduler:
+
+```
+-http_authentication_mechanism=NEGOTIATE
+-shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ
+-kerberos_server_principal=HTTP/aurora.example.com@EXAMPLE.COM
+-kerberos_server_keytab=path/to/aurora.example.com.keytab
+-shiro_ini_path=path/to/security.ini
+```
+
+And create a security.ini file like so:
+
+```
+% cat path/to/security.ini
+[users]
+sally = _, admin
+
+[roles]
+admin = *
+```
+
+What's going on here? First, Aurora must be configured to request Kerberos credentials when presented with an
+unauthenticated request. This is achieved by setting
+
+```
+-http_authentication_mechanism=NEGOTIATE
+```
+
+Next, a Realm module must be configured to **authenticate** the current request using the Kerberos
+credentials that were requested. Aurora ships with a realm module that can do this
+
+```
+-shiro_realm_modules=KERBEROS5_AUTHN[,...]
+```
+
+The Kerberos5Realm requires a keytab file and a server principal name. The principal name will usually
+be in the form `HTTP/aurora.example.com@EXAMPLE.COM`.
+
+```
+-kerberos_server_principal=HTTP/aurora.example.com@EXAMPLE.COM
+-kerberos_server_keytab=path/to/aurora.example.com.keytab
+```
+
+The Kerberos5 realm module is authentication-only. For scheduler security to work you must also
+enable a realm module that provides an Authorizer implementation. For example, to do this using the
+IniShiroRealmModule:
+
+```
+-shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ
+```
+
+You can then configure authorization using a security.ini file as described below
+(the password field is ignored). You must configure the realm module with the path to this file:
+
+```
+-shiro_ini_path=path/to/security.ini
+```
+
+### Client Configuration
+To use Kerberos on the client-side you must build Kerberos-enabled client binaries. Do this with
+
+```
+./pants binary src/main/python/apache/aurora/kerberos:kaurora
+./pants binary src/main/python/apache/aurora/kerberos:kaurora_admin
+```
+
+You must also configure each cluster where you've enabled Kerberos on the scheduler
+to use Kerberos authentication. Do this by setting `auth_mechanism` to `KERBEROS`
+in `clusters.json`.
+
+```
+% cat ~/.aurora/clusters.json
+{
+    "devcluser": {
+        "auth_mechanism": "KERBEROS",
+        ...
+    },
+    ...
+}
+```
+
+# Authorization
+Given a means to authenticate the entity a client claims they are, we need to define what privileges they have.
+
+## Using an INI file to define security controls
+
+The simplest security configuration for Aurora is an INI file on the scheduler.  For small
+clusters, or clusters where the users and access controls change relatively infrequently, this is
+likely the preferred approach.  However you may want to avoid this approach if access permissions
+are rapidly changing, or if your access control information already exists in another system.
+
+You can enable INI-based configuration with following scheduler command line arguments:
+
+```
+-http_authentication_mechanism=BASIC
+-shiro_ini_path=path/to/security.ini
+```
+
+*note* As the argument name reveals, this is using Shiro’s
+[IniRealm](http://shiro.apache.org/configuration.html#Configuration-INIConfiguration) behind
+the scenes.
+
+The INI file will contain two sections - users and roles.  Here’s an example for what might
+be in security.ini:
+
+```
+[users]
+sally = apple, admin
+jim = 123456, accounting
+becky = letmein, webapp
+larry = 654321,accounting
+steve = password
+
+[roles]
+admin = *
+accounting = thrift.AuroraAdmin:setQuota
+webapp = thrift.AuroraSchedulerManager:*:webapp
+```
+
+The users section defines user user credentials and the role(s) they are members of.  These lines
+are of the format `<user> = <password>[, <role>...]`.  As you probably noticed, the passwords are
+in plaintext and as a result read access to this file should be restricted.
+
+In this configuration, each user has different privileges for actions in the cluster because
+of the roles they are a part of:
+
+* admin is granted all privileges
+* accounting may adjust the amount of resource quota for any role
+* webapp represents a collection of jobs that represents a service, and its members may create and modify any jobs owned by it
+
+### Caveats
+You might find documentation on the Internet suggesting there are additional sections in `shiro.ini`,
+like `[main]` and `[urls]`. These are not supported by Aurora as it uses a different mechanism to configure
+those parts of Shiro. Think of Aurora's `security.ini` as a subset with only `[users]` and `[roles]` sections.
+
+## Implementing Delegated Authorization
+
+It is possible to leverage Shiro's `runAs` feature by implementing a custom Servlet Filter that provides
+the capability and passing it's fully qualified class name to the command line argument
+`-shiro_after_auth_filter`. The filter is registered in the same filter chain as the Shiro auth filters
+and is placed after the Shiro auth filters in the filter chain. This ensures that the Filter is invoked
+after the Shiro filters have had a chance to authenticate the request.
+
+# Implementing a Custom Realm
+
+Since Aurora’s security is backed by [Apache Shiro](https://shiro.apache.org), you can implement a
+custom [Realm](http://shiro.apache.org/realm.html) to define organization-specific security behavior.
+
+In addition to using Shiro's standard APIs to implement a Realm you can link against Aurora to
+access the type-safe Permissions Aurora uses. See the Javadoc for `org.apache.aurora.scheduler.spi`
+for more information.
+
+## Packaging a realm module
+Package your custom Realm(s) with a Guice module that exposes a `Set<Realm>` multibinding.
+
+```java
+package com.example;
+
+import com.google.inject.AbstractModule;
+import com.google.inject.multibindings.Multibinder;
+import org.apache.shiro.realm.Realm;
+
+public class MyRealmModule extends AbstractModule {
+  @Override
+  public void configure() {
+    Realm myRealm = new MyRealm();
+
+    Multibinder.newSetBinder(binder(), Realm.class).addBinding().toInstance(myRealm);
+  }
+
+  static class MyRealm implements Realm {
+    // Realm implementation.
+  }
+}
+```
+
+To use your module in the scheduler, include it as a realm module based on its fully-qualified
+class name:
+
+```
+-shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ,com.example.MyRealmModule
+```
+
+
+# Announcer Authentication
+The Thermos executor can be configured to authenticate with ZooKeeper and include
+an [ACL](https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#sc_ZooKeeperAccessControl)
+on the nodes it creates, which will specify
+the privileges of clients to perform different actions on these nodes.  This
+feature is enabled by specifying an ACL configuration file to the executor with the
+`--announcer-zookeeper-auth-config` command line argument.
+
+When this feature is _not_ enabled, nodes created by the executor will have 'world/all' permission
+(`ZOO_OPEN_ACL_UNSAFE`).  In most production environments, operators should specify an ACL and
+limit access.
+
+## ZooKeeper Authentication Configuration
+The configuration file must be formatted as JSON with the following schema:
+
+```json
+{
+  "auth": [
+    {
+      "scheme": "<scheme>",
+      "credential": "<plain_credential>"
+    }
+  ],
+  "acl": [
+    {
+      "scheme": "<scheme>",
+      "credential": "<plain_credential>",
+      "permissions": {
+        "read": <bool>,
+        "write": <bool>,
+        "create": <bool>,
+        "delete": <bool>,
+        "admin": <bool>
+      }
+    }
+  ]
+}
+```
+
+The `scheme`
+defines the encoding of the credential field.  Note that these fields are passed directly to
+ZooKeeper (except in the case of _digest_ scheme, where the executor will hash and encode
+the credential appropriately before passing it to ZooKeeper). In addition to `acl`, a list of
+authentication credentials must be provided in `auth` to use for the connection.
+
+All properties of the `permissions` object will default to False if not provided.
+
+## Executor settings
+To enable the executor to authenticate against ZK, `--announcer-zookeeper-auth-config` should be
+set to the configuration file.
+
+
+# Scheduler HTTPS
+
+The Aurora scheduler does not provide native HTTPS support ([AURORA-343](https://issues.apache.org/jira/browse/AURORA-343)).
+It is therefore recommended to deploy it behind an HTTPS capable reverse proxy such as nginx or Apache2.
+
+A simple setup is to launch both the reverse proxy and the Aurora scheduler on the same port, but
+bind the reverse proxy to the public IP of the host and the scheduler to localhost:
+
+    -ip=127.0.0.1
+    -http_port=8081
+
+If your clients connect to the scheduler via [`proxy_url`](../../reference/scheduler-configuration/),
+you can update it to `https`. If you use the ZooKeeper based discovery instead, the scheduler
+needs to be launched via
+
+    -serverset_endpoint_name=https
+
+in order to announce its HTTPS support within ZooKeeper.
+
+
+# Known Issues
+
+While the APIs and SPIs we ship with are stable as of 0.8.0, we are aware of several incremental
+improvements. Please follow, vote, or send patches.
+
+Relevant tickets:
+* [AURORA-1248](https://issues.apache.org/jira/browse/AURORA-1248): Client retries 4xx errors
+* [AURORA-1279](https://issues.apache.org/jira/browse/AURORA-1279): Remove kerberos-specific build targets
+* [AURORA-1293](https://issues.apache.org/jira/browse/AURORA-1291): Consider defining a JSON format in place of INI
+* [AURORA-1179](https://issues.apache.org/jira/browse/AURORA-1179): Supported hashed passwords in security.ini
+* [AURORA-1295](https://issues.apache.org/jira/browse/AURORA-1295): Support security for the ReadOnlyScheduler service

Added: aurora/site/source/documentation/0.18.0/operations/storage.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/operations/storage.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/operations/storage.md (added)
+++ aurora/site/source/documentation/0.18.0/operations/storage.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,96 @@
+# Aurora Scheduler Storage
+
+- [Overview](#overview)
+- [Storage Semantics](#storage-semantics)
+  - [Reads, writes, modifications](#reads-writes-modifications)
+    - [Read lifecycle](#read-lifecycle)
+    - [Write lifecycle](#write-lifecycle)
+  - [Atomicity, consistency and isolation](#atomicity-consistency-and-isolation)
+  - [Population on restart](#population-on-restart)
+
+
+## Overview
+
+Aurora scheduler maintains data that need to be persisted to survive failovers and restarts.
+For example:
+
+* Task configurations and scheduled task instances
+* Job update configurations and update progress
+* Production resource quotas
+* Mesos resource offer host attributes
+
+Aurora solves its persistence needs by leveraging the
+[Mesos implementation of a Paxos replicated log](http://mesos.apache.org/documentation/latest/replicated-log-internals/)
+[[1]](https://ramcloud.stanford.edu/~ongaro/userstudy/paxos.pdf)
+[[2]](http://en.wikipedia.org/wiki/State_machine_replication) with a key-value
+[LevelDB](https://github.com/google/leveldb) storage as persistence media.
+
+Conceptually, it can be represented by the following major components:
+
+* Volatile storage: in-memory cache of all available data. Implemented via in-memory
+[H2 Database](http://www.h2database.com/html/main.html) and accessed via
+[MyBatis](http://mybatis.github.io/mybatis-3/).
+* Log manager: interface between Aurora storage and Mesos replicated log. The default schema format
+is [thrift](https://github.com/apache/thrift). Data is stored in serialized binary form.
+* Snapshot manager: all data is periodically persisted in Mesos replicated log in a single snapshot.
+This helps establishing periodic recovery checkpoints and speeds up volatile storage recovery on
+restart.
+* Backup manager: as a precaution, snapshots are periodically written out into backup files.
+This solves a [disaster recovery problem](../backup-restore/)
+in case of a complete loss or corruption of Mesos log files.
+
+![Storage hierarchy](../images/storage_hierarchy.png)
+
+
+## Storage Semantics
+
+Implementation details of the Aurora storage system. Understanding those can sometimes be useful
+when investigating performance issues.
+
+### Reads, writes, modifications
+
+All services in Aurora access data via a set of predefined store interfaces (aka stores) logically
+grouped by the type of data they serve. Every interface defines a specific set of operations allowed
+on the data thus abstracting out the storage access and the actual persistence implementation. The
+latter is especially important in view of a general immutability of persisted data. With the Mesos
+replicated log as the underlying persistence solution, data can be read and written easily but not
+modified. All modifications are simulated by saving new versions of modified objects. This feature
+and general performance considerations justify the existence of the volatile in-memory store.
+
+#### Read lifecycle
+
+There are two types of reads available in Aurora: consistent and weakly-consistent. The difference
+is explained [below](#atomicity-consistency-and-isolation).
+
+All reads are served from the volatile storage making reads generally cheap storage operations
+from the performance standpoint. The majority of the volatile stores are represented by the
+in-memory H2 database. This allows for rich schema definitions, queries and relationships that
+key-value storage is unable to match.
+
+#### Write lifecycle
+
+Writes are more involved operations since in addition to updating the volatile store data has to be
+appended to the replicated log. Data is not available for reads until fully ack-ed by both
+replicated log and volatile storage.
+
+### Atomicity, consistency and isolation
+
+Aurora uses [write-ahead logging](http://en.wikipedia.org/wiki/Write-ahead_logging) to ensure
+consistency between replicated and volatile storage. In Aurora, data is first written into the
+replicated log and only then updated in the volatile store.
+
+Aurora storage uses read-write locks to serialize data mutations and provide consistent view of the
+available data. The available `Storage` interface exposes 3 major types of operations:
+* `consistentRead` - access is locked using reader's lock and provides consistent view on read
+* `weaklyConsistentRead` - access is lock-less. Delivers best contention performance but may result
+in stale reads
+* `write` - access is fully serialized by using writer's lock. Operation success requires both
+volatile and replicated writes to succeed.
+
+The consistency of the volatile store is enforced via H2 transactional isolation.
+
+### Population on restart
+
+Any time a scheduler restarts, it restores its volatile state from the most recent position recorded
+in the replicated log by restoring the snapshot and replaying individual log entries on top to fully
+recover the state up to the last write.

Added: aurora/site/source/documentation/0.18.0/operations/troubleshooting.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/operations/troubleshooting.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/operations/troubleshooting.md (added)
+++ aurora/site/source/documentation/0.18.0/operations/troubleshooting.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,106 @@
+# Troubleshooting
+
+So you've started your first cluster and are running into some issues? We've collected some common
+stumbling blocks and solutions here to help get you moving.
+
+## Replicated log not initialized
+
+### Symptoms
+- Scheduler RPCs and web interface claim `Storage is not READY`
+- Scheduler log repeatedly prints messages like
+
+  ```
+  I1016 16:12:27.234133 26081 replica.cpp:638] Replica in EMPTY status
+  received a broadcasted recover request
+  I1016 16:12:27.234256 26084 recover.cpp:188] Received a recover response
+  from a replica in EMPTY status
+  ```
+
+### Solution
+When you create a new cluster, you need to inform a quorum of schedulers that they are safe to
+consider their database to be empty by [initializing](../installation/#finalizing) the
+replicated log. This is done to prevent the scheduler from modifying the cluster state in the event
+of multiple simultaneous disk failures or, more likely, misconfiguration of the replicated log path.
+
+
+## No distinct leader elected
+
+### Symptoms
+Either no scheduler or multiple scheduler believe to be leading.
+
+### Solution
+Verify the [network configuration](../configuration/#network-configuration) of the Aurora
+scheduler is correct:
+
+* The `LIBPROCESS_IP:LIBPROCESS_PORT` endpoints must be reachable from all coordinator nodes running
+  a scheduler or a Mesos master.
+* Hostname lookups have to resolve to public ips rather than local ones that cannot be reached
+  from another node.
+
+In addition, double-check the [quota settings](../configuration/#replicated-log-configuration) of the
+replicated log.
+
+
+## Scheduler not registered
+
+### Symptoms
+Scheduler log contains
+
+    Framework has not been registered within the tolerated delay.
+
+### Solution
+Double-check that the scheduler is configured correctly to reach the Mesos master. If you are registering
+the master in ZooKeeper, make sure command line argument to the master:
+
+    --zk=zk://$ZK_HOST:2181/mesos/master
+
+is the same as the one on the scheduler:
+
+    -mesos_master_address=zk://$ZK_HOST:2181/mesos/master
+
+
+## Scheduler not running
+
+### Symptoms
+The scheduler process commits suicide regularly. This happens under error conditions, but
+also on purpose in regular intervals.
+
+### Solution
+Aurora is meant to be run under supervision. You have to configure a supervisor like
+[Monit](http://mmonit.com/monit/), [supervisord](http://supervisord.org/), or systemd to run the
+scheduler and restart it whenever it fails or exists on purpose.
+
+Aurora supports an active health checking protocol on its admin HTTP interface - if a `GET /health`
+times out or returns anything other than `200 OK` the scheduler process is unhealthy and should be
+restarted.
+
+For example, monit can be configured with
+
+    if failed port 8081 send "GET /health HTTP/1.0\r\n" expect "OK\n" with timeout 2 seconds for 10 cycles then restart
+
+assuming you set `-http_port=8081`.
+
+
+## Executor crashing or hanging
+
+### Symptoms
+Launched task instances never transition to `STARTING` or `RUNNING` but immediately transition
+to `FAILED` or `LOST`.
+
+### Solution
+The executor might be failing due to unknown internal errors such as a missing native dependency
+of the Mesos executor library. Open the Mesos UI and navigate to the failing
+task in question. Inspect the various log files in order to learn about what is going on.
+
+
+## Observer does not discover tasks
+
+### Symptoms
+The observer UI does not list any tasks. When navigating from the scheduler UI to the state of
+a particular task instance the observer returns `Error: 404 Not Found`.
+
+### Solution
+The observer is refreshing its internal state every couple of seconds. If waiting a few seconds
+does not resolve the issue, check that the `--mesos-root` setting of the observer and the
+`--work_dir` option of the Mesos agent are in sync. For details, see our
+[Install instructions](../installation/#worker-configuration).

Added: aurora/site/source/documentation/0.18.0/operations/upgrades.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/operations/upgrades.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/operations/upgrades.md (added)
+++ aurora/site/source/documentation/0.18.0/operations/upgrades.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,41 @@
+# Upgrading Aurora
+
+Aurora can be updated from one version to the next without any downtime or restarts of running
+jobs. The same holds true for Mesos.
+
+Generally speaking, Mesos and Aurora strive for a +1/-1 version compatibility, i.e. all components
+are meant to be forward and backwards compatible for at least one version. This implies it
+does not really matter in which order updates are carried out.
+
+Exceptions to this rule are documented in the [Aurora release-notes](../../../RELEASE-NOTES/)
+and the [Mesos upgrade instructions](https://mesos.apache.org/documentation/latest/upgrades/).
+
+
+## Instructions
+
+To upgrade Aurora, follow these steps:
+
+1. Update the first scheduler instance by updating its software and restarting its process.
+2. Wait until the scheduler is up and its [Replicated Log](../configuration/#replicated-log-configuration)
+   caught up with the other schedulers in the cluster. The log has caught up if `log/recovered` has
+   the value `1`. You can check the metric via `curl LIBPROCESS_IP:LIBPROCESS_PORT/metrics/snapshot`,
+   where ip and port refer to the [libmesos configuration](../configuration/#network-configuration)
+   settings of the scheduler instance.
+3. Proceed with the next scheduler until all instances are updated.
+4. Update the Aurora executor deployed to the compute nodes of your cluster. Jobs will continue
+   running with the old version of the executor, and will only be launched by the new one once
+   they are restarted eventually due to natural cluster churn.
+5. Distribute the new Aurora client to your users.
+
+
+## Best Practices
+
+Even though not absolutely mandatory, we advice to adhere to the following rules:
+
+* Never skip any major or minor releases when updating. If you have to catch up several releases you
+  have to deploy all intermediary versions. Skipping bugfix releases is acceptable though.
+* Verify all updates on a test cluster before touching your production deployments.
+* To minimize the number of failovers during updates, update the currently leading scheduler
+  instance last.
+* Update the Aurora executor on a subset of compute nodes as a canary before deploying the change to
+  the whole fleet.

Added: aurora/site/source/documentation/0.18.0/reference/client-cluster-configuration.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/reference/client-cluster-configuration.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/reference/client-cluster-configuration.md (added)
+++ aurora/site/source/documentation/0.18.0/reference/client-cluster-configuration.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,99 @@
+# Client Cluster Configuration
+
+A cluster configuration file is used by the Aurora client to describe the Aurora clusters with
+which it can communicate. Ultimately this allows client users to reference clusters with short names
+like us-east and eu.
+
+A cluster configuration is formatted as JSON.  The simplest cluster configuration is one that
+communicates with a single (non-leader-elected) scheduler.  For example:
+
+    [{
+      "name": "example",
+      "scheduler_uri": "http://localhost:55555",
+    }]
+
+
+A configuration for a leader-elected scheduler would contain something like:
+
+    [{
+      "name": "example",
+      "zk": "192.168.33.7",
+      "scheduler_zk_path": "/aurora/scheduler"
+    }]
+
+
+The following properties may be set:
+
+  **Property**             | **Type** | **Description**
+  :------------------------| :------- | :--------------
+   **name**                | String   | Cluster name (Required)
+   **slave_root**          | String   | Path to Mesos agent work dir (Required)
+   **slave_run_directory** | String   | Name of Mesos agent run dir (Required)
+   **zk**                  | String   | Hostname of ZooKeeper instance used to resolve Aurora schedulers.
+   **zk_port**             | Integer  | Port of ZooKeeper instance used to locate Aurora schedulers (Default: 2181)
+   **scheduler_zk_path**   | String   | ZooKeeper path under which scheduler instances are registered.
+   **scheduler_uri**       | String   | URI of Aurora scheduler instance.
+   **proxy_url**           | String   | Used by the client to format URLs for display.
+   **auth_mechanism**      | String   | The authentication mechanism to use when communicating with the scheduler. (Default: UNAUTHENTICATED)
+   **docker_registry**     | String   | Used by the client to resolve docker tags.
+
+
+## Details
+
+### `name`
+
+The name of the Aurora cluster represented by this entry. This name will be the `cluster` portion of
+any job keys identifying jobs running within the cluster.
+
+### `slave_root`
+
+The path on the Mesos agents where executing tasks can be found. It is used in combination with the
+`slave_run_directory` property by `aurora task run` and `aurora task ssh` to change into the sandbox
+directory after connecting to the host. This value should match the value passed to `mesos-slave`
+as `-work_dir`.
+
+### `slave_run_directory`
+
+The name of the directory where the task run can be found. This is used in combination with the
+`slave_root` property by `aurora task run` and `aurora task ssh` to change into the sandbox
+directory after connecting to the host. This should almost always be set to `latest`.
+
+### `zk`
+
+The hostname of the ZooKeeper instance used to resolve the Aurora scheduler. Aurora uses ZooKeeper
+to elect a leader. The client will connect to this ZooKeeper instance to determine the current
+leader. This host should match the host passed to the scheduler as `-zk_endpoints`.
+
+### `zk_port`
+
+The port on which the ZooKeeper instance is running. If not set this will default to the standard
+ZooKeeper port of 2181. This port should match the port in the host passed to the scheduler as
+`-zk_endpoints`.
+
+### `scheduler_zk_path`
+
+The path on the ZooKeeper instance under which the Aurora serverset is registered. This value should
+match the value passed to the scheduler as `-serverset_path`.
+
+### `scheduler_uri`
+
+The URI of the scheduler. This would be used in place of the ZooKeeper related configuration above
+in circumstances where direct communication with a single scheduler is needed (e.g. testing
+environments). It is strongly advised to **never** use this property for production deploys.
+
+### `proxy_url`
+
+Instead of using the hostname of the leading scheduler as the base url, if `proxy_url` is set, its
+value will be used instead. In that scenario the value for `proxy_url` would be, for example, the
+URL of your VIP in a loadbalancer or a roundrobin DNS name.
+
+### `auth_mechanism`
+
+The identifier of an authentication mechanism that the client should use when communicating with the
+scheduler. Support for values other than `UNAUTHENTICATED` requires a matching scheduler-side
+[security configuration](../../operations/security/).
+
+### `docker_registry`
+
+The URI of the Docker Registry that will be used by the Aurora client to resolve docker tags to concrete
+image ids, when using the docker binding helper, like `{{docker.image[name][tag]}}`.

Added: aurora/site/source/documentation/0.18.0/reference/client-commands.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/reference/client-commands.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/reference/client-commands.md (added)
+++ aurora/site/source/documentation/0.18.0/reference/client-commands.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,339 @@
+Aurora Client Commands
+======================
+
+- [Introduction](#introduction)
+- [Cluster Configuration](#cluster-configuration)
+- [Job Keys](#job-keys)
+- [Modifying Aurora Client Commands](#modifying-aurora-client-commands)
+- [Regular Jobs](#regular-jobs)
+    - [Creating and Running a Job](#creating-and-running-a-job)
+    - [Running a Command On a Running Job](#running-a-command-on-a-running-job)
+    - [Killing a Job](#killing-a-job)
+    - [Adding Instances](#adding-instances)
+    - [Updating a Job](#updating-a-job)
+        - [Coordinated job updates](#coordinated-job-updates)
+    - [Renaming a Job](#renaming-a-job)
+    - [Restarting Jobs](#restarting-jobs)
+- [Cron Jobs](#cron-jobs)
+- [Comparing Jobs](#comparing-jobs)
+- [Viewing/Examining Jobs](#viewingexamining-jobs)
+    - [Listing Jobs](#listing-jobs)
+    - [Inspecting a Job](#inspecting-a-job)
+    - [Versions](#versions)
+    - [Checking Your Quota](#checking-your-quota)
+    - [Finding a Job on Web UI](#finding-a-job-on-web-ui)
+    - [Getting Job Status](#getting-job-status)
+    - [Opening the Web UI](#opening-the-web-ui)
+    - [SSHing to a Specific Task Machine](#sshing-to-a-specific-task-machine)
+    - [SCPing with Specific Task Machines](#scping-with-specific-task-machines)
+    - [Templating Command Arguments](#templating-command-arguments)
+
+Introduction
+------------
+
+Once you have written an `.aurora` configuration file that describes
+your Job and its parameters and functionality, you interact with Aurora
+using Aurora Client commands. This document describes all of these commands
+and how and when to use them. All Aurora Client commands start with
+`aurora`, followed by the name of the specific command and its
+arguments.
+
+*Job keys* are a very common argument to Aurora commands, as well as the
+gateway to useful information about a Job. Before using Aurora, you
+should read the next section which describes them in detail. The section
+after that briefly describes how you can modify the behavior of certain
+Aurora Client commands, linking to a detailed document about how to do
+that.
+
+This is followed by the Regular Jobs section, which describes the basic
+Client commands for creating, running, and manipulating Aurora Jobs.
+After that are sections on Comparing Jobs and Viewing/Examining Jobs. In
+other words, various commands for getting information and metadata about
+Aurora Jobs.
+
+Cluster Configuration
+---------------------
+
+The client must be able to find a configuration file that specifies available clusters. This file
+declares shorthand names for clusters, which are in turn referenced by job configuration files
+and client commands.
+
+The client will load at most two configuration files, making both of their defined clusters
+available. The first is intended to be a system-installed cluster, using the path specified in
+the environment variable `AURORA_CONFIG_ROOT`, defaulting to `/etc/aurora/clusters.json` if the
+environment variable is not set. The second is a user-installed file, located at
+`~/.aurora/clusters.json`.
+
+For more details on cluster configuration see the
+[Client Cluster Configuration](../client-cluster-configuration/) documentation.
+
+Job Keys
+--------
+
+A job key is a unique system-wide identifier for an Aurora-managed
+Job, for example `cluster1/web-team/test/experiment204`. It is a 4-tuple
+consisting of, in order, *cluster*, *role*, *environment*, and
+*jobname*, separated by /s. Cluster is the name of an Aurora
+cluster. Role is the Unix service account under which the Job
+runs. Environment is a namespace component like `devel`, `test`,
+`prod`, or `stagingN.` Jobname is the Job's name.
+
+The combination of all four values uniquely specifies the Job. If any
+one value is different from that of another job key, the two job keys
+refer to different Jobs. For example, job key
+`cluster1/tyg/prod/workhorse` is different from
+`cluster1/tyg/prod/workcamel` is different from
+`cluster2/tyg/prod/workhorse` is different from
+`cluster2/foo/prod/workhorse` is different from
+`cluster1/tyg/test/workhorse.`
+
+Role names are user accounts existing on the agent machines. If you don't know what accounts
+are available, contact your sysadmin.
+
+Environment names are namespaces; you can count on `prod`, `devel` and `test` existing.
+
+Modifying Aurora Client Commands
+--------------------------------
+
+For certain Aurora Client commands, you can define hook methods that run
+either before or after an action that takes place during the command's
+execution, as well as based on whether the action finished successfully or failed
+during execution. Basically, a hook is code that lets you extend the
+command's actions. The hook executes on the client side, specifically on
+the machine executing Aurora commands.
+
+Hooks can be associated with these Aurora Client commands.
+
+  - `job create`
+  - `job kill`
+  - `job restart`
+
+The process for writing and activating them is complex enough
+that we explain it in a devoted document, [Hooks for Aurora Client API](../client-hooks/).
+
+Regular Jobs
+------------
+
+This section covers Aurora commands related to running, killing,
+renaming, updating, and restarting a basic Aurora Job.
+
+### Creating and Running a Job
+
+    aurora job create <job key> <configuration file>
+
+Creates and then runs a Job with the specified job key based on a `.aurora` configuration file.
+The configuration file may also contain and activate hook definitions.
+
+### Running a Command On a Running Job
+
+    aurora task run CLUSTER/ROLE/ENV/NAME[/INSTANCES] <cmd>
+
+Runs a shell command on all machines currently hosting shards of a
+single Job.
+
+`run` supports the same command line wildcards used to populate a Job's
+commands; i.e. anything in the `{{mesos.*}}` and `{{thermos.*}}`
+namespaces.
+
+### Killing a Job
+
+    aurora job killall CLUSTER/ROLE/ENV/NAME
+
+Kills all Tasks associated with the specified Job, blocking until all
+are terminated. Defaults to killing all instances in the Job.
+
+The `<configuration file>` argument for `kill` is optional. Use it only
+if it contains hook definitions and activations that affect the
+kill command.
+
+### Adding Instances
+
+    aurora job add CLUSTER/ROLE/ENV/NAME/INSTANCE <count>
+
+Adds `<count>` instances to the existing job. The configuration of the new instances is derived from
+an active job instance pointed by the `/INSTANCE` part of the job specification. This command is
+a simpler way to scale out an existing job when an instance with desired task configuration
+already exists. Use `aurora update start` to add instances with a new (updated) configuration.
+
+### Updating a Job
+
+You can manage job updates using the `aurora update` command.  Please see
+[the Job Update documentation](../../features/job-updates/) for more details.
+
+
+### Renaming a Job
+
+Renaming is a tricky operation as downstream clients must be informed of
+the new name. A conservative approach
+to renaming suitable for production services is:
+
+1.  Modify the Aurora configuration file to change the role,
+    environment, and/or name as appropriate to the standardized naming
+    scheme.
+2.  Check that only these naming components have changed
+    with `aurora diff`.
+
+        aurora job diff CLUSTER/ROLE/ENV/NAME <job_configuration>
+
+3.  Create the (identical) job at the new key. You may need to request a
+    temporary quota increase.
+
+        aurora job create CLUSTER/ROLE/ENV/NEW_NAME <job_configuration>
+
+4.  Migrate all clients over to the new job key. Update all links and
+    dashboards. Ensure that both job keys run identical versions of the
+    code while in this state.
+5.  After verifying that all clients have successfully moved over, kill
+    the old job.
+
+        aurora job killall CLUSTER/ROLE/ENV/NAME
+
+6.  If you received a temporary quota increase, be sure to let the
+    powers that be know you no longer need the additional capacity.
+
+### Restarting Jobs
+
+`restart` restarts all of a job key identified Job's shards:
+
+    aurora job restart CLUSTER/ROLE/ENV/NAME[/INSTANCES]
+
+Restarts are controlled on the client side, so aborting
+the `job restart` command halts the restart operation.
+
+**Note**: `job restart` only applies its command line arguments and does not
+use or is affected by `update.config`. Restarting
+does ***not*** involve a configuration change. To update the
+configuration, use `update.config`.
+
+The `--config` argument for restart is optional. Use it only
+if it contains hook definitions and activations that affect the
+`job restart` command.
+
+Cron Jobs
+---------
+
+You can manage cron jobs using the `aurora cron` command.  Please see
+[the Cron Jobs Feature](../../features/cron-jobs/) for more details.
+
+Comparing Jobs
+--------------
+
+    aurora job diff CLUSTER/ROLE/ENV/NAME <job configuration>
+
+Compares a job configuration against a running job. By default the diff
+is determined using `diff`, though you may choose an alternate
+ diff program by specifying the `DIFF_VIEWER` environment variable.
+
+Viewing/Examining Jobs
+----------------------
+
+Above we discussed creating, killing, and updating Jobs. Here we discuss
+how to view and examine Jobs.
+
+### Listing Jobs
+
+    aurora config list <job configuration>
+
+Lists all Jobs registered with the Aurora scheduler in the named cluster for the named role.
+
+### Inspecting a Job
+
+    aurora job inspect CLUSTER/ROLE/ENV/NAME <job configuration>
+
+`inspect` verifies that its specified job can be parsed from a
+configuration file, and displays the parsed configuration.
+
+### Checking Your Quota
+
+    aurora quota get CLUSTER/ROLE
+
+Prints the production quota allocated to the role's value at the given
+cluster. Only non-[dedicated](../../features/constraints/#dedicated-attribute)
+[production](../configuration/#job-objects) jobs consume quota.
+
+### Finding a Job on Web UI
+
+When you create a job, part of the output response contains a URL that goes
+to the job's scheduler UI page. For example:
+
+    vagrant@precise64:~$ aurora job create devcluster/www-data/prod/hello /vagrant/examples/jobs/hello_world.aurora
+    INFO] Creating job hello
+    INFO] Response from scheduler: OK (message: 1 new tasks pending for job www-data/prod/hello)
+    INFO] Job url: http://precise64:8081/scheduler/www-data/prod/hello
+
+You can go to the scheduler UI page for this job via `http://precise64:8081/scheduler/www-data/prod/hello`
+You can go to the overall scheduler UI page by going to the part of that URL that ends at `scheduler`; `http://precise64:8081/scheduler`
+
+Once you click through to a role page, you see Jobs arranged
+separately by pending jobs, active jobs and finished jobs.
+Jobs are arranged by role, typically a service account for
+production jobs and user accounts for test or development jobs.
+
+### Getting Job Status
+
+    aurora job status <job_key>
+
+Returns the status of recent tasks associated with the
+`job_key` specified Job in its supplied cluster. Typically this includes
+a mix of active tasks (running or assigned) and inactive tasks
+(successful, failed, and lost.)
+
+### Opening the Web UI
+
+Use the Job's web UI scheduler URL or the `aurora status` command to find out on which
+machines individual tasks are scheduled. You can open the web UI via the
+`open` command line command if invoked from your machine:
+
+    aurora job open [<cluster>[/<role>[/<env>/<job_name>]]]
+
+If only the cluster is specified, it goes directly to that cluster's
+scheduler main page. If the role is specified, it goes to the top-level
+role page. If the full job key is specified, it goes directly to the job
+page where you can inspect individual tasks.
+
+### SSHing to a Specific Task Machine
+
+    aurora task ssh <job_key> <shard number>
+
+You can have the Aurora client ssh directly to the machine that has been
+assigned a particular Job/shard number. This may be useful for quickly
+diagnosing issues such as performance issues or abnormal behavior on a
+particular machine.
+
+### SCPing with Specific Task Machines
+
+    aurora task scp [<cluster>/<role>/<env>/<job_name>/<instance_id>]:source [<cluster>/<role>/<env>/<job_name>/<instance_id>]:dest
+
+You can have the Aurora client copy file(s)/folder(s) to, from, and between
+individual tasks. The sandbox folder serves as the relative root and is the
+same folder you see when you browse `chroot` from the Scheduler task UI. You
+can also use absolute paths (like for `/tmp`), but tilde expansion is not
+supported. Currently, this command is only fully supported for Mesos
+containers. Users may use this to copy files from Docker containers but they
+cannot copy files to them.
+
+### Templating Command Arguments
+
+    aurora task run [-e] [-t THREADS] <job_key> -- <<command-line>>
+
+Given a job specification, run the supplied command on all hosts and
+return the output. You may use the standard Mustache templating rules:
+
+- `{{thermos.ports[name]}}` substitutes the specific named port of the
+  task assigned to this machine
+- `{{mesos.instance}}` substitutes the shard id of the job's task
+  assigned to this machine
+- `{{thermos.task_id}}` substitutes the task id of the job's task
+  assigned to this machine
+
+For example, the following type of pattern can be a powerful diagnostic
+tool:
+
+    aurora task run -t5 cluster1/tyg/devel/seizure -- \
+      'curl -s -m1 localhost:{{thermos.ports[http]}}/vars | grep uptime'
+
+By default, the command runs in the Task's sandbox. The `-e` option can
+run the command in the executor's sandbox. This is mostly useful for
+Aurora administrators.
+
+You can parallelize the runs by using the `-t` option.

Added: aurora/site/source/documentation/0.18.0/reference/client-hooks.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/reference/client-hooks.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/reference/client-hooks.md (added)
+++ aurora/site/source/documentation/0.18.0/reference/client-hooks.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,228 @@
+# Hooks for Aurora Client API
+
+You can execute hook methods around Aurora API Client methods when they are called by the Aurora Command Line commands.
+
+Explaining how hooks work is a bit tricky because of some indirection about what they apply to. Basically, a hook is code that executes when a particular Aurora Client API method runs, letting you extend the method's actions. The hook executes on the client side, specifically on the machine executing Aurora commands.
+
+The catch is that hooks are associated with Aurora Client API methods, which users don't directly call. Instead, users call Aurora Command Line commands, which call Client API methods during their execution. Since which hooks run depend on which Client API methods get called, you will need to know which Command Line commands call which API methods. Later on, there is a table showing the various associations.
+
+**Terminology Note**: From now on, "method(s)" refer to Client API methods, and "command(s)" refer to Command Line commands.
+
+- [Hook Types](#hook-types)
+- [Execution Order](#execution-order)
+- [Hookable Methods](#hookable-methods)
+- [Activating and Using Hooks](#activating-and-using-hooks)
+- [.aurora Config File Settings](#aurora-config-file-settings)
+- [Command Line](#command-line)
+- [Hooks Protocol](#hooks-protocol)
+  - [pre_ Methods](#pre_-methods)
+  - [err_ Methods](#err_-methods)
+  - [post_ Methods](#post_-methods)
+- [Generic Hooks](#generic-hooks)
+- [Hooks Process Checklist](#hooks-process-checklist)
+
+
+## Hook Types
+
+Hooks have three basic types, differing by when they run with respect to their associated method.
+
+`pre_<method_name>`: When its associated method is called, the `pre_` hook executes first, then the called method. If the `pre_` hook fails, the method never runs. Later code that expected the method to succeed may be affected by this, and result in terminating the Aurora client.
+
+Note that a `pre_` hook can error-trap internally so it does not
+return `False`. Designers/contributors of new `pre_` hooks should
+consider whether or not to error-trap them. You can error trap at the
+highest level very generally and always pass the `pre_` hook by
+returning `True`. For example:
+
+    def pre_create(...):
+      do_something()  # if do_something fails with an exception, the create_job is not attempted!
+      return True
+
+    # However...
+    def pre_create(...):
+      try:
+        do_something()  # may cause exception
+      except Exception:  # generic error trap will catch it
+        pass  # and ignore the exception
+      return True  # create_job will run in any case!
+
+`post_<method_name>`: A `post_` hook executes after its associated method successfully finishes running. If it fails, the already executed method is unaffected. A `post_` hook's error is trapped, and any later operations are unaffected.
+
+`err_<method_name>`: Executes only when its associated method returns a status other than OK or throws an exception. If an `err_` hook fails, the already executed method is unaffected. An `err_` hook's error is trapped, and any later operations are unaffected.
+
+## Execution Order
+
+A command with `pre_`, `post_`, and `err_` hooks defined and activated for its called method executes in the following order when the method successfully executes:
+
+1. Command called
+2. Command code executes
+3. Method Called
+4. `pre_` method hook runs
+5. Method runs and successfully finishes
+6. `post_` method hook runs
+7. Command code executes
+8. Command execution ends
+
+The following is what happens when, for the same command and hooks, the method associated with the command suffers an error and does not successfully finish executing:
+
+1. Command called
+2. Command code executes
+3. Method Called
+4. `pre_` method hook runs
+5. Method runs and fails
+6. `err_` method hook runs
+7. Command Code executes (if `err_` method does not end the command execution)
+8. Command execution ends
+
+Note that the `post_` and `err_` hooks for the same method can never both run for a single execution of that method.
+
+## Hookable Methods
+
+You can associate `pre_`, `post_`, and `err_` hooks with the following methods. Since you do not directly interact with the methods, but rather the Aurora Command Line commands that call them, for each method we also list the command(s) that can call the method. Note that a different method or methods may be called by a command depending on how the command's other code executes. Similarly, multiple commands can call the same method. We also list the methods' argument signatures, which are used by their associated hooks. <a name="Chart"></a>
+
+  Aurora Client API Method | Client API Method Argument Signature | Aurora Command Line Command
+  -------------------------| ------------------------------------- | ---------------------------
+  ```create_job``` | ```self```, ```config``` | ```job create```, <code>runtask
+  ```restart``` | ```self```, ```job_key```, ```shards```, ```update_config```, ```health_check_interval_seconds``` | ```job restart```
+  ```kill_job``` | ```self```, ```job_key```, ```shards=None``` |  ```job kill```
+  ```start_cronjob``` | ```self```, ```job_key``` | ```cron start```
+  ```start_job_update``` | ```self```, ```config```, ```instances=None``` | ```update start```
+
+Some specific examples:
+
+* `pre_create_job` executes when a `create_job` method is called, and before the `create_job` method itself executes.
+
+* `post_cancel_update` executes after a `cancel_update` method has successfully finished running.
+
+* `err_kill_job` executes when the `kill_job` method is called, but doesn't successfully finish running.
+
+## Activating and Using Hooks
+
+By default, hooks are inactive. If you do not want to use hooks, you do not need to make any changes to your code. If you do want to use hooks, you will need to alter your `.aurora` config file to activate them both for the configuration as a whole as well as for individual `Job`s. And, of course, you will need to define in your config file what happens when a particular hook executes.
+
+## .aurora Config File Settings
+
+You can define a top-level `hooks` variable in any `.aurora` config file. `hooks` is a list of all objects that define hooks used by `Job`s defined in that config file. If you do not want to define any hooks for a configuration, `hooks` is optional.
+
+    hooks = [Object_with_defined_hooks1, Object_with_defined_hooks2]
+
+Be careful when assembling a config file using `include` on multiple smaller config files. If there are multiple files that assign a value to `hooks`, only the last assignment made will stick. For example, if `x.aurora` has `hooks = [a, b, c]` and `y.aurora` has `hooks = [d, e, f]` and `z.aurora` has, in this order, `include x.aurora` and `include y.aurora`, the `hooks` value will be `[d, e, f]`.
+
+Also, for any `Job` that you want to use hooks with, its `Job` definition in the `.aurora` config file must set an `enable_hooks` flag to `True` (it defaults to `False`). By default, hooks are disabled and you must enable them for `Job`s of your choice.
+
+To summarize, to use hooks for a particular job, you must both activate hooks for your config file as a whole, and for that job. Activating hooks only for individual jobs won't work, nor will only activating hooks for your config file as a whole. You must also specify the hooks' defining object in the `hooks` variable.
+
+Recall that `.aurora` config files are written in Pystachio. So the following turns on hooks for production jobs at cluster1 and cluster2, but leaves them off for similar jobs with a defined user role. Of course, you also need to list the objects that define the hooks in your config file's `hooks` variable.
+
+    jobs = [
+            Job(enable_hooks = True, cluster = c, env = 'prod') for c in ('cluster1', 'cluster2')
+           ]
+    jobs.extend(
+       Job(cluster = c, env = 'prod', role = getpass.getuser()) for c in ('cluster1', 'cluster2'))
+       # Hooks disabled for these jobs
+
+## Command Line
+
+All Aurora Command Line commands now accept an `.aurora` config file as an optional parameter (some, of course, accept it as a required parameter). Whenever a command has a `.aurora` file parameter, any hooks specified and activated in the `.aurora` file can be used. For example:
+
+    aurora job restart cluster1/role/env/app myapp.aurora
+
+The command activates any hooks specified and activated in `myapp.aurora`. For the `restart` command, that is the only thing the `myapp.aurora` parameter does. So, if the command was the following, since there is no `.aurora` config file to specify any hooks, no hooks on the `restart` command can run.
+
+    aurora job restart cluster1/role/env/app
+
+## Hooks Protocol
+
+Any object defined in the `.aurora` config file can define hook methods. You should define your hook methods within a class, and then use the class name as a value in the `hooks` list in your config file.
+
+Note that you can define other methods in the class that its hook methods can call; all the logic of a hook does not have to be in its definition.
+
+The following example defines a class containing a `pre_kill_job` hook definition that calls another method defined in the class.
+
+    # Defines a method pre_kill_job
+    class KillConfirmer(object):
+      def confirm(self, msg):
+        return raw_input(msg).lower() == 'yes'
+
+      def pre_kill_job(self, job_key, shards=None):
+        shards = ('shards %s' % shards) if shards is not None else 'all shards'
+        return self.confirm('Are you sure you want to kill %s (%s)? (yes/no): '
+                            % (job_key, shards))
+
+### pre_ Methods
+
+`pre_` methods have the signature:
+
+    pre_<API method name>(self, <associated method's signature>)
+
+`pre_` methods have the same signature as their associated method, with the addition of `self` as the first parameter. See the [chart](#Chart) above for the mapping of parameters to methods. When writing `pre_` methods, you can use the `*` and `**` syntax to designate that all unspecified parameters are passed in a list to the `*`ed variable and all named parameters with values are passed as name/value pairs to the `**`ed variable.
+
+If this method returns False, the API command call aborts.
+
+### err_ Methods
+
+`err_` methods have the signature:
+
+    err_<API method name>(self, exc, <associated method's signature>)
+
+`err_` methods have the same signature as their associated method, with the addition of a first parameter `self` and a second parameter `exc`. `exc` is either a result with responseCode other than `ResponseCode.OK` or an `Exception`. See the [chart](#Chart) above for the mapping of parameters to methods. When writing `err`_ methods, you can use the `*` and `**` syntax to designate that all unspecified parameters are passed in a list to the `*`ed variable and all named parameters with values are passed as name/value pairs to the `**`ed variable.
+
+`err_` method return codes are ignored.
+
+### post_ Methods
+
+`post_` methods have the signature:
+
+    post_<API method name>(self, result, <associated method signature>)
+
+`post_` method parameters are `self`, then `result`, followed by the same parameter signature as their associated method. `result` is the result of the associated method call. See the [chart](#chart) above for the mapping of parameters to methods. When writing `post_` methods, you can use the `*` and `**` syntax to designate that all unspecified arguments are passed in a list to the `*`ed parameter and all unspecified named arguments with values are passed as name/value pairs to the `**`ed parameter.
+
+`post_` method return codes are ignored.
+
+## Generic Hooks
+
+There are seven Aurora API Methods which any of the three hook types can attach to. Thus, there are 21 possible hook/method combinations for a single `.aurora` config file. Say that you define `pre_` and `post_` hooks for the `restart` method. That leaves 19 undefined hook/method combinations; `err_restart` and the 3 `pre_`, `post_`, and `err_` hooks for each of the other 6 hookable methods. You can define what happens when any of these otherwise undefined 19 hooks execute via a generic hook, whose signature is:
+
+    generic_hook(self, hook_config, event, method_name, result_or_err, args*, kw**)
+
+where:
+
+* `hook_config` is a named tuple of `config` (the Pystashio `config` object) and `job_key`.
+
+* `event` is one of `pre`, `err`, or `post`, indicating which type of hook the genetic hook is standing in for. For example, assume no specific hooks were defined for the `restart` API command. If `generic_hook` is defined and activated, and `restart` is called, `generic_hook` will effectively run as `pre_restart`, `post_restart`, and `err_restart`. You can use a selection statement on this value so that `generic_hook` will act differently based on whether it is standing in for a `pre_`, `post_`, or `err_` hook.
+
+* `method_name` is the Client API method name whose execution is causing this execution of the `generic_hook`.
+
+* `args*`, `kw**` are the API method arguments and keyword arguments respectively.
+* `result_or_err` is a tri-state parameter taking one of these three values:
+  1. None for `pre_`hooks
+  2. `result` for `post_` nooks
+  3. `exc` for `err_` hooks
+
+Example:
+
+    # Overrides the standard do-nothing generic_hook by adding a log writing operation.
+    from twitter.common import log
+      class Logger(object):
+        '''Adds to the log every time a hookable API method is called'''
+        def generic_hook(self, hook_config, event, method_name, result_or_err, *args, **kw)
+          log.info('%s: %s_%s of %s'
+                   % (self.__class__.__name__, event, method_name, hook_config.job_key))
+
+## Hooks Process Checklist
+
+1. In your `.aurora` config file, add a `hooks` variable. Note that you may want to define a `.aurora` file only for hook definitions and then include this file in multiple other config files that you want to use the same hooks.
+
+    hooks = []
+
+2. In the `hooks` variable, list all objects that define hooks used by `Job`s defined in this config:
+
+    hooks = [Object_hook_definer1, Object_hook_definer2]
+
+3. For each job that uses hooks in this config file, add `enable_hooks = True` to the `Job` definition. Note that this is necessary even if you only want to use the generic hook.
+
+4. Write your `pre_`, `post_`, and `err_` hook definitions as part of an object definition in your `.aurora` config file.
+
+5. If desired, write your `generic_hook` definition as part of an object definition in your `.aurora` config file. Remember, the object must be listed as a member of `hooks`.
+
+6. If your Aurora command line command does not otherwise take an `.aurora` config file argument, add the appropriate `.aurora` file as an argument in order to define and activate the configuration's hooks.

Added: aurora/site/source/documentation/0.18.0/reference/configuration-best-practices.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.18.0/reference/configuration-best-practices.md?rev=1799392&view=auto
==============================================================================
--- aurora/site/source/documentation/0.18.0/reference/configuration-best-practices.md (added)
+++ aurora/site/source/documentation/0.18.0/reference/configuration-best-practices.md Wed Jun 21 06:36:21 2017
@@ -0,0 +1,187 @@
+Aurora Configuration Best Practices
+===================================
+
+Use As Few .aurora Files As Possible
+------------------------------------
+
+When creating your `.aurora` configuration, try to keep all versions of
+a particular job within the same `.aurora` file. For example, if you
+have separate jobs for `cluster1`, `cluster1` staging, `cluster1`
+testing, and`cluster2`, keep them as close together as possible.
+
+Constructs shared across multiple jobs owned by your team (e.g.
+team-level defaults or structural templates) can be split into separate
+`.aurora`files and included via the `include` directive.
+
+
+Avoid Boilerplate
+------------------
+
+If you see repetition or find yourself copy and pasting any parts of
+your configuration, it's likely an opportunity for templating. Take the
+example below:
+
+`redundant.aurora` contains:
+
+    download = Process(
+      name = 'download',
+      cmdline = 'wget http://www.python.org/ftp/python/2.7.3/Python-2.7.3.tar.bz2',
+      max_failures = 5,
+      min_duration = 1)
+
+    unpack = Process(
+      name = 'unpack',
+      cmdline = 'rm -rf Python-2.7.3 && tar xzf Python-2.7.3.tar.bz2',
+      max_failures = 5,
+      min_duration = 1)
+
+    build = Process(
+      name = 'build',
+      cmdline = 'pushd Python-2.7.3 && ./configure && make && popd',
+      max_failures = 1)
+
+    email = Process(
+      name = 'email',
+      cmdline = 'echo Success | mail feynman@tmc.com',
+      max_failures = 5,
+      min_duration = 1)
+
+    build_python = Task(
+      name = 'build_python',
+      processes = [download, unpack, build, email],
+      constraints = [Constraint(order = ['download', 'unpack', 'build', 'email'])])
+
+As you'll notice, there's a lot of repetition in the `Process`
+definitions. For example, almost every process sets a `max_failures`
+limit to 5 and a `min_duration` to 1. This is an opportunity for factoring
+into a common process template.
+
+Furthermore, the Python version is repeated everywhere. This can be
+bound via structural templating as described in the [Advanced Binding](../configuration-templating/#AdvancedBinding)
+section.
+
+`less_redundant.aurora` contains:
+
+    class Python(Struct):
+      version = Required(String)
+      base = Default(String, 'Python-{{version}}')
+      package = Default(String, '{{base}}.tar.bz2')
+
+    ReliableProcess = Process(
+      max_failures = 5,
+      min_duration = 1)
+
+    download = ReliableProcess(
+      name = 'download',
+      cmdline = 'wget http://www.python.org/ftp/python/{{python.version}}/{{python.package}}')
+
+    unpack = ReliableProcess(
+      name = 'unpack',
+      cmdline = 'rm -rf {{python.base}} && tar xzf {{python.package}}')
+
+    build = ReliableProcess(
+      name = 'build',
+      cmdline = 'pushd {{python.base}} && ./configure && make && popd',
+      max_failures = 1)
+
+    email = ReliableProcess(
+      name = 'email',
+      cmdline = 'echo Success | mail {{role}}@foocorp.com')
+
+    build_python = SequentialTask(
+      name = 'build_python',
+      processes = [download, unpack, build, email]).bind(python = Python(version = "2.7.3"))
+
+
+Thermos Uses bash, But Thermos Is Not bash
+-------------------------------------------
+
+#### Bad
+
+Many tiny Processes makes for harder to manage configurations.
+
+    copy = Process(
+      name = 'copy',
+      cmdline = 'rcp user@my_machine:my_application .'
+     )
+
+     unpack = Process(
+       name = 'unpack',
+       cmdline = 'unzip app.zip'
+     )
+
+     remove = Process(
+       name = 'remove',
+       cmdline = 'rm -f app.zip'
+     )
+
+     run = Process(
+       name = 'app',
+       cmdline = 'java -jar app.jar'
+     )
+
+     run_task = Task(
+       processes = [copy, unpack, remove, run],
+       constraints = order(copy, unpack, remove, run)
+     )
+
+#### Good
+
+Each `cmdline` runs in a bash subshell, so you have the full power of
+bash. Chaining commands with `&&` or `||` is almost always the right
+thing to do.
+
+Also for Tasks that are simply a list of processes that run one after
+another, consider using the `SequentialTask` helper which applies a
+linear ordering constraint for you.
+
+    stage = Process(
+      name = 'stage',
+      cmdline = 'rcp user@my_machine:my_application . && unzip app.zip && rm -f app.zip')
+
+    run = Process(name = 'app', cmdline = 'java -jar app.jar')
+
+    run_task = SequentialTask(processes = [stage, run])
+
+
+Rarely Use Functions In Your Configurations
+-------------------------------------------
+
+90% of the time you define a function in a `.aurora` file, you're
+probably Doing It Wrong(TM).
+
+#### Bad
+
+    def get_my_task(name, user, cpu, ram, disk):
+      return Task(
+        name = name,
+        user = user,
+        processes = [STAGE_PROCESS, RUN_PROCESS],
+        constraints = order(STAGE_PROCESS, RUN_PROCESS),
+        resources = Resources(cpu = cpu, ram = ram, disk = disk)
+     )
+
+     task_one = get_my_task('task_one', 'feynman', 1.0, 32*MB, 1*GB)
+     task_two = get_my_task('task_two', 'feynman', 2.0, 64*MB, 1*GB)
+
+#### Good
+
+This one is more idiomatic. Forced keyword arguments prevents accidents,
+e.g. constructing a task with "32*MB" when you mean 32MB of ram and not
+disk. Less proliferation of task-construction techniques means
+easier-to-read, quicker-to-understand, and a more composable
+configuration.
+
+    TASK_TEMPLATE = SequentialTask(
+      user = 'wickman',
+      processes = [STAGE_PROCESS, RUN_PROCESS],
+    )
+
+    task_one = TASK_TEMPLATE(
+      name = 'task_one',
+      resources = Resources(cpu = 1.0, ram = 32*MB, disk = 1*GB) )
+
+    task_two = TASK_TEMPLATE(
+      name = 'task_two',
+      resources = Resources(cpu = 2.0, ram = 64*MB, disk = 1*GB)
+    )



Mime
View raw message