aurora-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject svn commit: r1721584 [12/12] - in /aurora/site: data/ publish/ publish/blog/ publish/blog/aurora-0-11-0-released/ publish/documentation/0.10.0/ publish/documentation/0.10.0/build-system/ publish/documentation/0.10.0/client-cluster-configuration/ publis...
Date Wed, 23 Dec 2015 22:45:24 GMT
Added: aurora/site/source/documentation/0.11.0/
--- aurora/site/source/documentation/0.11.0/ (added)
+++ aurora/site/source/documentation/0.11.0/ Wed Dec 23 22:45:21 2015
@@ -0,0 +1,355 @@
+Aurora User Guide
+- [Overview](#overview)
+- [Job Lifecycle](#job-lifecycle)
+	- [Life Of A Task](#life-of-a-task)
+	- [PENDING to RUNNING states](#pending-to-running-states)
+	- [Task Updates](#task-updates)
+	- [HTTP Health Checking and Graceful Shutdown](#http-health-checking-and-graceful-shutdown)
+		- [Tearing a task down](#tearing-a-task-down)
+	- [Giving Priority to Production Tasks: PREEMPTING](#giving-priority-to-production-tasks-preempting)
+	- [Natural Termination: FINISHED, FAILED](#natural-termination-finished-failed)
+	- [Forceful Termination: KILLING, RESTARTING](#forceful-termination-killing-restarting)
+- [Service Discovery](#service-discovery)
+- [Configuration](#configuration)
+- [Creating Jobs](#creating-jobs)
+- [Interacting With Jobs](#interacting-with-jobs)
+This document gives an overview of how Aurora works under the hood.
+It assumes you've already worked through the "hello world" example
+job in the [Aurora Tutorial](/documentation/0.11.0/tutorial/). Specifics of how to use Aurora
are **not**
+ given here, but pointers to documentation about how to use Aurora are
+Aurora is a Mesos framework used to schedule *jobs* onto Mesos. Mesos
+cares about individual *tasks*, but typical jobs consist of dozens or
+hundreds of task replicas. Aurora provides a layer on top of Mesos with
+its `Job` abstraction. An Aurora `Job` consists of a task template and
+instructions for creating near-identical replicas of that task (modulo
+things like "instance id" or specific port numbers which may differ from
+machine to machine).
+How many tasks make up a Job is complicated. On a basic level, a Job consists of
+one task template and instructions for creating near-idential replicas of that task
+(otherwise referred to as "instances" or "shards").
+However, since Jobs can be updated on the fly, a single Job identifier or *job key*
+can have multiple job configurations associated with it.
+For example, consider when I have a Job with 4 instances that each
+request 1 core of cpu, 1 GB of RAM, and 1 GB of disk space as specified
+in the configuration file `hello_world.aurora`. I want to
+update it so it requests 2 GB of RAM instead of 1. I create a new
+configuration file to do that called `new_hello_world.aurora` and
+issue a `aurora update start <job_key_value>/0-1 new_hello_world.aurora`
+This results in instances 0 and 1 having 1 cpu, 2 GB of RAM, and 1 GB of disk space,
+while instances 2 and 3 have 1 cpu, 1 GB of RAM, and 1 GB of disk space. If instance 3
+dies and restarts, it restarts with 1 cpu, 1 GB RAM, and 1 GB disk space.
+So that means there are two simultaneous task configurations for the same Job
+at the same time, just valid for different ranges of instances.
+This isn't a recommended pattern, but it is valid and supported by the
+Aurora scheduler. This most often manifests in the "canary pattern" where
+instance 0 runs with a different configuration than instances 1-N to test
+different code versions alongside the actual production job.
+A task can merely be a single *process* corresponding to a single
+command line, such as `python2.6`. However, a task can also
+consist of many separate processes, which all run within a single
+sandbox. For example, running multiple cooperating agents together,
+such as `logrotate`, `installer`, master, or slave processes. This is
+where Thermos  comes in. While Aurora provides a `Job` abstraction on
+top of Mesos `Tasks`, Thermos provides a `Process` abstraction
+underneath Mesos `Task`s and serves as part of the Aurora framework's
+You define `Job`s,` Task`s, and `Process`es in a configuration file.
+Configuration files are written in Python, and make use of the Pystachio
+templating language. They end in a `.aurora` extension.
+Pystachio is a type-checked dictionary templating library.
+> TL;DR
+> -   Aurora manages jobs made of tasks.
+> -   Mesos manages tasks made of processes.
+> -   Thermos manages processes.
+> -   All defined in `.aurora` configuration file.
+![Aurora hierarchy](images/aurora_hierarchy.png)
+Each `Task` has a *sandbox* created when the `Task` starts and garbage
+collected when it finishes. All of a `Task'`s processes run in its
+sandbox, so processes can share state by using a shared current working
+The sandbox garbage collection policy considers many factors, most
+importantly age and size. It makes a best-effort attempt to keep
+sandboxes around as long as possible post-task in order for service
+owners to inspect data and logs, should the `Task` have completed
+abnormally. But you can't design your applications assuming sandboxes
+will be around forever, e.g. by building log saving or other
+checkpointing mechanisms directly into your application or into your
+`Job` description.
+Job Lifecycle
+When Aurora reads a configuration file and finds a `Job` definition, it:
+1.  Evaluates the `Job` definition.
+2.  Splits the `Job` into its constituent `Task`s.
+3.  Sends those `Task`s to the scheduler.
+4.  The scheduler puts the `Task`s into `PENDING` state, starting each
+    `Task`'s life cycle.
+### Life Of A Task
+![Life of a task](images/lifeofatask.png)
+### PENDING to RUNNING states
+When a `Task` is in the `PENDING` state, the scheduler constantly
+searches for machines satisfying that `Task`'s resource request
+requirements (RAM, disk space, CPU time) while maintaining configuration
+constraints such as "a `Task` must run on machines  dedicated  to a
+particular role" or attribute limit constraints such as "at most 2
+`Task`s from the same `Job` may run on each rack". When the scheduler
+finds a suitable match, it assigns the `Task` to a machine and puts the
+`Task` into the `ASSIGNED` state.
+From the `ASSIGNED` state, the scheduler sends an RPC to the slave
+machine containing `Task` configuration, which the slave uses to spawn
+an executor responsible for the `Task`'s lifecycle. When the scheduler
+receives an acknowledgement that the machine has accepted the `Task`,
+the `Task` goes into `STARTING` state.
+`STARTING` state initializes a `Task` sandbox. When the sandbox is fully
+initialized, Thermos begins to invoke `Process`es. Also, the slave
+machine sends an update to the scheduler that the `Task` is
+in `RUNNING` state.
+If a `Task` stays in `ASSIGNED` or `STARTING` for too long, the
+scheduler forces it into `LOST` state, creating a new `Task` in its
+place that's sent into `PENDING` state. This is technically true of any
+active state: if the Mesos core tells the scheduler that a slave has
+become unhealthy (or outright disappeared), the `Task`s assigned to that
+slave go into `LOST` state and new `Task`s are created in their place.
+From `PENDING` state, there is no guarantee a `Task` will be reassigned
+to the same machine unless job constraints explicitly force it there.
+If there is a state mismatch, (e.g. a machine returns from a `netsplit`
+and the scheduler has marked all its `Task`s `LOST` and rescheduled
+them), a state reconciliation process kills the errant `RUNNING` tasks,
+which may take up to an hour. But to emphasize this point: there is no
+uniqueness guarantee for a single instance of a job in the presence of
+network partitions. If the Task requires that, it should be baked in at
+the application level using a distributed coordination service such as
+### Task Updates
+`Job` configurations can be updated at any point in their lifecycle.
+Usually updates are done incrementally using a process called a *rolling
+upgrade*, in which Tasks are upgraded in small groups, one group at a
+time.  Updates are done using various Aurora Client commands.
+For a configuration update, the Aurora Client calculates required changes
+by examining the current job config state and the new desired job config.
+It then starts a rolling batched update process by going through every batch
+and performing these operations:
+- If an instance is present in the scheduler but isn't in the new config,
+  then that instance is killed.
+- If an instance is not present in the scheduler but is present in
+  the new config, then the instance is created.
+- If an instance is present in both the scheduler the new config, then
+  the client diffs both task configs. If it detects any changes, it
+  performs an instance update by killing the old config instance and adds
+  the new config instance.
+The Aurora client continues through the instance list until all tasks are
+updated, in `RUNNING,` and healthy for a configurable amount of time.
+If the client determines the update is not going well (a percentage of health
+checks have failed), it cancels the update.
+Update cancellation runs a procedure similar to the described above
+update sequence, but in reverse order. New instance configs are swapped
+with old instance configs and batch updates proceed backwards
+from the point where the update failed. E.g.; (0,1,2) (3,4,5) (6,7,
+8-FAIL) results in a rollback in order (8,7,6) (5,4,3) (2,1,0).
+### HTTP Health Checking and Graceful Shutdown
+The Executor implements a protocol for rudimentary control of a task via HTTP.  Tasks subscribe
+this protocol by declaring a port named `health`.  Take for example this configuration snippet:
+    nginx = Process(
+      name = 'nginx',
+      cmdline = './ -port {{thermos.ports[http]}}')
+When this Process is included in a job, the job will be allocated a port, and the command
+will be replaced with something like:
+    ./ -port 42816
+Where 42816 happens to be the allocated. port.  Typically, the Executor monitors Processes
+a task only by liveness of the forked process.  However, when a `health` port was allocated,
it will
+also send periodic HTTP health checks.  A task requesting a `health` port must handle the
+| HTTP request            | Description                             |
+| ------------            | -----------                             |
+| `GET /health`           | Inquires whether the task is healthy.   |
+| `POST /quitquitquit`    | Task should initiate graceful shutdown. |
+| `POST /abortabortabort` | Final warning task is being killed.     |
+Please see the
+[configuration reference](/documentation/0.11.0/configuration-reference/#healthcheckconfig-objects)
+configuration options for this feature.
+#### Snoozing Health Checks
+If you need to pause your health check, you can do so by touching a file inside of your sandbox,
+named `.healthchecksnooze`
+As long as that file is present, health checks will be disabled, enabling users to gather
core dumps
+or other performance measurements without worrying about Aurora's health check killing their
+WARNING: Remember to remove this when you are done, otherwise your instance will have permanently
+disabled health checks.
+#### Tearing a task down
+The Executor follows an escalation sequence when killing a running task:
+  1. If `health` port is not present, skip to (5)
+  2. POST /quitquitquit
+  3. wait 5 seconds
+  4. POST /abortabortabort
+  5. Send SIGTERM (`kill`)
+  6. Send SIGKILL (`kill -9`)
+If the Executor notices that all Processes in a Task have aborted during this sequence, it
+not proceed with subsequent steps.  Note that graceful shutdown is best-effort, and due to
the many
+inevitable realities of distributed systems, it may not be performed.
+### Giving Priority to Production Tasks: PREEMPTING
+Sometimes a Task needs to be interrupted, such as when a non-production
+Task's resources are needed by a higher priority production Task. This
+type of interruption is called a *pre-emption*. When this happens in
+Aurora, the non-production Task is killed and moved into
+the `PREEMPTING` state  when both the following are true:
+- The task being killed is a non-production task.
+- The other task is a `PENDING` production task that hasn't been
+  scheduled due to a lack of resources.
+Since production tasks are much more important, Aurora kills off the
+non-production task to free up resources for the production task. The
+scheduler UI shows the non-production task was preempted in favor of the
+production task. At some point, tasks in `PREEMPTING` move to `KILLED`.
+Note that non-production tasks consuming many resources are likely to be
+preempted in favor of production tasks.
+### Natural Termination: FINISHED, FAILED
+A `RUNNING` `Task` can terminate without direct user interaction. For
+example, it may be a finite computation that finishes, even something as
+simple as `echo hello world. `Or it could be an exceptional condition in
+a long-lived service. If the `Task` is successful (its underlying
+processes have succeeded with exit status `0` or finished without
+reaching failure limits) it moves into `FINISHED` state. If it finished
+after reaching a set of failure limits, it goes into `FAILED` state.
+### Forceful Termination: KILLING, RESTARTING
+You can terminate a `Task` by issuing an `aurora job kill` command, which
+moves it into `KILLING` state. The scheduler then sends the slave  a
+request to terminate the `Task`. If the scheduler receives a successful
+response, it moves the Task into `KILLED` state and never restarts it.
+The scheduler has access to a non-public `RESTARTING` state. If a `Task`
+is forced into the `RESTARTING` state, the scheduler kills the
+underlying task but in parallel schedules an identical replacement for
+You define and configure your Jobs (and their Tasks and Processes) in
+Aurora configuration files. Their filenames end with the `.aurora`
+suffix, and you write them in Python making use of the Pystachio
+templating language, along
+with specific Aurora, Mesos, and Thermos commands and methods. See the
+[Configuration Guide and Reference](/documentation/0.11.0/configuration-reference/) and
+[Configuration Tutorial](/documentation/0.11.0/configuration-tutorial/).
+Service Discovery
+It is possible for the Aurora executor to announce tasks into ServerSets for
+the purpose of service discovery.  ServerSets use the Zookeeper [group membership pattern](
+of which there are several reference implementations:
+  - [C++](
+  - [Java](
+  - [Python](
+These can also be used natively in Finagle using the [ZookeeperServerSetCluster](
+For more information about how to configure announcing, see the [Configuration Reference](/documentation/0.11.0/configuration-reference/).
+Creating Jobs
+You create and manipulate Aurora Jobs with the Aurora client, which starts all its
+command line commands with
+`aurora`. See [Aurora Client Commands](/documentation/0.11.0/client-commands/) for details
+about the Aurora Client.
+Interacting With Jobs
+You interact with Aurora jobs either via:
+- Read-only Web UIs
+  Part of the output from creating a new Job is a URL for the Job's scheduler UI page.
+  For example:
+      vagrant@precise64:~$ aurora job create devcluster/www-data/prod/hello \
+      /vagrant/examples/jobs/hello_world.aurora
+      INFO] Creating job hello
+      INFO] Response from scheduler: OK (message: 1 new tasks pending for job www-data/prod/hello)
+      INFO] Job url: http://precise64:8081/scheduler/www-data/prod/hello
+  The "Job url" goes to the Job's scheduler UI page. To go to the overall scheduler UI page,
+  stop at the "scheduler" part of the URL, in this case, `http://precise64:8081/scheduler`
+  You can also reach the scheduler UI page via the Client command `aurora job open`:
+      aurora job open [<cluster>[/<role>[/<env>/<job_name>]]]
+  If only the cluster is specified, it goes directly to that cluster's scheduler main page.
+  If the role is specified, it goes to the top-level role page. If the full job key is specified,
+  it goes directly to the job page where you can inspect individual tasks.
+  Once you click through to a role page, you see Jobs arranged separately by pending jobs,
+  jobs, and finished jobs. Jobs are arranged by role, typically a service account for production
+  jobs and user accounts for test or development jobs.
+- The Aurora client
+  See [client commands](/documentation/0.11.0/client-commands/).

Added: aurora/site/source/documentation/0.11.0/
--- aurora/site/source/documentation/0.11.0/ (added)
+++ aurora/site/source/documentation/0.11.0/ Wed Dec 23 22:45:21 2015
@@ -0,0 +1,137 @@
+Getting Started
+This document shows you how to configure a complete cluster using a virtual machine. This
+replicates a real cluster in your development machine as closely as possible. After you complete
+the steps outlined here, you will be ready to create and run your first Aurora job.
+The following sections describe these steps in detail:
+1. [Overview](#overview)
+1. [Install VirtualBox and Vagrant](#install-virtualbox-and-vagrant)
+1. [Clone the Aurora repository](#clone-the-aurora-repository)
+1. [Start the local cluster](#start-the-local-cluster)
+1. [Log onto the VM](#log-onto-the-vm)
+1. [Run your first job](#run-your-first-job)
+1. [Rebuild components](#rebuild-components)
+1. [Shut down or delete your local cluster](#shut-down-or-delete-your-local-cluster)
+1. [Troubleshooting](#troubleshooting)
+The Aurora distribution includes a set of scripts that enable you to create a local cluster
+your development machine. These scripts use [Vagrant]( and
+[VirtualBox]( to run and configure a virtual machine. Once the
+virtual machine is running, the scripts install and initialize Aurora and any required components
+to create the local cluster.
+Install VirtualBox and Vagrant
+First, download and install [VirtualBox]( on your development
+Then download and install [Vagrant]( To verify that the installation
+was successful, open a terminal window and type the `vagrant` command. You should see a list
+common commands for this tool.
+Clone the Aurora repository
+To obtain the Aurora source distribution, clone its Git repository using the following command:
+     git clone git://
+Start the local cluster
+Now change into the `aurora/` directory, which contains the Aurora source code and
+other scripts and tools:
+     cd aurora/
+To start the local cluster, type the following command:
+     vagrant up
+This command uses the configuration scripts in the Aurora distribution to:
+* Download a Linux system image.
+* Start a virtual machine (VM) and configure it.
+* Install the required build tools on the VM.
+* Install Aurora's requirements (like [Mesos]( and
+[Zookeeper]( on the VM.
+* Build and install Aurora from source on the VM.
+* Start Aurora's services on the VM.
+This process takes several minutes to complete.
+To verify that Aurora is running on the cluster, visit the following URLs:
+* Scheduler -
+* Observer -
+* Mesos Master -
+* Mesos Slave -
+Log onto the VM
+To SSH into the VM, run the following command in your development machine:
+     vagrant ssh
+To verify that Aurora is installed in the VM, type the `aurora` command. You should see a
+of arguments and possible commands.
+The `/vagrant` directory on the VM is mapped to the `aurora/` local directory
+from which you started the cluster. You can edit files inside this directory in your development
+machine and access them from the VM under `/vagrant`.
+A pre-installed `clusters.json` file refers to your local cluster as `devcluster`, which
+will use in client commands.
+Run your first job
+Now that your cluster is up and running, you are ready to define and run your first job in
+For more information, see the [Aurora Tutorial](/documentation/0.11.0/tutorial/).
+Rebuild components
+If you are changing Aurora code and would like to rebuild a component, you can use the `aurorabuild`
+command on the VM to build and restart a component.  This is considerably faster than destroying
+and rebuilding your VM.
+`aurorabuild` accepts a list of components to build and update. To get a list of supported
+components, invoke the `aurorabuild` command with no arguments:
+     vagrant ssh -c 'aurorabuild client'
+Shut down or delete your local cluster
+To shut down your local cluster, run the `vagrant halt` command in your development machine.
+start it again, run the `vagrant up` command.
+Once you are finished with your local cluster, or if you would otherwise like to start from
+you can use the command `vagrant destroy` to turn off and delete the virtual file system.
+Most of the vagrant related problems can be fixed by the following steps:
+* Destroying the vagrant environment with `vagrant destroy`
+* Killing any orphaned VMs (see AURORA-499) with `virtualbox` UI or `VBoxManage` command
line tool
+* Cleaning the repository of build artifacts and other intermediate output with `git clean
+* Bringing up the vagrant environment with `vagrant up`

Modified: aurora/site/source/documentation/latest/
--- aurora/site/source/documentation/latest/ (original)
+++ aurora/site/source/documentation/latest/ Wed Dec 23 22:45:21
@@ -152,6 +152,8 @@ to ZooKeeper) or explicitly set in the s
     # ...
 ### Considerations for running jobs in docker containers
+*Note: Docker support is currently EXPERIMENTAL.*
 In order for Aurora to launch jobs using docker containers, a few extra configuration options
 must be set.  The [docker containerizer](
 must be enabled on the mesos slaves by launching them with the `--containerizers=docker,mesos`

Modified: aurora/site/source/documentation/latest/
--- aurora/site/source/documentation/latest/ (original)
+++ aurora/site/source/documentation/latest/ Wed Dec 23 22:45:21 2015
@@ -17,7 +17,7 @@ We encourage you to ask questions on the
  * [Cron Jobs](/documentation/latest/cron-jobs/)
 ## Operators
- * [Install Aurora](/documentation/latest/installing/)
+ * [Deploy Aurora](/documentation/latest/deploying-aurora-scheduler/)
  * [Monitoring](/documentation/latest/monitoring/)
  * [Hooks for Aurora Client API](/documentation/latest/hooks/)
  * [Scheduler Storage](/documentation/latest/storage/)

View raw message