aurora-commits mailing list archives

Subject svn commit: r1581246 [6/6] - in /incubator/aurora/site: ./ publish/ publish/community/ publish/developers/ publish/docs/gettingstarted/ publish/docs/howtocontribute/ publish/documentation/ publish/documentation/latest/ publish/documentation/latest/clie...
Date Tue, 25 Mar 2014 06:10:07 GMT
Added: incubator/aurora/site/source/documentation/latest/
--- incubator/aurora/site/source/documentation/latest/ (added)
+++ incubator/aurora/site/source/documentation/latest/ Tue Mar 25 06:10:05 2014
@@ -0,0 +1,261 @@
+Aurora Tutorial
+Before reading this document, you should read over the (short) [README](/documentation/latest/README/)
+for the Aurora docs.
+- [Introduction](#introduction)
+- [Setup: Install Aurora](#setup-install-aurora)
+- [The Script](#the-script)
+- [Aurora Configuration](#aurora-configuration)
+- [What's Going On In That Configuration File?](#whats-going-on-in-that-configuration-file)
+- [Creating the Job](#creating-the-job)
+- [Watching the Job Run](#watching-the-job-run)
+- [Cleanup](#cleanup)
+- [Next Steps](#next-steps)
+## Introduction
+This tutorial shows how to use the Aurora scheduler to run (and
+"`printf-debug`") a hello world program on Mesos. The operational
+hierarchy is:
+- Aurora manages and schedules jobs for Mesos to run.
+- Mesos manages the individual tasks that make up a job.
+- Thermos manages the individual processes that make up a task.
+This is the recommended first document for new Aurora users who want to
+get up to speed on the system.
+To get help, email questions to the Aurora Developer List,
+## Setup: Install Aurora
+You use the Aurora client and web UI to interact with Aurora jobs. To
+install it locally, see [](/documentation/latest/vagrant/). The remainder of this
+Tutorial assumes you are running Aurora using Vagrant.
+## The Script
+Our "hello world" application is a simple Python script that loops
+forever, displaying the time every few seconds. Copy the code below and
+put it in a file named `` in the root of your Aurora repository clone (Note:
+this directory is the same as `/vagrant` inside the Vagrant VMs).
+The script has an intentional bug, which we will explain later on.
+import sys
+import time
+def main(argv):
+  SLEEP_DELAY = 10  # seconds between messages; assumed value, the definition is missing here
+  # Python ninjas - ignore this blatant bug.
+  for i in xrang(100):
+    print("Hello world! The time is now: %s. Sleeping for %d secs" % (
+      time.asctime(), SLEEP_DELAY))
+    sys.stdout.flush()
+    time.sleep(SLEEP_DELAY)
+if __name__ == "__main__":
+  main(sys.argv)
+## Aurora Configuration
+Once we have our script/program, we need to create a *configuration
+file* that tells Aurora how to manage and launch our Job. Save the below
+code in the file `hello_world.aurora` in the same directory as your
+`` file. (All Aurora configuration files end with `.aurora` and
+are written in a Python variant.)
+import os
+# copy into the local sandbox
+install = Process(
+  name = 'fetch_package',
+  cmdline = 'cp /vagrant/ . && chmod +x')
+# run the script
+hello_world = Process(
+  name = 'hello_world',
+  cmdline = 'python')
+# describe the task
+hello_world_task = SequentialTask(
+  processes = [install, hello_world],
+  resources = Resources(cpu = 1, ram = 1*MB, disk=8*MB))
+jobs = [
+  Job(name = 'hello_world', cluster = 'example', role = 'www-data',
+      environment = 'devel', task = hello_world_task)
+]
+For more about Aurora configuration files, see the [Configuration
+Tutorial](/documentation/latest/configurationtutorial/) and the [Aurora + Thermos
+Reference](/documentation/latest/configurationreference/) (preferably after finishing this
+tutorial).
+## What's Going On In That Configuration File?
+More than you might think.
+1. From a "big picture" viewpoint, it first defines two
+Processes. Then it defines a Task that runs the two Processes in the
+order specified in the Task definition, as well as specifying what
+computational and memory resources are available for them.  Finally,
+it defines a Job that will schedule the Task on available and suitable
+machines. This Job is the sole member of a list of Jobs; you can
+specify more than one Job in a config file.
+2. At the Process level, it specifies how to get your code into the
+local sandbox in which it will run. It then specifies how the code is
+actually run once the second Process starts.
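The ordering described above can be pictured with a small, self-contained Python sketch. It is only a loose model of a sequential task, not how Thermos is implemented: `run_sequential` is a hypothetical helper that runs shell command lines one after another and stops at the first non-zero exit code, so later processes never start.

```python
import subprocess


def run_sequential(cmdlines):
    """Run shell command lines in order and stop at the first failure.

    Returns a list of (cmdline, exit_code) pairs for the commands that ran.
    """
    results = []
    for cmdline in cmdlines:
        exit_code = subprocess.call(cmdline, shell=True)
        results.append((cmdline, exit_code))
        if exit_code != 0:
            break  # later processes never start
    return results
```

In this model, a failing install step would prevent the run step from ever launching, which mirrors the "install first, then run" order of the two Processes in the configuration.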
+## Creating the Job
+We're ready to launch our job! To do so, we use the Aurora Client to
+issue a Job creation request to the Aurora scheduler.
+Many Aurora Client commands take a *job key* argument, which uniquely
+identifies a Job. A job key consists of four parts, each separated by a
+"/". The four parts are `<cluster>/<role>/<environment>/<jobname>`,
+in that order. When comparing two job keys, if any of the
+four parts is different from its counterpart in the other key, then the
+two job keys identify two separate jobs. If all four values are
+identical, the job keys identify the same job.
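As an illustration of the job key rules above, here is a minimal Python sketch. `JobKey` and `parse_job_key` are hypothetical names chosen for this example and are not part of the Aurora client; the sketch only splits the key on "/" and compares all four parts.

```python
from collections import namedtuple

# The four parts of a job key, in the order given above.
JobKey = namedtuple("JobKey", ["cluster", "role", "environment", "jobname"])


def parse_job_key(text):
    """Split '<cluster>/<role>/<environment>/<jobname>' into a JobKey."""
    parts = text.split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(
            "expected <cluster>/<role>/<environment>/<jobname>, got %r" % text)
    return JobKey(*parts)
```

Two keys parsed this way compare equal only when all four parts match, so `example/www-data/devel/hello_world` and `example/www-data/test/hello_world` would name two separate jobs.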
+`/etc/aurora/clusters.json` on the Aurora scheduler machine lists the available
+cluster names. For Vagrant, from the top level of your Aurora repository clone, run:
+    $ vagrant ssh aurora-scheduler
+Followed by:
+    vagrant@precise64:~$ cat /etc/aurora/clusters.json
+You'll see something like:
+    [{
+      "name": "example",
+      "zk": "",
+      "scheduler_zk_path": "/aurora/scheduler",
+      "auth_mechanism": "UNAUTHENTICATED"
+    }]
+Use a `name` value for your job key's cluster value.
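If you prefer to list the cluster names programmatically, the following Python sketch may help. It assumes the file holds a JSON list of cluster objects, each with a `name` field, as in the sample output; `cluster_names` is a hypothetical helper, not an Aurora API.

```python
import json


def cluster_names(clusters_json_text):
    """Return the `name` of every cluster in a clusters.json document.

    Assumes the top-level JSON value is a list of objects.
    """
    clusters = json.loads(clusters_json_text)
    return [cluster["name"] for cluster in clusters]
```

On the scheduler VM you could pass it the contents of `/etc/aurora/clusters.json`, for example `cluster_names(open("/etc/aurora/clusters.json").read())`.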
+Role names are user accounts existing on the slave machines. If you don't know what accounts
+are available, contact your sysadmin.
+Environment names are namespaces; you can count on `prod`, `devel` and `test` existing.
+The Aurora Client command that actually runs our Job is `aurora create`. It creates a Job
+specified by its job key and configuration file arguments and runs it.
+    aurora create <cluster>/<role>/<environment>/<jobname> <config_file>
+Or for our example:
+    aurora create example/www-data/devel/hello_world /vagrant/hello_world.aurora
+Note: Remember, the job key's `<jobname>` value is the name of the Job, not the name
+of its code file.
+This returns:
+    $ vagrant ssh aurora-scheduler
+    Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)
+     * Documentation:
+    Welcome to your Vagrant-built virtual machine.
+    Last login: Fri Jan  3 02:18:55 2014 from
+    vagrant@precise64:~$ aurora create example/www-data/devel/hello_world \
+        /vagrant/hello_world.aurora
+     INFO] Creating job hello_world
+     INFO] Response from scheduler: OK (message: 1 new tasks pending for job
+      www-data/devel/hello_world)
+     INFO] Job url: http://precise64:8081/scheduler/www-data/devel/hello_world
+## Watching the Job Run
+Now that our job is running, let's see what it's doing. Access the
+scheduler web interface at `http://$scheduler_hostname:$scheduler_port/scheduler`
+Or when using `vagrant`, ``
+First we see what Jobs are scheduled:
+![Scheduled Jobs](images/ScheduledJobs.png)
+Click on your user name, which in this case was `www-data`, and we see the Jobs associated
+with that role:
+![Role Jobs](images/RoleJobs.png)
+Uh oh, that `Unstable` next to our `hello_world` Job doesn't look good. Click the
+`hello_world` Job, and you'll see:
+![hello_world Job](images/HelloWorldJob.png)
+Oops, looks like our first job didn't quite work! The task failed, so we have
+to figure out what went wrong.
+Access the page for our Task by clicking on its Host.
+![Task page](images/TaskBreakdown.png)
+Once there, we see that the
+`hello_world` process failed. The Task page captures the standard error and
+standard output streams and makes them available. Clicking through
+to `stderr` on the failed `hello_world` process, we see what happened.
+![stderr page](images/stderr.png)
+It looks like we made a typo in our Python script. We wanted `xrange`,
+not `xrang`. Edit the `` script, save as `` and change your
+`hello_world.aurora` config file to use `` instead of ``.
+Now that we've updated our configuration, let's restart the job:
+    aurora update example/www-data/devel/hello_world /vagrant/hello_world.aurora
+This time, the task comes up, we inspect the page, and see that the
+`hello_world` process is running.
+![Running Task page](images/runningtask.png)
+We then inspect the output by clicking on `stdout` and see our process's output:
+![stdout page](images/stdout.png)
+## Cleanup
+Now that we're done, we kill the job using the Aurora client:
+    vagrant@precise64:~$ aurora kill example/www-data/devel/hello_world
+     INFO] Killing tasks for job: example/www-data/devel/hello_world
+     INFO] Response from scheduler: OK (message: Tasks killed.)
+     INFO] Job url: http://precise64:8081/scheduler/www-data/devel/hello_world
+    vagrant@precise64:~$
+The Task scheduler page now shows the `hello_world` process as `KILLED`.
+![Killed Task page](images/killedtask.png)
+## Next Steps
+Now that you've finished this Tutorial, you should read or do the following:
+- [The Aurora Configuration Tutorial](/documentation/latest/configurationtutorial/), which provides more examples
+  and best practices for writing Aurora configurations. You should also look at
+  the [Aurora + Thermos Configuration Reference](/documentation/latest/configurationreference/).
+- The [Aurora User Guide](/documentation/latest/userguide/) provides an overview of how Aurora, Mesos, and
+  Thermos work "under the hood".
+- Explore the Aurora Client - use the `aurora help` subcommand, and read the
+  [Aurora Client Commands](/documentation/latest/clientcommands/) document.

Added: incubator/aurora/site/source/documentation/latest/
--- incubator/aurora/site/source/documentation/latest/ (added)
+++ incubator/aurora/site/source/documentation/latest/ Tue Mar 25 06:10:05 2014
@@ -0,0 +1,292 @@
+Aurora User Guide
+- [Overview](#overview)
+- [Job Lifecycle](#job-lifecycle)
+  - [Life Of A Task](#life-of-a-task)
+  - [PENDING to RUNNING states](#pending-to-running-states)
+  - [Task Updates](#task-updates)
+  - [Giving Priority to Production Tasks: PREEMPTING](#giving-priority-to-production-tasks-preempting)
+  - [Natural Termination: FINISHED, FAILED](#natural-termination-finished-failed)
+  - [Forceful Termination: KILLING, RESTARTING](#forceful-termination-killing-restarting)
+- [Configuration](#configuration)
+- [Creating Jobs](#creating-jobs)
+- [Interacting With Jobs](#interacting-with-jobs)
+## Overview
+This document gives an overview of how Aurora works under the hood.
+It assumes you've already worked through the "hello world" example
+job in the [Aurora Tutorial](/documentation/latest/tutorial/). Specifics of how to use Aurora are **not**
+given here, but pointers to documentation about how to use Aurora are provided.
+Aurora is a Mesos framework used to schedule *jobs* onto Mesos. Mesos
+cares about individual *tasks*, but typical jobs consist of dozens or
+hundreds of task replicas. Aurora provides a layer on top of Mesos with
+its `Job` abstraction. An Aurora `Job` consists of a task template and
+instructions for creating near-identical replicas of that task (modulo
+things like "instance id" or specific port numbers which may differ from
+machine to machine).
+How many tasks make up a Job is complicated. On a basic level, a Job consists of
+one task template and instructions for creating near-identical replicas of that task
+(otherwise referred to as "instances" or "shards").
+However, since Jobs can be updated on the fly, a single Job identifier or *job key*
+can have multiple job configurations associated with it.
+For example, consider when I have a Job with 4 instances that each
+request 1 core of cpu, 1 GB of RAM, and 1 GB of disk space as specified
+in the configuration file `hello_world.aurora`. I want to
+update it so it requests 2 GB of RAM instead of 1. I create a new
+configuration file to do that called `new_hello_world.aurora` and
+issue an `aurora update --shards=0-1 <job_key_value> new_hello_world.aurora` command.
+This results in instances 0 and 1 having 1 cpu, 2 GB of RAM, and 1 GB of disk space,
+while instances 2 and 3 have 1 cpu, 1 GB of RAM, and 1 GB of disk space. If instance 3
+dies and restarts, it restarts with 1 cpu, 1 GB RAM, and 1 GB disk space.
+So there are two task configurations for the same Job
+at the same time, each valid for a different range of instances.
+This isn't a recommended pattern, but it is valid and supported by the
+Aurora scheduler. This most often manifests in the "canary pattern" where
+instance 0 runs with a different configuration than instances 1-N to test
+different code versions alongside the actual production job.
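The partial update described above can be modeled with a short Python sketch. This is a toy model of the bookkeeping only, not scheduler code: `shard_range` and `apply_partial_update` are hypothetical helpers, and only the `first-last` range form shown in the example command is handled.

```python
def shard_range(spec):
    """Expand a "first-last" spec such as "0-1" into a list of instance ids."""
    first, last = (int(n) for n in spec.split("-"))
    return list(range(first, last + 1))


def apply_partial_update(instances, shards, new_config):
    """Return a new instance -> config mapping with new_config on `shards` only."""
    updated = dict(instances)
    for instance_id in shard_range(shards):
        updated[instance_id] = new_config
    return updated
```

Starting from four instances that share one configuration, updating shards `0-1` leaves the mapping with two distinct configurations: the new one on instances 0 and 1, the old one on instances 2 and 3.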
+A task can merely be a single *process* corresponding to a single
+command line, such as `python2.6`. However, a task can also
+consist of many separate processes, which all run within a single
+sandbox: for example, multiple cooperating agents running together,
+such as `logrotate`, `installer`, master, or slave processes. This is
+where Thermos comes in. While Aurora provides a `Job` abstraction on
+top of Mesos `Task`s, Thermos provides a `Process` abstraction
+underneath Mesos `Task`s and serves as part of the Aurora framework's
+executor.
+You define `Job`s, `Task`s, and `Process`es in a configuration file.
+Configuration files are written in Python, and make use of the Pystachio
+templating language. They end in a `.aurora` extension.
+Pystachio is a type-checked dictionary templating library.
+> TL;DR
+> -   Aurora manages jobs made of tasks.
+> -   Mesos manages tasks made of processes.
+> -   Thermos manages processes.
+> -   All defined in `.aurora` configuration file.
+![Aurora hierarchy](images/aurora_hierarchy.png)
+Each `Task` has a *sandbox* created when the `Task` starts and garbage
+collected when it finishes. All of a `Task`'s processes run in its
+sandbox, so processes can share state by using a shared current working
+directory.
+The sandbox garbage collection policy considers many factors, most
+importantly age and size. It makes a best-effort attempt to keep
+sandboxes around as long as possible post-task in order for service
+owners to inspect data and logs, should the `Task` have completed
+abnormally. But you can't design your applications assuming sandboxes
+will be around forever; instead, build log saving or other
+checkpointing mechanisms directly into your application or into your
+`Job` description.
+## Job Lifecycle
+When Aurora reads a configuration file and finds a `Job` definition, it:
+1.  Evaluates the `Job` definition.
+2.  Splits the `Job` into its constituent `Task`s.
+3.  Sends those `Task`s to the scheduler.
+4.  The scheduler puts the `Task`s into `PENDING` state, starting each
+    `Task`'s life cycle.
+**Note**: It is not currently possible to create an Aurora job from
+within an Aurora job.
+### Life Of A Task
+![Life of a task](images/lifeofatask.png)
+### PENDING to RUNNING states
+When a `Task` is in the `PENDING` state, the scheduler constantly
+searches for machines satisfying that `Task`'s resource request
+requirements (RAM, disk space, CPU time) while maintaining configuration
+constraints such as "a `Task` must run on machines dedicated to a
+particular role" or attribute limit constraints such as "at most 2
+`Task`s from the same `Job` may run on each rack". When the scheduler
+finds a suitable match, it assigns the `Task` to a machine and puts the
+`Task` into the `ASSIGNED` state.
+From the `ASSIGNED` state, the scheduler sends an RPC to the slave
+machine containing `Task` configuration, which the slave uses to spawn
+an executor responsible for the `Task`'s lifecycle. When the scheduler
+receives an acknowledgement that the machine has accepted the `Task`,
+the `Task` goes into `STARTING` state.
+`STARTING` state initializes a `Task` sandbox. When the sandbox is fully
+initialized, Thermos begins to invoke `Process`es. Also, the slave
+machine sends an update to the scheduler that the `Task` is
+in `RUNNING` state.
+If a `Task` stays in `ASSIGNED` or `STARTING` for too long, the
+scheduler forces it into `LOST` state, creating a new `Task` in its
+place that's sent into `PENDING` state. This is technically true of any
+active state: if the Mesos core tells the scheduler that a slave has
+become unhealthy (or outright disappeared), the `Task`s assigned to that
+slave go into `LOST` state and new `Task`s are created in their place.
+From `PENDING` state, there is no guarantee a `Task` will be reassigned
+to the same machine unless job constraints explicitly force it there.
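The transitions described so far can be summarized as a small state table. The Python sketch below is a simplification that covers only the states named in this section (the scheduler has more); `ALLOWED`, `advance`, and `run_path` are hypothetical names used for illustration.

```python
# Transitions named in this section only; the real scheduler has more states.
ALLOWED = {
    "PENDING": {"ASSIGNED"},
    "ASSIGNED": {"STARTING", "LOST"},
    "STARTING": {"RUNNING", "LOST"},
    "RUNNING": {"LOST"},
    "LOST": set(),
}


def advance(state, new_state):
    """Return new_state if the move is allowed, else raise ValueError."""
    if new_state not in ALLOWED[state]:
        raise ValueError("illegal transition %s -> %s" % (state, new_state))
    return new_state


def run_path(path):
    """Walk a list of states starting from PENDING and return the final state."""
    state = "PENDING"
    for new_state in path:
        state = advance(state, new_state)
    return state
```

A `LOST` task has no outgoing transitions in this table because, as described above, the scheduler creates a new `Task` in `PENDING` rather than reviving the lost one.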
+If there is a state mismatch, (e.g. a machine returns from a `netsplit`
+and the scheduler has marked all its `Task`s `LOST` and rescheduled
+them), a state reconciliation process kills the errant `RUNNING` tasks,
+which may take up to an hour. But to emphasize this point: there is no
+uniqueness guarantee for a single instance of a job in the presence of
+network partitions. If the Task requires that, it should be baked in at
+the application level using a distributed coordination service such as
+### Task Updates
+`Job` configurations can be updated at any point in their lifecycle.
+Usually updates are done incrementally using a process called a *rolling
+upgrade*, in which Tasks are upgraded in small groups, one group at a
+time.  Updates are done using various Aurora Client commands.
+For a configuration update, the Aurora Client calculates required changes
+by examining the current job config state and the new desired job config.
+It then starts a rolling batched update process by going through every batch
+and performing these operations:
+- If an instance is present in the scheduler but isn't in the new config,
+  then that instance is killed.
+- If an instance is not present in the scheduler but is present in
+  the new config, then the instance is created.
+- If an instance is present in both the scheduler and the new config, then
+  the client diffs both task configs. If it detects any changes, it
+  performs an instance update by killing the old config instance and adding
+  the new config instance.
+The Aurora client continues through the instance list until all tasks are
+updated, in `RUNNING`, and healthy for a configurable amount of time.
+If the client determines the update is not going well (a percentage of health
+checks have failed), it cancels the update.
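The three per-instance rules above can be sketched in Python as follows. This is an illustrative model only, not the Aurora client's code: `plan_update` is a hypothetical function, and task configs are represented as plain values compared with `!=`.

```python
def plan_update(current, desired):
    """Decide an action per instance id.

    `current` maps instance ids known to the scheduler to their task config;
    `desired` maps instance ids in the new config to their task config.
    """
    plan = {}
    for instance_id in sorted(set(current) | set(desired)):
        if instance_id not in desired:
            plan[instance_id] = "kill"
        elif instance_id not in current:
            plan[instance_id] = "create"
        elif current[instance_id] != desired[instance_id]:
            plan[instance_id] = "update"  # kill old instance, add new one
        else:
            plan[instance_id] = "unchanged"
    return plan
```

For example, with instances 0 to 2 running config `"a"` and a new config that keeps 0, changes 1, drops 2, and adds 3, the plan would be unchanged, update, kill, and create respectively.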
+Update cancellation runs a procedure similar to the update sequence
+described above, but in reverse order. New instance configs are swapped
+with old instance configs and batch updates proceed backwards
+from the point where the update failed. For example, (0,1,2) (3,4,5) (6,7,
+8-FAIL) results in a rollback in order (8,7,6) (5,4,3) (2,1,0).
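The batch and rollback ordering in this example can be reproduced with a short Python sketch. The helpers `batches` and `rollback_order` are hypothetical; they only compute the order of instance ids and assume the whole failing batch is rolled back, as in the example.

```python
def batches(instance_ids, batch_size):
    """Split instance ids into consecutive batches of at most batch_size."""
    ids = list(instance_ids)
    return [tuple(ids[i:i + batch_size]) for i in range(0, len(ids), batch_size)]


def rollback_order(update_batches, failed_instance):
    """Batches to roll back, most recent first, each batch reversed."""
    touched = []
    for batch in update_batches:
        touched.append(batch)
        if failed_instance in batch:
            break  # batches after the failure were never started
    return [tuple(reversed(batch)) for batch in reversed(touched)]
```

With nine instances in batches of three and a failure on instance 8, `rollback_order(batches(range(9), 3), 8)` yields `[(8, 7, 6), (5, 4, 3), (2, 1, 0)]`.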
+### Giving Priority to Production Tasks: PREEMPTING
+Sometimes a Task needs to be interrupted, such as when a non-production
+Task's resources are needed by a higher priority production Task. This
+type of interruption is called a *pre-emption*. When this happens in
+Aurora, the non-production Task is killed and moved into
+the `PREEMPTING` state when both of the following are true:
+- The task being killed is a non-production task.
+- The other task is a `PENDING` production task that hasn't been
+  scheduled due to a lack of resources.
+Since production tasks are much more important, Aurora kills off the
+non-production task to free up resources for the production task. The
+scheduler UI shows the non-production task was preempted in favor of the
+production task. At some point, tasks in `PREEMPTING` move to `KILLED`.
+Note that non-production tasks consuming many resources are likely to be
+preempted in favor of production tasks.
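The two preemption conditions can be written as a simple predicate. The Python sketch below is a hedged illustration: `can_preempt` and the dictionary fields `production` and `state` are assumptions made for this example, and the sketch does not check why the pending task is unscheduled (the text requires that it be a lack of resources).

```python
def can_preempt(victim, candidate):
    """True if `victim` may be preempted in favor of `candidate`.

    Both arguments are dicts with a boolean `production` field;
    `candidate` also carries its current `state`.
    """
    return (not victim["production"]
            and candidate["production"]
            and candidate["state"] == "PENDING")
```

A production task is never a victim under this predicate, and a non-production candidate never triggers preemption.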
+### Natural Termination: FINISHED, FAILED
+A `RUNNING` `Task` can terminate without direct user interaction. For
+example, it may be a finite computation that finishes, even something as
+simple as `echo hello world`. Or it could be an exceptional condition in
+a long-lived service. If the `Task` is successful (its underlying
+processes have succeeded with exit status `0` or finished without
+reaching failure limits) it moves into `FINISHED` state. If it finished
+after reaching a set of failure limits, it goes into `FAILED` state.
+### Forceful Termination: KILLING, RESTARTING
+You can terminate a `Task` by issuing an `aurora kill` command, which
+moves it into `KILLING` state. The scheduler then sends the slave a
+request to terminate the `Task`. If the scheduler receives a successful
+response, it moves the Task into `KILLED` state and never restarts it.
+The scheduler has access to a non-public `RESTARTING` state. If a `Task`
+is forced into the `RESTARTING` state, the scheduler kills the
+underlying task but in parallel schedules an identical replacement for it.
+## Configuration
+You define and configure your Jobs (and their Tasks and Processes) in
+Aurora configuration files. Their filenames end with the `.aurora`
+suffix, and you write them in Python making use of the Pystachio
+templating language, along
+with specific Aurora, Mesos, and Thermos commands and methods. See the
+[Configuration Guide and Reference](/documentation/latest/configurationreference/) and
+[Configuration Tutorial](/documentation/latest/configurationtutorial/).
+## Creating Jobs
+You create and manipulate Aurora Jobs with the Aurora client, which starts all its
+command line commands with
+`aurora`. See [Aurora Client Commands](/documentation/latest/clientcommands/) for details
+about the Aurora Client.
+## Interacting With Jobs
+You interact with Aurora jobs either via:
+- Read-only Web UIs
+  Part of the output from creating a new Job is a URL for the Job's scheduler UI page.
+  For example:
+      vagrant@precise64:~$ aurora create example/www-data/prod/hello \
+      /vagrant/examples/jobs/hello_world.aurora
+      INFO] Creating job hello
+      INFO] Response from scheduler: OK (message: 1 new tasks pending for job www-data/prod/hello)
+      INFO] Job url: http://precise64:8081/scheduler/www-data/prod/hello
+  The "Job url" goes to the Job's scheduler UI page. To go to the overall scheduler UI page,
+  stop at the "scheduler" part of the URL, in this case, `http://precise64:8081/scheduler`
+  You can also reach the scheduler UI page via the Client command `aurora open`:
+      aurora open [<cluster>[/<role>[/<env>/<job_name>]]]
+  If only the cluster is specified, it goes directly to that cluster's scheduler main page.
+  If the role is specified, it goes to the top-level role page. If the full job key is specified,
+  it goes directly to the job page where you can inspect individual tasks.
+  Once you click through to a role page, you see Jobs arranged separately by pending jobs,
+  active
+  jobs, and finished jobs. Jobs are arranged by role, typically a service account for production
+  jobs and user accounts for test or development jobs.
+- The Aurora Client's command line interface
+  Several Client commands have a `-o` option that automatically opens a window to
+  the specified Job's scheduler UI URL. And, as described above, the `open` command also
+  takes you there.
+  For a complete list of Aurora Client commands, use `aurora help` and, for specific
+  command help, `aurora help [command]`. **Note**: `aurora help open`
+  returns `"subcommand open not found"` due to our reflection tricks not
+  working on words that are also builtin Python function names. Or see the
+  [Aurora Client Commands](/documentation/latest/clientcommands/) document.
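The scheduler UI URLs described above follow a regular pattern, which the Python sketch below tries to capture. `scheduler_url` is a hypothetical helper, and the role-page form (`/scheduler/<role>`) is an assumption inferred from the job URL shown in the example output.

```python
def scheduler_url(base, role=None, env=None, job_name=None):
    """Build a scheduler UI URL for a cluster, a role, or a full job key.

    `base` is the scheduler address, e.g. "http://precise64:8081".
    """
    url = base.rstrip("/") + "/scheduler"
    if role is None:
        return url  # cluster only: scheduler main page
    url += "/" + role
    if env is not None and job_name is not None:
        url += "/%s/%s" % (env, job_name)  # full job key: job page
    return url
```

For the job created in the example, `scheduler_url("http://precise64:8081", "www-data", "prod", "hello")` gives `http://precise64:8081/scheduler/www-data/prod/hello`, which matches the "Job url" line in the output.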

Added: incubator/aurora/site/source/documentation/latest/
--- incubator/aurora/site/source/documentation/latest/ (added)
+++ incubator/aurora/site/source/documentation/latest/ Tue Mar 25 06:10:05 2014
@@ -0,0 +1,18 @@
+Aurora includes a `Vagrantfile` that defines a full Mesos cluster running Aurora. You can use it to
+explore Aurora's various components. To get started, install
+VirtualBox and Vagrant,
+then run `vagrant up` somewhere in the repository source tree to create a team of VMs. This may take some time initially as it builds all
+the components involved in running an Aurora cluster.
+The scheduler is listening on
+The observer is listening on
+The master is listening on
+Once everything is up, you can `vagrant ssh aurora-scheduler` and execute Aurora client commands using the `aurora` client.
+Most Vagrant-related problems can be fixed by the following steps:
+* Destroying the vagrant environment with `vagrant destroy`
+* Cleaning the repository of build artifacts and other intermediate output with `git clean
+* Bringing up the vagrant environment with `vagrant up`
