Subject: svn commit: r1739360 [6/8] - in /aurora/site: ./ data/ publish/ publish/blog/ publish/blog/aurora-0-13-0-released/ publish/documentation/0.10.0/ publish/documentation/0.10.0/build-system/ publish/documentation/0.10.0/client-cluster-configuration/ publi...
Date: Fri, 15 Apr 2016 20:21:35 -0000
To: commits@aurora.apache.org
From: jfarrell@apache.org

Added: aurora/site/source/documentation/0.13.0/getting-started/overview.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.13.0/getting-started/overview.md?rev=1739360&view=auto
==============================================================================
--- aurora/site/source/documentation/0.13.0/getting-started/overview.md (added)
+++ aurora/site/source/documentation/0.13.0/getting-started/overview.md Fri Apr 15 20:21:30 2016
@@ -0,0 +1,110 @@

Aurora System Overview
======================

Apache Aurora is a service scheduler that runs on top of Apache Mesos, enabling you to run
long-running services, cron jobs, and ad-hoc jobs that take advantage of Apache Mesos' scalability,
fault-tolerance, and resource isolation.


Components
----------

It is important to have an understanding of the components that make up
a functioning Aurora cluster.

![Aurora Components](../images/components.png)

* **Aurora scheduler**
  The scheduler is your primary interface to the work you run in your cluster. You will
  instruct it to run jobs, and it will manage them in Mesos for you. You will also frequently use
  the scheduler's read-only web interface as a heads-up display for what's running in your cluster.

* **Aurora client**
  The client (`aurora` command) is a command line tool that exposes primitives that you can use to
  interact with the scheduler.

  Aurora also provides an admin client (`aurora_admin` command) that contains commands built for
  cluster administrators. You can use this tool to do things like manage user quotas and perform
  graceful maintenance on machines in the cluster.

* **Aurora executor**
  The executor (a.k.a. Thermos executor) is responsible for carrying out the workloads described in
  the Aurora DSL (`.aurora` files). The executor is what actually executes user processes. It will
  also perform health checking of tasks and register tasks in ZooKeeper for the purposes of dynamic
  service discovery.

* **Aurora observer**
  The observer provides browser-based access to the status of individual tasks executing on worker
  machines. It gives insight into the processes executing, and facilitates browsing of task sandbox
  directories.

* **ZooKeeper**
  [ZooKeeper](http://zookeeper.apache.org) is a distributed consensus system. In an Aurora cluster
  it is used for reliable election of the leading Aurora scheduler and Mesos master. It is also
  used as a vehicle for service discovery; see [Service Discovery](../features/service-discovery.md).

* **Mesos master**
  The master is responsible for tracking worker machines and performing accounting of their
  resources. The scheduler interfaces with the master to control the cluster.

* **Mesos agent**
  The agent receives work assigned by the scheduler and executes it. It interfaces with Linux
  isolation systems like cgroups, namespaces and Docker to manage the resource consumption of tasks.
  When a user task is launched, the agent will launch the executor (in the context of a Linux cgroup
  or Docker container depending upon the environment), which will in turn fork user processes.


Jobs, Tasks and Processes
--------------------------

Aurora is a Mesos framework used to schedule *jobs* onto Mesos.
Mesos cares about individual *tasks*, but typical jobs consist of dozens or
hundreds of task replicas. Aurora provides a layer on top of Mesos with
its `Job` abstraction: an Aurora `Job` consists of one task template and
instructions for creating near-identical replicas of that task (modulo
things like "instance id" or specific port numbers, which may differ from
machine to machine). These replicas are otherwise referred to as
"instances" or "shards".

A task can merely be a single *process* corresponding to a single
command line, such as `python2.7 my_script.py`. However, a task can also
consist of many separate processes, which all run within a single
sandbox, for example multiple cooperating agents such as `logrotate`,
`installer`, master, or slave processes. This is where Thermos comes in.
While Aurora provides a `Job` abstraction on top of Mesos `Task`s,
Thermos provides a `Process` abstraction underneath Mesos `Task`s and
serves as part of the Aurora framework's executor.

You define `Job`s, `Task`s, and `Process`es in a configuration file.
Configuration files are written in Python, and make use of the
[Pystachio](https://github.com/wickman/pystachio) templating language,
along with specific Aurora, Mesos, and Thermos commands and methods.
The configuration files typically end with a `.aurora` extension.

Summary:

* Aurora manages jobs made of tasks.
* Mesos manages tasks made of processes.
* Thermos manages processes.
* All of this is defined in `.aurora` configuration files.

![Aurora hierarchy](../images/aurora_hierarchy.png)

Each `Task` has a *sandbox* created when the `Task` starts and garbage
collected when it finishes. All of a `Task`'s processes run in its
sandbox, so processes can share state by using a shared current working
directory.

The sandbox garbage collection policy considers many factors, most
importantly age and size. It makes a best-effort attempt to keep
sandboxes around as long as possible post-task in order for service
owners to inspect data and logs, should the `Task` have completed
abnormally. But you can't design your applications assuming sandboxes
will be around forever; plan for this, e.g. by building log saving or
other checkpointing mechanisms directly into your application or into
your `Job` description.

Added: aurora/site/source/documentation/0.13.0/getting-started/tutorial.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.13.0/getting-started/tutorial.md?rev=1739360&view=auto
==============================================================================
--- aurora/site/source/documentation/0.13.0/getting-started/tutorial.md (added)
+++ aurora/site/source/documentation/0.13.0/getting-started/tutorial.md Fri Apr 15 20:21:30 2016
@@ -0,0 +1,258 @@

# Aurora Tutorial

This tutorial shows how to use the Aurora scheduler to run (and "`printf-debug`")
a hello world program on Mesos. This is the recommended document for new Aurora users
to start getting up to speed on the system.
- [Prerequisite](#prerequisite)
- [The Script](#the-script)
- [Aurora Configuration](#aurora-configuration)
- [Creating the Job](#creating-the-job)
- [Watching the Job Run](#watching-the-job-run)
- [Cleanup](#cleanup)
- [Next Steps](#next-steps)


## Prerequisite

This tutorial assumes you are running [Aurora locally using Vagrant](vagrant.md).
However, in general the instructions are also applicable to any other
[Aurora installation](../operations/installation.md).

Unless otherwise stated, all commands are to be run from the root of the aurora
repository clone.


## The Script

Our "hello world" application is a simple Python script that loops
forever, displaying the time every few seconds. Copy the code below and
put it in a file named `hello_world.py` in the root of your Aurora repository clone
(note: this directory is the same as `/vagrant` inside the Vagrant VMs).

The script has an intentional bug, which we will explain later on.

```python
import time

def main():
  SLEEP_DELAY = 10
  # Python experts - ignore this blatant bug.
  for i in xrang(100):
    print("Hello world! The time is now: %s. Sleeping for %d secs" % (
      time.asctime(), SLEEP_DELAY))
    time.sleep(SLEEP_DELAY)

if __name__ == "__main__":
  main()
```

## Aurora Configuration

Once we have our script/program, we need to create a *configuration
file* that tells Aurora how to manage and launch our Job. Save the below
code in the file `hello_world.aurora`.

```python
pkg_path = '/vagrant/hello_world.py'

# we use a trick here to make the configuration change with
# the contents of the file, for simplicity. in a normal setting, packages would be
# versioned, and the version number would be changed in the configuration.
import hashlib
with open(pkg_path, 'rb') as f:
  pkg_checksum = hashlib.md5(f.read()).hexdigest()

# copy hello_world.py into the local sandbox
install = Process(
  name = 'fetch_package',
  cmdline = 'cp %s . && echo %s && chmod +x hello_world.py' % (pkg_path, pkg_checksum))

# run the script
hello_world = Process(
  name = 'hello_world',
  cmdline = 'python -u hello_world.py')

# describe the task
hello_world_task = SequentialTask(
  processes = [install, hello_world],
  resources = Resources(cpu = 1, ram = 1*MB, disk = 8*MB))

jobs = [
  Service(cluster = 'devcluster',
          environment = 'devel',
          role = 'www-data',
          name = 'hello_world',
          task = hello_world_task)
]
```

There is a lot going on in that configuration file:

1. From a "big picture" viewpoint, it first defines two
Processes. Then it defines a Task that runs the two Processes in the
order specified in the Task definition, as well as specifying what
computational and memory resources are available for them. Finally,
it defines a Job that will schedule the Task on available and suitable
machines. This Job is the sole member of a list of Jobs; you can
specify more than one Job in a config file.

2. At the Process level, it specifies how to get your code into the
local sandbox in which it will run. It then specifies how the code is
actually run once the second Process starts.

For more about Aurora configuration files, see the [Configuration
Tutorial](../reference/configuration-tutorial.md) and the [Configuration
Reference](../reference/configuration.md) (preferably after finishing this
tutorial).


## Creating the Job

We're ready to launch our job! To do so, we use the Aurora Client to
issue a Job creation request to the Aurora scheduler.
Many Aurora Client commands take a *job key* argument, which uniquely
identifies a Job. A job key consists of four parts, each separated by a
"/". The four parts are `<cluster>/<role>/<environment>/<jobname>`
in that order:

* Cluster refers to the name of a particular Aurora installation.
* Role names are user accounts existing on the slave machines. If you
don't know what accounts are available, contact your sysadmin.
* Environment names are namespaces; you can count on `test`, `devel`,
`staging` and `prod` existing.
* Jobname is the custom name of your job.

When comparing two job keys, if any of the four parts is different from
its counterpart in the other key, then the two job keys identify two separate
jobs. If all four values are identical, the job keys identify the same job.

The `clusters.json` [client configuration](../reference/client-cluster-configuration.md)
for the Aurora scheduler defines the available cluster names.
For Vagrant, from the top-level of your Aurora repository clone, do:

    $ vagrant ssh

Followed by:

    vagrant@aurora:~$ cat /etc/aurora/clusters.json

You'll see something like the following. The `name` value shown here corresponds to a job key's cluster value.

```javascript
[{
  "name": "devcluster",
  "zk": "192.168.33.7",
  "scheduler_zk_path": "/aurora/scheduler",
  "auth_mechanism": "UNAUTHENTICATED",
  "slave_run_directory": "latest",
  "slave_root": "/var/lib/mesos"
}]
```

The Aurora Client command that actually runs our Job is `aurora job create`. It creates a Job as
specified by its job key and configuration file arguments and runs it.

    aurora job create <cluster>/<role>/<environment>/<jobname> <config_file>

Or for our example:

    aurora job create devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora

After entering our virtual machine using `vagrant ssh`, this returns:

    vagrant@aurora:~$ aurora job create devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora
     INFO] Creating job hello_world
     INFO] Checking status of devcluster/www-data/devel/hello_world
    Job create succeeded: job url=http://aurora.local:8081/scheduler/www-data/devel/hello_world


## Watching the Job Run

Now that our job is running, let's see what it's doing. Access the
scheduler web interface at `http://$scheduler_hostname:$scheduler_port/scheduler`,
or, when using Vagrant, `http://192.168.33.7:8081/scheduler`.
First we see what Jobs are scheduled:

![Scheduled Jobs](../images/ScheduledJobs.png)

Click on your user name, which in this case was `www-data`, and we see the Jobs associated
with that role:

![Role Jobs](../images/RoleJobs.png)

If you click on your `hello_world` Job, you'll see:

![hello_world Job](../images/HelloWorldJob.png)

Oops, looks like our first job didn't quite work! The task is temporarily throttled for
having failed on every attempt of the Aurora scheduler to run it. We have to figure out
what is going wrong.

On the Completed tasks tab, we see all past attempts of the Aurora scheduler to run our job.

![Completed tasks tab](../images/CompletedTasks.png)

We can navigate to the Task page of a failed run by clicking on the host link.

![Task page](../images/TaskBreakdown.png)

Once there, we see that the `hello_world` process failed. The Task page
captures the standard error and standard output streams and makes them available.
Clicking through to `stderr` on the failed `hello_world` process, we see what happened.

![stderr page](../images/stderr.png)

It looks like we made a typo in our Python script.
We wanted `xrange`,
not `xrang`. Edit the `hello_world.py` script to use the correct function
and save it as `hello_world_v2.py`. Then update the `pkg_path` in the
`hello_world.aurora` configuration to point at the new file.

In order to try again, we can now instruct the scheduler to update our job:

    vagrant@aurora:~$ aurora update start devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora
     INFO] Starting update for: hello_world
    Job update has started. View your update progress at http://aurora.local:8081/scheduler/www-data/devel/hello_world/update/8ef38017-e60f-400d-a2f2-b5a8b724e95b

This time, the task comes up.

![Running Job](../images/RunningJob.png)

By again clicking on the host, we inspect the Task page, and see that the
`hello_world` process is running.

![Running Task page](../images/runningtask.png)

We then inspect the output by clicking on `stdout` and see our process'
output:

![stdout page](../images/stdout.png)

## Cleanup

Now that we're done, we kill the job using the Aurora client:

    vagrant@aurora:~$ aurora job killall devcluster/www-data/devel/hello_world
     INFO] Killing tasks for job: devcluster/www-data/devel/hello_world
     INFO] Instances to be killed: [0]
    Successfully killed instances [0]
    Job killall succeeded

The job page now shows the `hello_world` tasks as completed.

![Killed Task page](../images/killedtask.png)

## Next Steps

Now that you've finished this Tutorial, you should read or do the following:

- [The Aurora Configuration Tutorial](../reference/configuration-tutorial.md), which provides more examples
  and best practices for writing Aurora configurations. You should also look at
  the [Aurora Configuration Reference](../reference/configuration.md).
- Explore the Aurora Client - use `aurora -h`, and read the
  [Aurora Client Commands](../reference/client-commands.md) document.

Added: aurora/site/source/documentation/0.13.0/getting-started/vagrant.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.13.0/getting-started/vagrant.md?rev=1739360&view=auto
==============================================================================
--- aurora/site/source/documentation/0.13.0/getting-started/vagrant.md (added)
+++ aurora/site/source/documentation/0.13.0/getting-started/vagrant.md Fri Apr 15 20:21:30 2016
@@ -0,0 +1,137 @@

A local Cluster with Vagrant
============================

This document shows you how to configure a complete cluster using a virtual machine. This setup
replicates a real cluster on your development machine as closely as possible. After you complete
the steps outlined here, you will be ready to create and run your first Aurora job.

The following sections describe these steps in detail:

1. [Overview](#user-content-overview)
1. [Install VirtualBox and Vagrant](#user-content-install-virtualbox-and-vagrant)
1. [Clone the Aurora repository](#user-content-clone-the-aurora-repository)
1. [Start the local cluster](#user-content-start-the-local-cluster)
1. [Log onto the VM](#user-content-log-onto-the-vm)
1. [Run your first job](#user-content-run-your-first-job)
1. [Rebuild components](#user-content-rebuild-components)
1. [Shut down or delete your local cluster](#user-content-shut-down-or-delete-your-local-cluster)
1. [Troubleshooting](#user-content-troubleshooting)


Overview
--------

The Aurora distribution includes a set of scripts that enable you to create a local cluster on
your development machine.
These scripts use [Vagrant](https://www.vagrantup.com/) and
[VirtualBox](https://www.virtualbox.org/) to run and configure a virtual machine. Once the
virtual machine is running, the scripts install and initialize Aurora and any required components
to create the local cluster.


Install VirtualBox and Vagrant
------------------------------

First, download and install [VirtualBox](https://www.virtualbox.org/) on your development machine.

Then download and install [Vagrant](https://www.vagrantup.com/). To verify that the installation
was successful, open a terminal window and type the `vagrant` command. You should see a list of
common commands for this tool.


Clone the Aurora repository
---------------------------

To obtain the Aurora source distribution, clone its Git repository using the following command:

    git clone git://git.apache.org/aurora.git


Start the local cluster
-----------------------

Now change into the `aurora/` directory, which contains the Aurora source code and
other scripts and tools:

    cd aurora/

To start the local cluster, type the following command:

    vagrant up

This command uses the configuration scripts in the Aurora distribution to:

* Download a Linux system image.
* Start a virtual machine (VM) and configure it.
* Install the required build tools on the VM.
* Install Aurora's requirements (like [Mesos](http://mesos.apache.org/) and
[Zookeeper](http://zookeeper.apache.org/)) on the VM.
* Build and install Aurora from source on the VM.
* Start Aurora's services on the VM.

This process takes several minutes to complete.

To verify that Aurora is running on the cluster, visit the following URLs:

* Scheduler - http://192.168.33.7:8081
* Observer - http://192.168.33.7:1338
* Mesos Master - http://192.168.33.7:5050
* Mesos Slave - http://192.168.33.7:5051


Log onto the VM
---------------

To SSH into the VM, run the following command on your development machine:

    vagrant ssh

To verify that Aurora is installed in the VM, type the `aurora` command. You should see a list
of arguments and possible commands.

The `/vagrant` directory on the VM is mapped to the `aurora/` local directory
from which you started the cluster. You can edit files inside this directory on your development
machine and access them from the VM under `/vagrant`.

A pre-installed `clusters.json` file refers to your local cluster as `devcluster`, which you
will use in client commands.


Run your first job
------------------

Now that your cluster is up and running, you are ready to define and run your first job in Aurora.
For more information, see the [Aurora Tutorial](tutorial.md).


Rebuild components
------------------

If you are changing Aurora code and would like to rebuild a component, you can use the `aurorabuild`
command on the VM to build and restart a component. This is considerably faster than destroying
and rebuilding your VM.

`aurorabuild` accepts a list of components to build and update; invoke it with no arguments to get
the list of supported components. For example, to rebuild and restart the client:

    vagrant ssh -c 'aurorabuild client'


Shut down or delete your local cluster
--------------------------------------

To shut down your local cluster, run the `vagrant halt` command on your development machine. To
start it again, run the `vagrant up` command.
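For example, a typical stop/start cycle, run from the root of your Aurora repository clone, looks
like this:

    vagrant halt   # shut down the cluster VM, preserving its state
    vagrant up     # boot the VM again; Aurora services come back up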
Once you are finished with your local cluster, or if you would otherwise like to start from scratch,
you can use the command `vagrant destroy` to shut down and delete the virtual machine.


Troubleshooting
---------------

Most Vagrant-related problems can be fixed by the following steps:

* Destroying the vagrant environment with `vagrant destroy`
* Killing any orphaned VMs (see AURORA-499) with the `virtualbox` UI or the `VBoxManage` command line tool
* Cleaning the repository of build artifacts and other intermediate output with `git clean -fdx`
* Bringing up the vagrant environment with `vagrant up`

Added (binary image files, no diffs available; svn:mime-type = application/octet-stream):

    aurora/site/source/documentation/0.13.0/images/CPUavailability.png
    aurora/site/source/documentation/0.13.0/images/CompletedTasks.png
    aurora/site/source/documentation/0.13.0/images/HelloWorldJob.png
    aurora/site/source/documentation/0.13.0/images/RoleJobs.png
    aurora/site/source/documentation/0.13.0/images/RunningJob.png
    aurora/site/source/documentation/0.13.0/images/ScheduledJobs.png
    aurora/site/source/documentation/0.13.0/images/TaskBreakdown.png
    aurora/site/source/documentation/0.13.0/images/aurora_hierarchy.png
    aurora/site/source/documentation/0.13.0/images/aurora_logo.png
    aurora/site/source/documentation/0.13.0/images/components.odg
    aurora/site/source/documentation/0.13.0/images/components.png
    aurora/site/source/documentation/0.13.0/images/debug-client-test.png
    aurora/site/source/documentation/0.13.0/images/debugging-client-test.png
    aurora/site/source/documentation/0.13.0/images/killedtask.png
    aurora/site/source/documentation/0.13.0/images/lifeofatask.png
    aurora/site/source/documentation/0.13.0/images/presentations/02_19_2015_aurora_adopters_panel_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/02_19_2015_aurora_at_tellapart_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/02_19_2015_aurora_at_twitter_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/02_28_2015_apache_aurora_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/03_07_2015_aurora_mesos_in_practice_at_twitter_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/03_25_2014_introduction_to_aurora_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/04_30_2015_monolith_to_microservices_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/08_21_2014_past_present_future_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/09_20_2015_shipping_code_with_aurora_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/09_20_2015_twitter_production_scale_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/10_08_2015_mesos_aurora_on_a_small_scale_thumb.png
    aurora/site/source/documentation/0.13.0/images/presentations/10_08_2015_sla_aware_maintenance_for_operators_thumb.png
    aurora/site/source/documentation/0.13.0/images/runningtask.png
    aurora/site/source/documentation/0.13.0/images/stderr.png
    aurora/site/source/documentation/0.13.0/images/stdout.png
    aurora/site/source/documentation/0.13.0/images/storage_hierarchy.png

Added: aurora/site/source/documentation/0.13.0/index.html.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.13.0/index.html.md?rev=1739360&view=auto
==============================================================================
--- aurora/site/source/documentation/0.13.0/index.html.md (added)
+++ aurora/site/source/documentation/0.13.0/index.html.md Fri Apr 15 20:21:30 2016
@@ -0,0 +1,73 @@

## Introduction

Apache Aurora is a service scheduler that runs on top of Apache Mesos, enabling you to run
long-running services, cron jobs, and ad-hoc jobs that take advantage of Apache Mesos' scalability,
fault-tolerance, and resource isolation.

We encourage you to ask questions on the [Aurora user list](http://aurora.apache.org/community/) or
the `#aurora` IRC channel on `irc.freenode.net`.


## Getting Started
Information for everyone new to Apache Aurora.

 * [Aurora System Overview](getting-started/overview.md)
 * [Hello World Tutorial](getting-started/tutorial.md)
 * [Local cluster with Vagrant](getting-started/vagrant.md)

## Features
Description of important Aurora features.

 * [Containers](features/containers.md)
 * [Cron Jobs](features/cron-jobs.md)
 * [Job Updates](features/job-updates.md)
 * [Multitenancy](features/multitenancy.md)
 * [Resource Isolation](features/resource-isolation.md)
 * [Scheduling Constraints](features/constraints.md)
 * [Services](features/services.md)
 * [Service Discovery](features/service-discovery.md)
 * [SLA Metrics](features/sla-metrics.md)

## Operators
For those who wish to manage and fine-tune an Aurora cluster.

 * [Installation](operations/installation.md)
 * [Configuration](operations/configuration.md)
 * [Monitoring](operations/monitoring.md)
 * [Security](operations/security.md)
 * [Storage](operations/storage.md)
 * [Backup](operations/backup-restore.md)

## Reference
The complete reference of commands, configuration options, and scheduler internals.
 * [Task lifecycle](reference/task-lifecycle.md)
 * Configuration (`.aurora` files)
    - [Configuration Reference](reference/configuration.md)
    - [Configuration Tutorial](reference/configuration-tutorial.md)
    - [Configuration Best Practices](reference/configuration-best-practices.md)
    - [Configuration Templating](reference/configuration-templating.md)
 * Aurora Client
    - [Client Commands](reference/client-commands.md)
    - [Client Hooks](reference/client-hooks.md)
    - [Client Cluster Configuration](reference/client-cluster-configuration.md)
 * [Scheduler Configuration](reference/scheduler-configuration.md)

## Additional Resources
 * [Tools integrating with Aurora](additional-resources/tools.md)
 * [Presentation videos and slides](additional-resources/presentations.md)

## Developers
All the information you need to start modifying Aurora and contributing back to the project.

 * [Contributing to the project](contributing/)
 * [Committer's Guide](development/committers-guide.md)
 * [Design Documents](development/design-documents.md)
 * Developing the Aurora components:
    - [Client](development/client.md)
    - [Scheduler](development/scheduler.md)
    - [Scheduler UI](development/ui.md)
    - [Thermos](development/thermos.md)
    - [Thrift structures](development/thrift.md)

Added: aurora/site/source/documentation/0.13.0/operations/backup-restore.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.13.0/operations/backup-restore.md?rev=1739360&view=auto
==============================================================================
--- aurora/site/source/documentation/0.13.0/operations/backup-restore.md (added)
+++ aurora/site/source/documentation/0.13.0/operations/backup-restore.md Fri Apr 15 20:21:30 2016
@@ -0,0 +1,91 @@

# Recovering from a Scheduler Backup

**Be sure to read the entire page before attempting to restore from a backup, as it may have
unintended consequences.**

# Summary

The restoration procedure replaces the existing (possibly corrupted) Mesos replicated log with an
earlier, backed-up version and requires all schedulers to be taken down temporarily while
restoring. Once completed, the scheduler state resets to what it was when the backup was created.
This means any jobs/tasks created or updated after the backup are unknown to the scheduler and will
be killed shortly after the cluster restarts. All other tasks continue operating as normal.

Usually, it is a bad idea to restore a backup that is not extremely recent (i.e. older than a few
hours). This is because the scheduler will expect the cluster to look exactly as the backup does,
so any tasks that have been rescheduled since the backup was taken will be killed.

The instructions below have been verified in the [Vagrant environment](../getting-started/vagrant.md)
and, with minor syntax/path changes, should be applicable to any Aurora cluster.

# Preparation

Follow these steps to prepare the cluster for restoring from a backup:

* Stop all scheduler instances.

* Consider blocking external traffic on the port defined in `-http_port` for all schedulers to
prevent users from interacting with the scheduler during the restoration process. This will help
troubleshooting by reducing the scheduler log noise and prevent users from making changes that will
be erased after the backup snapshot is restored.
* Configure `aurora_admin` access to run all commands listed in the
  [Restore from backup](#restore-from-backup) section locally on the leading scheduler:
  * Make sure the [clusters.json](../reference/client-cluster-configuration.md) file is configured
    to access the scheduler directly: set the `scheduler_uri` setting and remove `zk`. Since the
    leader can be re-elected during the restore steps, consider doing this on all scheduler replicas.
  * Depending on your particular security approach, you will need to either turn off scheduler
    authorization by removing the scheduler `-http_authentication_mechanism` flag, or make sure the
    direct scheduler access is properly authorized. E.g., in the case of Kerberos you will need to
    change the `/etc/hosts` file to map your local IP to the scheduler URL configured in the keytabs.

* The next steps are required to put the scheduler into a partially disabled state where it is still
able to accept storage recovery requests but unable to schedule or change task states. This may be
accomplished by updating the following scheduler configuration options:
  * Set `-mesos_master_address` to a non-existent zk address. This will prevent the scheduler from
    registering with Mesos. E.g.: `-mesos_master_address=zk://localhost:1111/mesos/master`
  * Set `-max_registration_delay` to a sufficiently long interval to prevent a registration timeout
    and the resulting scheduler suicide. E.g.: `-max_registration_delay=360mins`
  * Make sure the `-reconciliation_initial_delay` option is set high enough (e.g.: `365days`) to
    prevent accidental task GC. This is important, as the scheduler will attempt to reconcile the
    cluster state and will kill all tasks when restarted with an empty Mesos replicated log.

* Restart all schedulers.

# Cleanup and re-initialize Mesos replicated log

Get rid of the corrupted files and re-initialize the Mesos replicated log:

* Stop the schedulers.
* Delete all files under `-native_log_file_path` on all schedulers.
* Initialize the Mesos replica's log file: `sudo mesos-log initialize --path=<-native_log_file_path>`
* Start the schedulers.

# Restore from backup

At this point the scheduler is ready to rehydrate from the backup:

* Identify the leading scheduler by:
  * examining the `scheduler_lifecycle_LEADER_AWAITING_REGISTRATION` metric at the scheduler
    `/vars` endpoint. The leader will report 1; all other replicas 0.
  * examining scheduler logs
  * or examining the ZooKeeper registration under the path defined by `-zk_endpoints`
    and `-serverset_path`

* Locate the desired backup file, copy it to the leading scheduler's `-backup_dir` folder and stage
recovery by running the following command on the leader:
`aurora_admin scheduler_stage_recovery --bypass-leader-redirect <cluster> scheduler-backup-<timestamp>`

* At this point, the recovery snapshot is staged and available for manual verification/modification
via the `aurora_admin scheduler_print_recovery_tasks --bypass-leader-redirect` and
`scheduler_delete_recovery_tasks --bypass-leader-redirect` commands.
See `aurora_admin help <subcommand>` for usage details.

* Commit recovery. This instructs the scheduler to overwrite the existing Mesos replicated log with
the provided backup snapshot and initiate a mandatory failover:
`aurora_admin scheduler_commit_recovery --bypass-leader-redirect <cluster>`

# Cleanup
Undo any modifications made during the [Preparation](#preparation) sequence.
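For orientation, the [Restore from backup](#restore-from-backup) sequence condenses to something
like the following session on the leading scheduler. This is a sketch only: the cluster name
`devcluster` and the backup file name are illustrative placeholders; substitute your own.

    # Stage the backup snapshot for recovery.
    aurora_admin scheduler_stage_recovery --bypass-leader-redirect devcluster scheduler-backup-2016-04-15-20-21

    # Optionally inspect the staged tasks before committing.
    aurora_admin scheduler_print_recovery_tasks --bypass-leader-redirect devcluster

    # Overwrite the replicated log with the snapshot and force a failover.
    aurora_admin scheduler_commit_recovery --bypass-leader-redirect devcluster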
Added: aurora/site/source/documentation/0.13.0/operations/configuration.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.13.0/operations/configuration.md?rev=1739360&view=auto
==============================================================================
--- aurora/site/source/documentation/0.13.0/operations/configuration.md (added)
+++ aurora/site/source/documentation/0.13.0/operations/configuration.md Fri Apr 15 20:21:30 2016
@@ -0,0 +1,187 @@

# Scheduler Configuration

The Aurora scheduler can take a variety of configuration options through command-line arguments.
Examples are available under `examples/scheduler/`. For a list of available Aurora flags and their
documentation, see [Scheduler Configuration Reference](../reference/scheduler-configuration.md).


## A Note on Configuration
Like Mesos, Aurora uses command-line flags for runtime configuration. As such, the Aurora
"configuration file" is typically a `scheduler.sh` shell script of the following form:

    #!/bin/bash
    AURORA_HOME=/usr/local/aurora-scheduler

    # Flags controlling the JVM.
    JAVA_OPTS=(
      -Xmx2g
      -Xms2g
      # GC tuning, etc.
    )

    # Flags controlling the scheduler.
    AURORA_FLAGS=(
      # Port for client RPCs and the web UI
      -http_port=8081
      # Log configuration, etc.
    )

    # Environment variables controlling libmesos
    export JAVA_HOME=...
    export GLOG_v=1
    # Port used to communicate with the Mesos master and for the replicated log
    export LIBPROCESS_PORT=8083

    JAVA_OPTS="${JAVA_OPTS[*]}" exec "$AURORA_HOME/bin/aurora-scheduler" "${AURORA_FLAGS[@]}"

That way Aurora's current flags are visible in `ps` and in the `/vars` admin endpoint.


## Replicated Log Configuration

Aurora schedulers use ZooKeeper to discover log replicas and elect a leader. Only one scheduler is
leader at a given time; the other schedulers follow log writes and prepare to take over as leader,
but do not communicate with the Mesos master. Either 3 or 5 schedulers are recommended in a
production deployment, depending on failure tolerance, and they must have persistent storage.

Below is a summary of scheduler storage configuration flags that either don't have default values
or require attention before deploying in a production environment.

### `-native_log_quorum_size`
Defines the Mesos replicated log quorum size. In a cluster with `N` schedulers, the flag
`-native_log_quorum_size` should be set to `floor(N/2) + 1`. So in a cluster with 1 scheduler
it should be set to `1`, in a cluster with 3 it should be set to `2`, and in a cluster of 5 it
should be set to `3`.

  Number of schedulers (N) | `-native_log_quorum_size` setting (`floor(N/2) + 1`)
  ------------------------ | -----------------------------------------------------
  1                        | 1
  3                        | 2
  5                        | 3
  7                        | 4

*Incorrectly setting this flag will cause data corruption to occur!*

### `-native_log_file_path`
Location of the Mesos replicated log files. Consider allocating a dedicated disk (preferably SSD)
for Mesos replicated log files to ensure optimal storage performance.

### `-native_log_zk_group_path`
ZooKeeper path used for Mesos replicated log quorum discovery.

See the [code](../../src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java) for
other available Mesos replicated log configuration options and default values.

### Changing the Quorum Size
Special care needs to be taken when changing the size of the Aurora scheduler quorum.
Since Aurora uses a Mesos replicated log, similar steps need to be followed as when
[changing the mesos quorum size](http://mesos.apache.org/documentation/latest/operational-guide).

As a preparation, increase `-native_log_quorum_size` on each existing scheduler and restart them.
When updating from 3 to 5 schedulers, the quorum size would grow from 2 to 3.

When starting the new schedulers, use the `-native_log_quorum_size` set to the new value. Failing to
first increase the quorum size on running schedulers can in some cases result in corruption
or truncation of the replicated log used by Aurora. In that case, see the documentation on
[recovering from backup](backup-restore.md).


## Backup Configuration

Configuration options for the Aurora scheduler backup manager.

### `-backup_interval`
The interval on which the scheduler writes local storage backups. The default is every hour.

### `-backup_dir`
Directory to write backups to.

### `-max_saved_backups`
Maximum number of backups to retain before deleting the oldest backup(s).


## Process Logs

### Log destination
By default, Thermos will write process stdout/stderr to log files in the sandbox. Process object
configuration allows specifying alternate log file destinations like streamed stdout/stderr or
suppression of all log output. Default behavior can be configured for the entire cluster with the
following flag (through the `-thermos_executor_flags` argument to the Aurora scheduler):

    --runner-logger-destination=both

The `both` configuration sends logs to files and streams them to the parent stdout/stderr outputs.

See the [Configuration Reference](../reference/configuration.md#logger) for all destination options.

### Log rotation
By default, Thermos will not rotate the stdout/stderr logs from child processes, and they will grow
without bound. An individual user may change this behavior via configuration on the Process object,
but it may also be desirable to change the default configuration for the entire cluster.
In order to enable rotation by default, the following flags can be applied to Thermos (through the
`-thermos_executor_flags` argument to the Aurora scheduler):

    --runner-logger-mode=rotate
    --runner-rotate-log-size-mb=100
    --runner-rotate-log-backups=10

In the above example, each instance of the Thermos runner will rotate stderr/stdout logs once they
reach 100 MiB in size and keep a maximum of 10 backups. If a user has provided a custom setting for
their process, it will override these default settings.


## Thermos Executor Wrapper

If you need to do computation before starting the thermos executor (for example, setting a different
`--announcer-hostname` parameter for every executor), then the thermos executor should be invoked
inside a wrapper script. In such a case, the aurora scheduler should be started with
`-thermos_executor_path` pointing to the wrapper script and `-thermos_executor_resources`
set to a comma-separated string of all the resources that should be copied into
the sandbox (including the original thermos executor).

For example, to wrap the executor inside a simple wrapper, the scheduler will be started like this:

    -thermos_executor_path=/path/to/wrapper.sh -thermos_executor_resources=/usr/share/aurora/bin/thermos_executor.pex
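A minimal sketch of such a wrapper follows. It assumes the executor pex listed in
`-thermos_executor_resources` is copied into the sandbox next to the script; the hostname
computation shown is purely illustrative.

    #!/bin/bash
    # wrapper.sh (hypothetical): compute a per-host value, then hand off to the
    # real executor, forwarding whatever flags the scheduler passed to the wrapper.
    ANNOUNCER_HOSTNAME="$(hostname -f)"
    exec ./thermos_executor.pex --announcer-hostname="${ANNOUNCER_HOSTNAME}" "$@"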
The [docker containerizer](http://mesos.apache.org/documentation/latest/docker-containerizer/)
must be enabled on the mesos slaves by launching them with the `--containerizers=docker,mesos` option.

By default, Aurora will configure Mesos to copy the file specified in `-thermos_executor_path`
into the container's sandbox. If using a wrapper script to launch the thermos executor,
specify the path to the wrapper in that argument. In addition, the path to the executor pex itself
must be included in the `-thermos_executor_resources` option. Doing so will ensure that both the
wrapper script and executor are correctly copied into the sandbox. Finally, ensure the wrapper
script does not access resources outside of the sandbox, as when the script is run from within a
docker container those resources will not exist.

A scheduler flag, `-global_container_mounts`, allows mounting paths from the host (i.e., the slave)
into all containers on that host. The format is a comma-separated list of `host_path:container_path[:mode]`
tuples. For example `-global_container_mounts=/opt/secret_keys_dir:/mnt/secret_keys_dir:ro` mounts
`/opt/secret_keys_dir` from the slaves into all launched containers. Valid modes are `ro` and `rw`.

If you would like to run a container with a read-only filesystem, it may also be necessary to
use the scheduler flag `-thermos_home_in_sandbox` in order to set HOME to the sandbox
before the executor runs. This will make sure that the executor/runner PEX extraction happens
inside the sandbox instead of the container filesystem root.

If you would like to supply your own parameters to `docker run` when launching jobs in docker
containers, you may use the following flags:

    -allow_docker_parameters
    -default_docker_parameters

`-allow_docker_parameters` controls whether or not users may pass their own configuration parameters
through the job configuration files. If set to `false` (the default), the scheduler will reject
jobs with custom parameters. *NOTE*: this setting should be used with caution, as it allows any job
owner to specify any parameters they wish, including those that may introduce security concerns
(`privileged=true`, for example).

`-default_docker_parameters` allows a cluster operator to specify a universal set of parameters that
should be used for every container that does not have parameters explicitly configured at the job
level. The argument accepts a multimap format:

    -default_docker_parameters="read-only=true,tmpfs=/tmp,tmpfs=/run"

Added: aurora/site/source/documentation/0.13.0/operations/installation.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.13.0/operations/installation.md?rev=1739360&view=auto
==============================================================================
--- aurora/site/source/documentation/0.13.0/operations/installation.md (added)
+++ aurora/site/source/documentation/0.13.0/operations/installation.md Fri Apr 15 20:21:30 2016
@@ -0,0 +1,324 @@

# Installing Aurora

Source and binary distributions can be found on our
[downloads](https://aurora.apache.org/downloads/) page. Installing from binary packages is
recommended for most users.
Added: aurora/site/source/documentation/0.13.0/operations/installation.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.13.0/operations/installation.md?rev=1739360&view=auto
==============================================================================
--- aurora/site/source/documentation/0.13.0/operations/installation.md (added)
+++ aurora/site/source/documentation/0.13.0/operations/installation.md Fri Apr 15 20:21:30 2016
@@ -0,0 +1,324 @@
+# Installing Aurora
+
+Source and binary distributions can be found on our
+[downloads](https://aurora.apache.org/downloads/) page. Installing from binary packages is
+recommended for most users.
+
+- [Installing the scheduler](#installing-the-scheduler)
+- [Installing worker components](#installing-worker-components)
+- [Installing the client](#installing-the-client)
+- [Installing Mesos](#installing-mesos)
+- [Troubleshooting](#troubleshooting)
+
+If our binary packages don't suit you, our package build toolchain makes it easy to build your
+own packages. See the [instructions](https://github.com/apache/aurora-packaging) to learn how.
+
+
+## Machine profiles
+
+Given that many of these components communicate over the network, there are numerous ways you could
+assemble them to create an Aurora cluster. The simplest way is to think in terms of three machine
+profiles:
+
+### Coordinator
+**Components**: ZooKeeper, Aurora scheduler, Mesos master
+
+A small number of machines (typically 3 or 5) responsible for cluster orchestration. It is fine to
+co-locate these components in anything but very large clusters (> 1000 machines). Beyond that
+point, operators will likely want to manage these services on separate machines.
+
+In practice, 5 coordinators have been shown to reliably manage clusters with tens of thousands of
+machines.
+
+### Worker
+**Components**: Aurora executor, Aurora observer, Mesos agent
+
+The bulk of the cluster, where services will actually run.
+
+### Client
+**Components**: Aurora client, Aurora admin client
+
+Any machines that users submit jobs from.
+
+
+## Installing the scheduler
+### Ubuntu Trusty
+
+1. Install Mesos
+   Skip down to [install mesos](#mesos-on-ubuntu-trusty), then run:
+
+        sudo start mesos-master
+
+2. Install ZooKeeper
+
+        sudo apt-get install -y zookeeperd
+
+3. Install the Aurora scheduler
+
+        sudo add-apt-repository -y ppa:openjdk-r/ppa
+        sudo apt-get update
+        sudo apt-get install -y openjdk-8-jre-headless wget
+
+        sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
+
+        wget -c https://apache.bintray.com/aurora/ubuntu-trusty/aurora-scheduler_0.12.0_amd64.deb
+        sudo dpkg -i aurora-scheduler_0.12.0_amd64.deb
+
+### CentOS 7
+
+1. Install Mesos
+   Skip down to [install mesos](#mesos-on-centos-7), then run:
+
+        sudo systemctl start mesos-master
+
+2. Install ZooKeeper
+
+        sudo rpm -Uvh https://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
+        sudo yum install -y java-1.8.0-openjdk-headless zookeeper-server
+
+        sudo service zookeeper-server init
+        sudo systemctl start zookeeper-server
+
+3. Install the Aurora scheduler
+
+        sudo yum install -y wget
+
+        wget -c https://apache.bintray.com/aurora/centos-7/aurora-scheduler-0.12.0-1.el7.centos.aurora.x86_64.rpm
+        sudo yum install -y aurora-scheduler-0.12.0-1.el7.centos.aurora.x86_64.rpm
+
+### Finalizing
+By default, the scheduler will start in an uninitialized mode. This is because external
+coordination is necessary to ensure that operator error does not result in a quorum of schedulers
+starting up and believing their databases are empty when in fact they should be re-joining a
+cluster.
+
+Because of this, a fresh install of the scheduler will need intervention to start up. First,
+stop the scheduler service.
+Ubuntu: `sudo stop aurora-scheduler`
+CentOS: `sudo systemctl stop aurora`
+
+Now initialize the database:
+
+    sudo -u aurora mkdir -p /var/lib/aurora/scheduler/db
+    sudo -u aurora mesos-log initialize --path=/var/lib/aurora/scheduler/db
+
+Now you can start the scheduler back up.
+Ubuntu: `sudo start aurora-scheduler`
+CentOS: `sudo systemctl start aurora`
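+
+As a quick sanity check (a sketch, assuming the default `-http_port=8081`), you can poll the
+scheduler's stats endpoint; once the leading scheduler has registered with the Mesos master,
+`framework_registered` will read `1` (see the [monitoring guide](monitoring.md) for details):
+
+    # Run on the scheduler host after starting the service.
+    curl -s localhost:8081/vars | grep -E '^(framework_registered|jvm_uptime_secs) '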
+
+
+## Installing worker components
+### Ubuntu Trusty
+
+1. Install Mesos
+   Skip down to [install mesos](#mesos-on-ubuntu-trusty), then run:
+
+        sudo start mesos-slave
+
+2. Install Aurora executor and observer
+
+        sudo apt-get install -y python2.7 wget
+
+        # NOTE: This appears to be a missing dependency of the mesos deb package and is needed
+        # for the python mesos native bindings.
+        sudo apt-get -y install libcurl4-nss-dev
+
+        wget -c https://apache.bintray.com/aurora/ubuntu-trusty/aurora-executor_0.12.0_amd64.deb
+        sudo dpkg -i aurora-executor_0.12.0_amd64.deb
+
+### CentOS 7
+
+1. Install Mesos
+   Skip down to [install mesos](#mesos-on-centos-7), then run:
+
+        sudo systemctl start mesos-slave
+
+2. Install Aurora executor and observer
+
+        sudo yum install -y python2 wget
+
+        wget -c https://apache.bintray.com/aurora/centos-7/aurora-executor-0.12.0-1.el7.centos.aurora.x86_64.rpm
+        sudo yum install -y aurora-executor-0.12.0-1.el7.centos.aurora.x86_64.rpm
+
+### Configuration
+The executor typically does not require configuration. Command line arguments can be passed to the
+executor via the `-thermos_executor_flags` argument on the scheduler.
+
+The observer needs to be configured to look at the correct Mesos directory in order to find task
+sandboxes. First, find the Mesos working directory by looking for the Mesos slave `--work_dir`
+flag. You should see something like:
+
+    ps -eocmd | grep "mesos-slave" | grep -v grep | tr ' ' '\n' | grep "\--work_dir"
+    --work_dir=/var/lib/mesos
+
+If the flag is not set, you can view the default value like so:
+
+    mesos-slave --help
+    Usage: mesos-slave [options]
+
+      ...
+      --work_dir=VALUE      Directory path to place framework work directories
+                            (default: /tmp/mesos)
+      ...
+
+The value you find for `--work_dir`, `/var/lib/mesos` in this example, should match the Aurora
+observer value for `--mesos-root`. You can look for that setting in a similar way on a worker
+node by grepping for `thermos_observer` and `--mesos-root`. If the flag is not set, you can view
+the default value like so:
+
+    thermos_observer -h
+    Options:
+      ...
+      --mesos-root=MESOS_ROOT
+                            The mesos root directory to search for Thermos
+                            executor sandboxes [default: /var/lib/mesos]
+      ...
+
+In this case the default is `/var/lib/mesos` and we have a match. If there is no match, you can
+either adjust the mesos-slave start script(s) and restart the slave(s) or else adjust the
+Aurora observer start scripts and restart the observers.
To adjust the Aurora observer:
+
+#### Ubuntu Trusty
+
+    sudo sh -c 'echo "MESOS_ROOT=/tmp/mesos" >> /etc/default/thermos'
+
+NB: In Aurora releases up through 0.12.0, you'll also need to edit `/etc/init/thermos.conf` like so:
+
+    diff -C 1 /etc/init/thermos.conf.orig /etc/init/thermos.conf
+    *** /etc/init/thermos.conf.orig       2016-03-22 22:34:46.286199718 +0000
+    --- /etc/init/thermos.conf    2016-03-22 17:09:49.357689038 +0000
+    ***************
+    *** 24,25 ****
+    --- 24,26 ----
+          --port=${OBSERVER_PORT:-1338} \
+    +     --mesos-root=${MESOS_ROOT:-/var/lib/mesos} \
+          --log_to_disk=NONE \
+
+#### CentOS 7
+
+Make an edit to add the `--mesos-root` flag, resulting in something like:
+
+    grep -A5 OBSERVER_ARGS /etc/sysconfig/thermos-observer
+    OBSERVER_ARGS=(
+      --port=1338
+      --mesos-root=/tmp/mesos
+      --log_to_disk=NONE
+      --log_to_stderr=google:INFO
+    )
+
+## Installing the client
+### Ubuntu Trusty
+
+    sudo apt-get install -y python2.7 wget
+
+    wget -c https://apache.bintray.com/aurora/ubuntu-trusty/aurora-tools_0.12.0_amd64.deb
+    sudo dpkg -i aurora-tools_0.12.0_amd64.deb
+
+### CentOS 7
+
+    sudo yum install -y python2 wget
+
+    wget -c https://apache.bintray.com/aurora/centos-7/aurora-tools-0.12.0-1.el7.centos.aurora.x86_64.rpm
+    sudo yum install -y aurora-tools-0.12.0-1.el7.centos.aurora.x86_64.rpm
+
+### Mac OS X
+
+    brew upgrade
+    brew install aurora-cli
+
+### Configuration
+Client configuration lives in a JSON file that describes the clusters available and how to reach
+them. By default this file is at `/etc/aurora/clusters.json`.
+
+Jobs may be submitted to the scheduler using the client, and are described with
+[job configurations](../reference/configuration.md) expressed in `.aurora` files. Typically you will
+maintain a single job configuration file to describe one or more deployment environments (e.g.
+dev, test, prod) for a production job.
+
+
+## Installing Mesos
+Mesos uses a single package for the Mesos master and slave. As a result, the package dependencies
+are identical for both.
+
+### Mesos on Ubuntu Trusty
+
+    sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
+    DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
+    CODENAME=$(lsb_release -cs)
+
+    echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
+      sudo tee /etc/apt/sources.list.d/mesosphere.list
+    sudo apt-get -y update
+
+    # Use `apt-cache showpkg mesos | grep [version]` to find the exact version.
+    sudo apt-get -y install mesos=0.25.0-0.2.70.ubuntu1404
+
+### Mesos on CentOS 7
+
+    sudo rpm -Uvh https://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-1.noarch.rpm
+    sudo yum -y install mesos-0.25.0
+
+
+
+## Troubleshooting
+So you've started your first cluster and are running into some issues? We've collected some common
+stumbling blocks and solutions here to help get you moving.
+
+### Replicated log not initialized
+
+#### Symptoms
+- Scheduler RPCs and web interface claim `Storage is not READY`
+- Scheduler log repeatedly prints messages like
+
+  ```
+  I1016 16:12:27.234133 26081 replica.cpp:638] Replica in EMPTY status
+  received a broadcasted recover request
+  I1016 16:12:27.234256 26084 recover.cpp:188] Received a recover response
+  from a replica in EMPTY status
+  ```
+
+#### Solution
+When you create a new cluster, you need to inform a quorum of schedulers that they are safe to
+consider their database to be empty by [initializing](#finalizing) the replicated log. This is
+done to prevent the scheduler from modifying the cluster state in the event of multiple
+simultaneous disk failures or, more likely, misconfiguration of the replicated log path.
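+
+Concretely, these are the same commands shown in [Finalizing](#finalizing) (assuming the database
+path used in this guide); run them on each scheduler host with the scheduler service stopped:
+
+    sudo -u aurora mkdir -p /var/lib/aurora/scheduler/db
+    sudo -u aurora mesos-log initialize --path=/var/lib/aurora/scheduler/db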
+
+
+### Scheduler not registered
+
+#### Symptoms
+Scheduler log contains
+
+    Framework has not been registered within the tolerated delay.
+
+#### Solution
+Double-check that the scheduler is configured correctly to reach the Mesos master. If you are
+registering the master in ZooKeeper, make sure the command line argument to the master:
+
+    --zk=zk://$ZK_HOST:2181/mesos/master
+
+is the same as the one on the scheduler:
+
+    -mesos_master_address=zk://$ZK_HOST:2181/mesos/master
+
+
+### Scheduler not running
+
+#### Symptoms
+The scheduler process commits suicide regularly. This happens under error conditions, but
+also deliberately at regular intervals.
+
+#### Solution
+Aurora is meant to be run under supervision. You have to configure a supervisor like
+[Monit](http://mmonit.com/monit/) or [supervisord](http://supervisord.org/) to run the scheduler
+and restart it whenever it fails or exits on purpose.
+
+Aurora supports an active health checking protocol on its admin HTTP interface: if a `GET /health`
+times out or returns anything other than `200 OK`, the scheduler process is unhealthy and should be
+restarted.
+
+For example, monit can be configured with
+
+    if failed port 8081 send "GET /health HTTP/1.0\r\n" expect "OK\n" with timeout 2 seconds for 10 cycles then restart
+
+assuming you set `-http_port=8081`.

Added: aurora/site/source/documentation/0.13.0/operations/monitoring.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/0.13.0/operations/monitoring.md?rev=1739360&view=auto
==============================================================================
--- aurora/site/source/documentation/0.13.0/operations/monitoring.md (added)
+++ aurora/site/source/documentation/0.13.0/operations/monitoring.md Fri Apr 15 20:21:30 2016
@@ -0,0 +1,181 @@
+# Monitoring your Aurora cluster
+
+Before you start running important services in your Aurora cluster, you should set up monitoring
+and alerting for Aurora itself. Most of your monitoring can be done against the scheduler, since it
+gives you a global view of what's going on.
+
+## Reading stats
+The scheduler exposes a *lot* of instrumentation data via its HTTP interface. You can get a quick
+peek at the first few of these in our vagrant image:
+
+    $ vagrant ssh -c 'curl -s localhost:8081/vars | head'
+    async_tasks_completed 1004
+    attribute_store_fetch_all_events 15
+    attribute_store_fetch_all_events_per_sec 0.0
+    attribute_store_fetch_all_nanos_per_event 0.0
+    attribute_store_fetch_all_nanos_total 3048285
+    attribute_store_fetch_all_nanos_total_per_sec 0.0
+    attribute_store_fetch_one_events 3391
+    attribute_store_fetch_one_events_per_sec 0.0
+    attribute_store_fetch_one_nanos_per_event 0.0
+    attribute_store_fetch_one_nanos_total 454690753
+
+These values are served as `Content-Type: text/plain`, with each line containing a space-separated
+metric name and value. Values may be integers, doubles, or strings (note: strings are static; the
+others may be dynamic).
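+
+Because each line is just a metric name and value separated by a space, standard shell tools can
+scrape individual stats directly; for example (a sketch, run on the scheduler host with the
+default `-http_port=8081`):
+
+    # Extract a single metric from the plain-text output.
+    curl -s localhost:8081/vars | awk '$1 == "async_tasks_completed" {print $2}'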
+
+If your monitoring infrastructure prefers JSON, the scheduler exports that as well:
+
+    $ vagrant ssh -c 'curl -s localhost:8081/vars.json | python -mjson.tool | head'
+    {
+        "async_tasks_completed": 1009,
+        "attribute_store_fetch_all_events": 15,
+        "attribute_store_fetch_all_events_per_sec": 0.0,
+        "attribute_store_fetch_all_nanos_per_event": 0.0,
+        "attribute_store_fetch_all_nanos_total": 3048285,
+        "attribute_store_fetch_all_nanos_total_per_sec": 0.0,
+        "attribute_store_fetch_one_events": 3409,
+        "attribute_store_fetch_one_events_per_sec": 0.0,
+        "attribute_store_fetch_one_nanos_per_event": 0.0,
+
+This is the same data as above, served with `Content-Type: application/json`.
+
+## Viewing live stat samples on the scheduler
+The scheduler uses the Twitter commons stats library, which keeps an internal time-series database
+of exported variables - nearly everything in `/vars` is available for instant graphing. This is
+useful for debugging, but is not a replacement for an external monitoring system.
+
+You can view these graphs on a scheduler at `/graphview`. It supports some composition and
+aggregation of values, which can be invaluable when triaging a problem. For example, if you have
+the scheduler running in vagrant, check out these links:
+[simple graph](http://192.168.33.7:8081/graphview?query=jvm_uptime_secs)
+[complex composition](http://192.168.33.7:8081/graphview?query=rate\(scheduler_log_native_append_nanos_total\)%2Frate\(scheduler_log_native_append_events\)%2F1e6)
+
+### Counters and gauges
+Among numeric stats, there are two fundamental types exported: _counters_ and _gauges_.
+Counters are guaranteed to be monotonically increasing for the lifetime of a process, while gauges
+may decrease in value. Aurora uses counters to represent things like the number of times an event
+has occurred, and gauges to capture things like the current length of a queue. Counters are a
+natural fit for accurate composition into [rate ratios](http://en.wikipedia.org/wiki/Rate_ratio)
+(useful for sample-resistant latency calculation), while gauges are not.
+
+# Alerting
+
+## Quickstart
+If you are looking for just bare-minimum alerting to get something in place quickly, set up alerting
+on `framework_registered` and `task_store_LOST`. These will give you a decent picture of overall
+health.
+
+## A note on thresholds
+One of the most difficult things in monitoring is choosing alert thresholds. With many of these
+stats, there is no value we can offer as a threshold that will be guaranteed to work for you. It
+will depend on the size of your cluster, the number of jobs, the churn of tasks in the cluster, etc.
+We recommend you start with a strict value after viewing a small amount of collected data, and then
+adjust thresholds as you see fit. Feel free to ask us if you would like to validate that your alerts
+and thresholds make sense.
+
+## Important stats
+
+### `jvm_uptime_secs`
+Type: integer counter
+
+The number of seconds the JVM process has been running. Comes from
+[RuntimeMXBean#getUptime()](http://docs.oracle.com/javase/7/docs/api/java/lang/management/RuntimeMXBean.html#getUptime\(\)).
+
+Detecting resets (decreasing values) on this stat will tell you that the scheduler is failing to
+stay alive.
+
+Look at the scheduler logs to identify the reason the scheduler is exiting.
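+
+A minimal sketch of reset detection (assumptions: the default `-http_port=8081` and a writable
+state file; a real deployment would encode this rule in its monitoring system instead):
+
+    #!/bin/bash
+    # Alert if jvm_uptime_secs went backwards between two samples, which
+    # indicates that the scheduler process restarted.
+    STATE=/tmp/jvm_uptime_secs.last
+    prev=$(cat "$STATE" 2>/dev/null || echo 0)
+    cur=$(curl -s localhost:8081/vars | awk '$1 == "jvm_uptime_secs" {print $2}')
+    if [ -n "$cur" ] && [ "$cur" -lt "$prev" ]; then
+      echo "ALERT: scheduler appears to have restarted (uptime ${cur}s < ${prev}s)"
+    fi
+    [ -n "$cur" ] && echo "$cur" > "$STATE"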
+
+### `system_load_avg`
+Type: double gauge
+
+The current load average of the system for the last minute. Comes from
+[OperatingSystemMXBean#getSystemLoadAverage()](http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html?is-external=true#getSystemLoadAverage\(\)).
+
+A high sustained value suggests that the scheduler machine may be over-utilized.
+
+Use standard unix tools like `top` and `ps` to track down the offending process(es).
+
+### `process_cpu_cores_utilized`
+Type: double gauge
+
+The current number of CPU cores in use by the JVM process. This should not exceed the number of
+logical CPU cores on the machine. Derived from
+[OperatingSystemMXBean#getProcessCpuTime()](http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html).
+
+A high sustained value indicates that the scheduler is overworked. Due to current internal design
+limitations, if this value is sustained at `1`, there is a good chance the scheduler is under water.
+
+There are two main inputs that tend to drive this figure: task scheduling attempts and status
+updates from Mesos. The scheduler logs may give an indication of where time is being spent.
+Beyond that, it really takes good familiarity with the code to effectively triage this. We suggest
+engaging with an Aurora developer.
+
+### `task_store_LOST`
+Type: integer gauge
+
+The number of tasks stored in the scheduler that are in the `LOST` state, and have been rescheduled.
+
+If this value is increasing at a high rate, it is a sign of trouble.
+
+There are many sources of `LOST` tasks in Mesos: the scheduler, master, slave, and executor can all
+trigger this. The first step is to look in the scheduler logs for `LOST` to identify where the
+state changes are originating.
+
+### `scheduler_resource_offers`
+Type: integer counter
+
+The number of resource offers that the scheduler has received.
+
+For a healthy scheduler, this value must be increasing over time.
+
+Assuming the scheduler is up and otherwise healthy, you will want to check whether the master thinks
+it is sending offers. You should also look at the master's web interface to see if it has a large
+number of outstanding offers waiting to be returned.
+
+### `framework_registered`
+Type: binary integer counter
+
+Will be `1` for the leading scheduler that is registered with the Mesos master, and `0` for passive
+schedulers.
+
+A sustained period without a `1` (or where `sum() != 1`) warrants investigation.
+
+If there is no leading scheduler, look in the scheduler and master logs for why. If there are
+multiple schedulers claiming leadership, this suggests a split brain and warrants filing a critical
+bug.
+
+### `rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)`
+Type: rate ratio of integer counters
+
+This composes two counters to compute a windowed figure for the latency of replicated log writes.
+
+A hike in this value suggests disk bandwidth contention.
+
+Look in the scheduler logs for any reported oddness with saving to the replicated log. Also use
+standard tools like `vmstat` and `iotop` to identify whether the disk has become slow or
+over-utilized. We suggest using a dedicated disk for the replicated log to mitigate this.
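+
+For example, illustrative invocations of those standard tools on the scheduler host:
+
+    # Sample system-wide memory/IO/CPU stats five times, five seconds apart.
+    vmstat 5 5
+
+    # Batch mode, three iterations, showing only processes currently doing I/O.
+    sudo iotop -obn3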
+
+### `timed_out_tasks`
+Type: integer counter
+
+Tracks the number of times the scheduler has given up while waiting
+(for `-transient_task_state_timeout`) to hear back about a task that is in a transient state
+(e.g. `ASSIGNED`, `KILLING`), and has moved the task to `LOST` before rescheduling it.
+
+This value is currently known to increase occasionally when the scheduler fails over
+([AURORA-740](https://issues.apache.org/jira/browse/AURORA-740)). However, any large spike in this
+value warrants investigation.
+
+The scheduler will log when it times out a task. You should trace the task ID of the timed-out
+task into the master, slave, and/or executors to determine where the message was dropped.
+
+### `http_500_responses_events`
+Type: integer counter
+
+The total number of HTTP 500 status responses sent by the scheduler. Includes API and asset serving.
+
+An increase warrants investigation.
+
+Look in the scheduler logs to identify why the scheduler returned a 500; there should be a stack
+trace.