; Tue, 21 Feb 2017 20:55:11 +0000 (UTC) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: svn commit: r1783940 [8/20] - in /aurora/site: data/ publish/ publish/blog/ publish/blog/aurora-0-17-0-released/ publish/documentation/0.10.0/ publish/documentation/0.10.0/build-system/ publish/documentation/0.10.0/client-cluster-configuration/ publish... Date: Tue, 21 Feb 2017 20:55:06 -0000 To: commits@aurora.apache.org From: serb@apache.org X-Mailer: svnmailer-1.0.9 Message-Id: <20170221205511.75CF03A3C56@svn01-us-west.apache.org> archived-at: Tue, 21 Feb 2017 20:55:16 -0000 Added: aurora/site/publish/documentation/0.17.0/operations/backup-restore/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.17.0/operations/backup-restore/index.html?rev=1783940&view=auto ============================================================================== --- aurora/site/publish/documentation/0.17.0/operations/backup-restore/index.html (added) +++ aurora/site/publish/documentation/0.17.0/operations/backup-restore/index.html Tue Feb 21 20:54:58 2017 @@ -0,0 +1,230 @@ + + + + + + Apache Aurora + + + + + + +

+ +

Documentation + +

Recovering from a Scheduler Backup

+ +

Be sure to read the entire page before attempting to restore from a backup, as it may have +unintended consequences.

+ +

Summary

+ +

The restoration procedure replaces the existing (possibly corrupted) Mesos replicated log with an +earlier, backed up, version and requires all schedulers to be taken down temporarily while +restoring. Once completed, the scheduler state resets to what it was when the backup was created. +This means any jobs/tasks created or updated after the backup are unknown to the scheduler and will +be killed shortly after the cluster restarts. All other tasks continue operating as normal.

+ +

Usually, it is a bad idea to restore a backup that is not extremely recent (i.e. older than a few +hours). This is because the scheduler will expect the cluster to look exactly as the backup does, +so any tasks that have been rescheduled since the backup was taken will be killed.

+ +

The instructions below have been verified in a Vagrant environment and, with minor syntax/path changes, should be applicable to any Aurora cluster.

+ +

Preparation

+ +

Follow these steps to prepare the cluster for restoring from a backup:

+ +

Stop all scheduler instances
Consider blocking external traffic on a port defined in -http_port for all schedulers to +prevent users from interacting with the scheduler during the restoration process. This will help +troubleshooting by reducing the scheduler log noise and prevent users from making changes that will +be erased after the backup snapshot is restored.
Configure aurora_admin access to run all commands listed in +Restore from backup section locally on the leading scheduler:
+ +
- Make sure the clusters.json file is configured to access the scheduler directly. Set the scheduler_uri setting and remove zk. Since the leader can get re-elected during the restore steps, consider doing this on all scheduler replicas.
- Depending on your particular security approach you will need to either turn off scheduler +authorization by removing scheduler -http_authentication_mechanism flag or make sure the +direct scheduler access is properly authorized. E.g.: in case of Kerberos you will need to make +a /etc/hosts file change to match your local IP to the scheduler URL configured in keytabs:
  + +
The next steps put the scheduler into a partially disabled state in which it can still accept storage recovery requests but cannot schedule tasks or change task states. To do this, update the following scheduler configuration options:
+ +
- Set -mesos_master_address to a non-existent zk address. This will prevent scheduler from +registering with Mesos. E.g.: -mesos_master_address=zk://localhost:1111/mesos/master
- Set -max_registration_delay to a sufficiently long interval to prevent a registration timeout, which would cause the scheduler to terminate itself. E.g.: -max_registration_delay=360mins
- Make sure -reconciliation_initial_delay option is set high enough (e.g.: 365days) to +prevent accidental task GC. This is important as scheduler will attempt to reconcile the cluster +state and will kill all tasks when restarted with an empty Mesos replicated log.
Restart all schedulers
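The three configuration changes above can be collected in one place so they are easy to apply and later undo. The following is a minimal sketch, not part of Aurora: the function name recovery_mode_flags is hypothetical, and the values are simply the examples given in the list above.

```shell
# Sketch only: prints the scheduler flags that keep it in the partially
# disabled state described above. Values are the examples from the text;
# the function name is hypothetical.
recovery_mode_flags() {
  # Non-existent ZooKeeper address, so the scheduler cannot register with Mesos.
  echo "-mesos_master_address=zk://localhost:1111/mesos/master"
  # Long registration delay, so the scheduler does not exit on a timeout.
  echo "-max_registration_delay=360mins"
  # Long initial delay, so reconciliation does not kill tasks.
  echo "-reconciliation_initial_delay=365days"
}
```

How these flags reach the scheduler depends on your start script; they would replace the normal values of the same flags for the duration of the restore.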

+ +

Cleanup and re-initialize Mesos replicated log

+ +

Get rid of the corrupted files and re-initialize Mesos replicated log:

+ +

Stop schedulers
Delete all files under -native_log_file_path on all schedulers
Initialize Mesos replica’s log file: sudo mesos-log initialize --path=<-native_log_file_path>
Start schedulers

+ +

Restore from backup

+ +

At this point the scheduler is ready to rehydrate from the backup:

+ +

Identify the leading scheduler by:
+ +
- examining the scheduler_lifecycle_LEADER_AWAITING_REGISTRATION metric at the scheduler +/vars endpoint. Leader will have 1. All other replicas - 0.
- examining scheduler logs
- or examining Zookeeper registration under the path defined by -zk_endpoints +and -serverset_path
Locate the desired backup file, copy it to the leading scheduler’s -backup_dir folder and stage recovery by running the following command on the leader:
    aurora_admin scheduler_stage_recovery --bypass-leader-redirect <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm>
At this point, the recovery snapshot is staged and available for manual verification/modification via the aurora_admin scheduler_print_recovery_tasks --bypass-leader-redirect and scheduler_delete_recovery_tasks --bypass-leader-redirect commands. See aurora_admin help <command> for usage details.
Commit recovery. This instructs the scheduler to overwrite the existing Mesos replicated log with the provided backup snapshot and initiate a mandatory failover:
    aurora_admin scheduler_commit_recovery --bypass-leader-redirect <cluster>
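The first way of identifying the leader, reading the scheduler_lifecycle_LEADER_AWAITING_REGISTRATION metric, can be scripted. This is a sketch under assumptions: the function name is hypothetical, and it expects the plain-text /vars output (one "name value" pair per line) on standard input.

```shell
# Reads /vars text ("name value" per line) on stdin and succeeds only if
# the leader metric is present with the value 1.
# Possible use, assuming -http_port=8081 on the host being checked:
#   curl -s "http://localhost:8081/vars" | is_leading_scheduler && echo leader
is_leading_scheduler() {
  awk '$1 == "scheduler_lifecycle_LEADER_AWAITING_REGISTRATION" { v = $2 + 0 }
       END { exit (v == 1 ? 0 : 1) }'
}
```

Running the check against each scheduler in turn should report exactly one leader while the cluster is in the prepared state.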

+ +

Cleanup

+ +

Undo any modification done during Preparation sequence.

+ +

Quick Links

The ASF

© 2014-2017 Apache Software Foundation. Licensed under the Apache License v2.0. The Aurora Borealis IX photo displayed on the homepage is available under a Creative Commons BY-NC-ND 2.0 license. Apache, Apache Aurora, and the Apache feather logo are trademarks of The Apache Software Foundation.

+ + + Added: aurora/site/publish/documentation/0.17.0/operations/configuration/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.17.0/operations/configuration/index.html?rev=1783940&view=auto ============================================================================== --- aurora/site/publish/documentation/0.17.0/operations/configuration/index.html (added) +++ aurora/site/publish/documentation/0.17.0/operations/configuration/index.html Tue Feb 21 20:54:58 2017 @@ -0,0 +1,403 @@ + + + + + + Apache Aurora + + + + + + +

+ +


Scheduler Configuration

+ +

The Aurora scheduler can take a variety of configuration options through command-line arguments. +Examples are available under examples/scheduler/. For a list of available Aurora flags and their +documentation, see Scheduler Configuration Reference.

+ +

A Note on Configuration

+ +

Like Mesos, Aurora uses command-line flags for runtime configuration. As such, the Aurora “configuration file” is typically a scheduler.sh shell script of the form:

#!/bin/bash
+AURORA_HOME=/usr/local/aurora-scheduler
+
+# Flags controlling the JVM.
+JAVA_OPTS=(
+  -Xmx2g
+  -Xms2g
+  # GC tuning, etc.
+)
+
+# Flags controlling the scheduler.
+AURORA_FLAGS=(
+  # Port for client RPCs and the web UI
+  -http_port=8081
+  # Log configuration, etc.
+)
+
+# Environment variables controlling libmesos
+export JAVA_HOME=...
+export GLOG_v=1
+# Port and public ip used to communicate with the Mesos master and for the replicated log
+export LIBPROCESS_PORT=8083
+export LIBPROCESS_IP=192.168.33.7
+
+JAVA_OPTS="${JAVA_OPTS[*]}" exec "$AURORA_HOME/bin/aurora-scheduler" "${AURORA_FLAGS[@]}"
+

+ +

That way Aurora’s current flags are visible in ps and in the /vars admin endpoint.

+ +

Replicated Log Configuration

+ +

Aurora schedulers use ZooKeeper to discover log replicas and elect a leader. Only one scheduler is +leader at a given time - the other schedulers follow log writes and prepare to take over as leader +but do not communicate with the Mesos master. Either 3 or 5 schedulers are recommended in a +production deployment depending on failure tolerance and they must have persistent storage.

+ +

Below is a summary of scheduler storage configuration flags that either don’t have default values +or require attention before deploying in a production environment.

+ +

`-native_log_quorum_size`

+ +

Defines the Mesos replicated log quorum size. In a cluster with N schedulers, the flag +-native_log_quorum_size should be set to floor(N/2) + 1. So in a cluster with 1 scheduler +it should be set to 1, in a cluster with 3 it should be set to 2, and in a cluster of 5 it +should be set to 3.

+ + + + + + + + + + + + + + + + + + + + + + + +

Number of schedulers (N)	`-native_log_quorum_size` setting (`floor(N/2) + 1`)
1	1
3	2
5	3
7	4

+ +

Incorrectly setting this flag will cause data corruption to occur!
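Because a wrong value is so costly, it can help to compute the setting rather than type it by hand. The sketch below evaluates floor(N/2) + 1 with shell integer arithmetic; the function name quorum_size is hypothetical.

```shell
# Prints the -native_log_quorum_size value for a cluster of N schedulers.
quorum_size() {
  # Shell integer division truncates, which equals floor for non-negative N.
  echo $(( $1 / 2 + 1 ))
}
```

For example, quorum_size 5 prints 3, matching the table above.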

+ +

`-native_log_file_path`

+ +

Location of the Mesos replicated log files. Consider allocating a dedicated disk (preferably SSD) +for Mesos replicated log files to ensure optimal storage performance.

+ +

`-native_log_zk_group_path`

+ +

ZooKeeper path used for Mesos replicated log quorum discovery.

+ +

See code for +other available Mesos replicated log configuration options and default values.

+ +

Changing the Quorum Size

+ +

Special care needs to be taken when changing the size of the Aurora scheduler quorum. +Since Aurora uses a Mesos replicated log, similar steps need to be followed as when +changing the Mesos quorum size.

+ +

As a preparation, increase -native_log_quorum_size on each existing scheduler and restart them. +When updating from 3 to 5 schedulers, the quorum size would grow from 2 to 3.

+ +

Start the new schedulers with -native_log_quorum_size set to the new value. Failing to first increase the quorum size on running schedulers can in some cases result in corruption or truncation of the replicated log used by Aurora. In that case, see the documentation on recovering from backup.

+ +

Backup Configuration

+ +

Configuration options for the Aurora scheduler backup manager.

+ +

-backup_interval: The interval on which the scheduler writes local storage backups. The default is every hour.
-backup_dir: Directory to write backups to.
-max_saved_backups: Maximum number of backups to retain before deleting the oldest backup(s).
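When a restore is needed, the most recent file in -backup_dir is usually the one to use. The sketch below assumes backup files are named scheduler-backup-<yyyy-MM-dd-HH-mm>, as in the recovery instructions; with that zero-padded format, plain string order is also chronological order. The function name is hypothetical.

```shell
# Reads file names on stdin and prints the newest backup name.
# Possible use (BACKUP_DIR is a placeholder for your -backup_dir value):
#   ls "$BACKUP_DIR" | latest_backup
latest_backup() {
  # Zero-padded timestamps sort chronologically as plain strings.
  grep '^scheduler-backup-' | LC_ALL=C sort | tail -n 1
}
```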

+ +

Resource Isolation

+ +

For proper CPU, memory, and disk isolation as mentioned in our enduser documentation, we recommend adding the following isolators to the --isolation flag of the Mesos agent:

+ +

cgroups/cpu
cgroups/mem
disk/du

+ +

In addition, we recommend setting the following agent flags:

+ +

--cgroups_limit_swap to enable memory limits on both memory and swap instead of just memory. +Alternatively, you could disable swap on your agent hosts.
--cgroups_enable_cfs to enable hard limits on CPU resources via the CFS bandwidth limiting +feature.
--enforce_container_disk_quota to enable disk quota enforcement for containers.

+ +

To enable the optional GPU support in Mesos, please see the GPU related flags in the +Mesos configuration. +To enable the corresponding feature in Aurora, you have to start the scheduler with the +flag

-allow_gpu_resource=true
+

+ +

If you want to use revocable resources, first follow the Mesos oversubscription documentation and then set this Aurora scheduler flag to allow receiving revocable Mesos offers:

-receive_revocable_resources=true
+

+ +

Both CPUs and RAM are supported as revocable resources. The former is enabled by default, the latter needs to be enabled via:

-enable_revocable_ram=true
+

+ +

Unless you want to use the default +tier configuration, you will also have to specify a file path:

-tier_config=path/to/tiers/config.json
+

+ +

Containers

+ +

Both the Mesos and Docker containerizers require configuration of the Mesos agent.

+ +

Mesos Containerizer

+ +

The minimal agent configuration requires enabling Docker and Appc image support for the Mesos containerizer:

--containerizers=mesos
+--image_providers=appc,docker
+--isolation=filesystem/linux,docker/runtime  # as an addition to your other isolators
+

+ +

Further details can be found in the corresponding Mesos documentation.

+ +

Docker Containerizer

+ +

The Docker containerizer requires that the Docker engine be installed on each agent host. In addition, it must be enabled on the Mesos agents by launching them with the option:

--containerizers=mesos,docker
+

+ +

If you would like to run a container with a read-only filesystem, it may also be necessary to use the scheduler flag -thermos_home_in_sandbox in order to set HOME to the sandbox before the executor runs. This ensures that the executor/runner PEX extraction happens inside the sandbox instead of the container filesystem root.

+ +

If you would like to supply your own parameters to docker run when launching jobs in docker +containers, you may use the following flags:

-allow_docker_parameters
+-default_docker_parameters
+

+ +

-allow_docker_parameters controls whether or not users may pass their own configuration parameters +through the job configuration files. If set to false (the default), the scheduler will reject +jobs with custom parameters. NOTE: this setting should be used with caution as it allows any job +owner to specify any parameters they wish, including those that may introduce security concerns +(privileged=true, for example).

+ +

-default_docker_parameters allows a cluster operator to specify a universal set of parameters that +should be used for every container that does not have parameters explicitly configured at the job +level. The argument accepts a multimap format:

-default_docker_parameters="read-only=true,tmpfs=/tmp,tmpfs=/run"
+
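To make the multimap format concrete, the sketch below splits such a string into one "key value" pair per line, keeping repeated keys such as tmpfs. It only illustrates the format and is not the scheduler's own parser; in particular, it assumes values contain no commas. The function name is hypothetical.

```shell
# Splits "read-only=true,tmpfs=/tmp,tmpfs=/run" into "key value" lines.
# Everything after the first "=" of an entry is treated as the value.
split_docker_parameters() {
  printf '%s\n' "$1" | tr ',' '\n' |
    awk -F= 'NF >= 2 { print $1, substr($0, length($1) + 2) }'
}
```

Because tmpfs appears twice in the example, the output contains two tmpfs lines, which is what distinguishes a multimap from a plain key/value map.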

+ +

Common Options

+ +

The following Aurora options work for both containerizers.

+ +

The scheduler flag -global_container_mounts allows mounting paths from the host (i.e. the agent machine) into all containers on that host. The format is a comma-separated list of hostpath:containerpath[:mode] tuples. For example, -global_container_mounts=/opt/secret_keys_dir:/mnt/secret_keys_dir:ro mounts /opt/secret_keys_dir from the agents into all launched containers. Valid modes are ro and rw.
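A value for this flag can be checked against the described format before the scheduler is restarted. The following is a sketch with a hypothetical function name; it checks only the shape of each tuple and the mode, not whether the paths exist.

```shell
# Succeeds only if every comma-separated tuple is hostpath:containerpath
# or hostpath:containerpath:mode, with mode being ro or rw.
valid_container_mounts() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F: '
    NF < 2 || NF > 3 || $1 == "" || $2 == "" { bad = 1 }
    NF == 3 && $3 != "ro" && $3 != "rw"      { bad = 1 }
    END { exit (bad ? 1 : 0) }'
}
```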

+ +

Thermos Process Logs

+ +

Log destination

+ +

By default, Thermos will write process stdout/stderr to log files in the sandbox. Process object +configuration allows specifying alternate log file destinations like streamed stdout/stderr or +suppression of all log output. Default behavior can be configured for the entire cluster with the +following flag (through the -thermos_executor_flags argument to the Aurora scheduler):

--runner-logger-destination=both
+

+ +

The both setting sends logs to files and also streams them to the parent's stdout/stderr.

+ +

See Configuration Reference for all destination options.

+ +

Log rotation

+ +

By default, Thermos will not rotate the stdout/stderr logs from child processes and they will grow +without bound. An individual user may change this behavior via configuration on the Process object, +but it may also be desirable to change the default configuration for the entire cluster. +In order to enable rotation by default, the following flags can be applied to Thermos (through the +-thermos_executor_flags argument to the Aurora scheduler):

--runner-logger-mode=rotate
+--runner-rotate-log-size-mb=100
+--runner-rotate-log-backups=10
+

+ +

In the above example, each instance of the Thermos runner will rotate stderr/stdout logs once they +reach 100 MiB in size and keep a maximum of 10 backups. If a user has provided a custom setting for +their process, it will override these default settings.

+ +

Thermos Executor Wrapper

+ +

If you need to do computation before starting the Thermos executor (for example, setting a different +--announcer-hostname parameter for every executor), then the Thermos executor should be invoked +inside a wrapper script. In such a case, the aurora scheduler should be started with +-thermos_executor_path pointing to the wrapper script and -thermos_executor_resources set to a +comma separated string of all the resources that should be copied into the sandbox (including the +original Thermos executor). Ensure the wrapper script does not access resources outside of the +sandbox, as when the script is run from within a Docker container those resources may not exist.

+ +

For example, to wrap the executor inside a simple wrapper, the scheduler would be started with the following flags:
    -thermos_executor_path=/path/to/wrapper.sh -thermos_executor_resources=/usr/share/aurora/bin/thermos_executor.pex

+ +

Custom Executors

+ +

The scheduler can be configured to utilize a custom executor by specifying the -custom_executor_config flag. +The flag must be set to the path of a valid executor configuration file.

+ +

For more information on this feature please see the custom executors documentation.

+ +

A note on increasing executor overhead

+ +

Increasing executor overhead on an existing cluster, whether it be for custom executors or for Thermos, will result in degraded preemption performance until all tasks that began life with the previous, lower-overhead executor configuration are preempted/restarted.

+ +


+ + + Added: aurora/site/publish/documentation/0.17.0/operations/installation/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.17.0/operations/installation/index.html?rev=1783940&view=auto ============================================================================== --- aurora/site/publish/documentation/0.17.0/operations/installation/index.html (added) +++ aurora/site/publish/documentation/0.17.0/operations/installation/index.html Tue Feb 21 20:54:58 2017 @@ -0,0 +1,455 @@ + + + + + + Apache Aurora + + + + + + +

+ +


Installing Aurora

+ +

Source and binary distributions can be found on our +downloads page. Installing from binary packages is +recommended for most.

+ +

Installing the scheduler
Installing worker components
Installing the client
Installing Mesos
Troubleshooting

+ +

If our binary packages don’t suit you, our package build toolchain makes it easy to build your own packages. See the instructions to learn how.

+ +

Machine profiles

+ +

Given that many of these components communicate over the network, there are numerous ways you could +assemble them to create an Aurora cluster. The simplest way is to think in terms of three machine +profiles:

+ +

Coordinator

+ +

Components: ZooKeeper, Aurora scheduler, Mesos master

+ +

A small number of machines (typically 3 or 5) responsible for cluster orchestration. In most cases +it is fine to co-locate these components in anything but very large clusters (> 1000 machines). +Beyond that point, operators will likely want to manage these services on separate machines.

+ +

In practice, 5 coordinators have been shown to reliably manage clusters with tens of thousands of +machines.

+ +

Worker

+ +

Components: Aurora executor, Aurora observer, Mesos agent

+ +

The bulk of the cluster, where services will actually run.

+ +

Client

+ +

Components: Aurora client, Aurora admin client

+ +

Any machines that users submit jobs from.

+ +

Installing the scheduler

+ +

Ubuntu Trusty

+ +

Install Mesos +Skip down to install mesos, then run:
+
```
sudo start mesos-master
+
```
Install ZooKeeper
+
```
sudo apt-get install -y zookeeperd
+
```

Install the Aurora scheduler

sudo add-apt-repository -y ppa:openjdk-r/ppa
+sudo apt-get update
+sudo apt-get install -y openjdk-8-jre-headless wget
+
+sudo update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
+
+wget -c https://apache.bintray.com/aurora/ubuntu-trusty/aurora-scheduler_0.17.0_amd64.deb
+sudo dpkg -i aurora-scheduler_0.17.0_amd64.deb
+

+ +

CentOS 7

+ +

Install Mesos +Skip down to install mesos, then run:
+
```
sudo systemctl start mesos-master
+
```

Install ZooKeeper

sudo rpm -Uvh https://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
+sudo yum install -y java-1.8.0-openjdk-headless zookeeper-server
+
+sudo service zookeeper-server init
+sudo systemctl start zookeeper-server
+

Install the Aurora scheduler

sudo yum install -y wget
+
+wget -c https://apache.bintray.com/aurora/centos-7/aurora-scheduler-0.17.0-1.el7.centos.aurora.x86_64.rpm
+sudo yum install -y aurora-scheduler-0.17.0-1.el7.centos.aurora.x86_64.rpm
+

+ +

Finalizing

+ +

By default, the scheduler will start in an uninitialized mode. This is because external +coordination is necessary to be certain operator error does not result in a quorum of schedulers +starting up and believing their databases are empty when in fact they should be re-joining a +cluster.

+ +

Because of this, a fresh install of the scheduler will need intervention to start up. First, +stop the scheduler service. +Ubuntu: sudo stop aurora-scheduler +CentOS: sudo systemctl stop aurora

+ +

Now initialize the database:

sudo -u aurora mkdir -p /var/lib/aurora/scheduler/db
+sudo -u aurora mesos-log initialize --path=/var/lib/aurora/scheduler/db
+

+ +

Now you can start the scheduler back up. +Ubuntu: sudo start aurora-scheduler +CentOS: sudo systemctl start aurora

+ +

Installing worker components

+ +

Ubuntu Trusty

+ +

Install Mesos +Skip down to install mesos, then run:
+
```
sudo start mesos-slave
+
```

Install Aurora executor and observer

sudo apt-get install -y python2.7 wget
+
+# NOTE: This appears to be a missing dependency of the mesos deb package and is needed
+# for the python mesos native bindings.
+sudo apt-get -y install libcurl4-nss-dev
+
+wget -c https://apache.bintray.com/aurora/ubuntu-trusty/aurora-executor_0.17.0_amd64.deb
+sudo dpkg -i aurora-executor_0.17.0_amd64.deb
+

+ +

CentOS 7

+ +

Install Mesos +Skip down to install mesos, then run:
+
```
sudo systemctl start mesos-slave
+
```

Install Aurora executor and observer

sudo yum install -y python2 wget
+
+wget -c https://apache.bintray.com/aurora/centos-7/aurora-executor-0.17.0-1.el7.centos.aurora.x86_64.rpm
+sudo yum install -y aurora-executor-0.17.0-1.el7.centos.aurora.x86_64.rpm
+

+ +

Configuration

+ +

The executor typically does not require configuration. Command line arguments can be passed to the executor through the -thermos_executor_flags argument on the scheduler.

+ +

The observer needs to be configured to look at the correct Mesos directory in order to find task sandboxes. First find the Mesos working directory by looking for the Mesos agent --work_dir flag. You should see something like:

    ps -eocmd | grep "mesos-slave" | grep -v grep | tr ' ' '\n' | grep "\--work_dir"
+    --work_dir=/var/lib/mesos
+

+ +

If the flag is not set, you can view the default value like so:

    mesos-slave --help
+    Usage: mesos-slave [options]
+
+      ...
+      --work_dir=VALUE      Directory path to place framework work directories
+                            (default: /tmp/mesos)
+      ...
+

+ +

The value you find for --work_dir, /var/lib/mesos in this example, should match the Aurora +observer value for --mesos-root. You can look for that setting in a similar way on a worker +node by grepping for thermos_observer and --mesos-root. If the flag is not set, you can view +the default value like so:

    thermos_observer -h
+    Options:
+      ...
+      --mesos-root=MESOS_ROOT
+                            The mesos root directory to search for Thermos
+                            executor sandboxes [default: /var/lib/mesos]
+      ...
+

+ +

In this case the default is /var/lib/mesos and we have a match. If there is no match, you can either adjust the Mesos agent start script(s) and restart the agent(s) or else adjust the Aurora observer start scripts and restart the observers. To adjust the Aurora observer:

+ +

Ubuntu Trusty

sudo sh -c 'echo "MESOS_ROOT=/tmp/mesos" >> /etc/default/thermos'
+

+ +

CentOS 7

+ +

Make an edit to add the --mesos-root flag resulting in something like:

grep -A5 OBSERVER_ARGS /etc/sysconfig/thermos
+OBSERVER_ARGS=(
+  --port=1338
+  --mesos-root=/tmp/mesos
+  --log_to_disk=NONE
+  --log_to_stderr=google:INFO
+)
+
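The comparison between the agent's --work_dir and the observer's --mesos-root described in this section can be scripted. The sketch below works on command lines passed in as strings (for example, taken from ps output); the function names are hypothetical, and the fallback defaults are the ones shown in the help output above, which may differ in other versions.

```shell
# Prints the value of a --name=value flag found in a command line string.
# $1: flag name (for example --work_dir), $2: full command line
flag_value() {
  printf '%s\n' "$2" | tr ' ' '\n' |
    awk -F= -v f="$1" '$1 == f { print substr($0, length(f) + 2); exit }'
}

# Succeeds when the agent work dir and the observer mesos root are the same.
# $1: mesos-slave command line, $2: thermos_observer command line
roots_match() {
  work_dir=$(flag_value --work_dir "$1")
  mesos_root=$(flag_value --mesos-root "$2")
  # Fall back to the defaults from the help output when a flag is not set.
  [ "${work_dir:-/tmp/mesos}" = "${mesos_root:-/var/lib/mesos}" ]
}
```

Note that with neither flag set the defaults differ (/tmp/mesos versus /var/lib/mesos), so the check fails until one side is adjusted.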

+ +

Installing the client

+ +

Ubuntu Trusty

sudo apt-get install -y python2.7 wget
+
+wget -c https://apache.bintray.com/aurora/ubuntu-trusty/aurora-tools_0.17.0_amd64.deb
+sudo dpkg -i aurora-tools_0.17.0_amd64.deb
+

+ +

CentOS 7

sudo yum install -y python2 wget
+
+wget -c https://apache.bintray.com/aurora/centos-7/aurora-tools-0.17.0-1.el7.centos.aurora.x86_64.rpm
+sudo yum install -y aurora-tools-0.17.0-1.el7.centos.aurora.x86_64.rpm
+

+ +

Mac OS X

brew upgrade
+brew install aurora-cli
+

+ +

Configuration

+ +

Client configuration lives in a json file that describes the clusters available and how to reach +them. By default this file is at /etc/aurora/clusters.json.

+ +

Jobs may be submitted to the scheduler using the client, and are described with +job configurations expressed in .aurora files. Typically you will +maintain a single job configuration file to describe one or more deployment environments (e.g. +dev, test, prod) for a production job.

+ +

Installing Mesos

+ +

Mesos uses a single package for the Mesos master and agent. As a result, the package dependencies +are identical for both.

+ +

Mesos on Ubuntu Trusty

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
+DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
+CODENAME=$(lsb_release -cs)
+
+echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | \
+  sudo tee /etc/apt/sources.list.d/mesosphere.list
+sudo apt-get -y update
+
+# Use `apt-cache showpkg mesos | grep [version]` to find the exact version.
+sudo apt-get -y install mesos=1.1.0-2.0.107.ubuntu1404
+

+ +

Mesos on CentOS 7

sudo rpm -Uvh https://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-1.noarch.rpm
+sudo yum -y install mesos-1.1.0
+

+ +

Troubleshooting

+ +

So you’ve started your first cluster and are running into some issues? We’ve collected some common +stumbling blocks and solutions here to help get you moving.

+ +

Replicated log not initialized

+ +

Symptoms

+ +

Scheduler RPCs and web interface claim Storage is not READY
Scheduler log repeatedly prints messages like

  I1016 16:12:27.234133 26081 replica.cpp:638] Replica in EMPTY status
+  received a broadcasted recover request
+  I1016 16:12:27.234256 26084 recover.cpp:188] Received a recover response
+  from a replica in EMPTY status
+

+ +

Solution

+ +

When you create a new cluster, you need to inform a quorum of schedulers that they are safe to +consider their database to be empty by initializing the +replicated log. This is done to prevent the scheduler from modifying the cluster state in the event +of multiple simultaneous disk failures or, more likely, misconfiguration of the replicated log path.

+ +

Scheduler not registered

+ +

Symptoms

+ +

Scheduler log contains

Framework has not been registered within the tolerated delay.
+

+ +

Solution

+ +

Double-check that the scheduler is configured correctly to reach the Mesos master. If you are registering the master in ZooKeeper, make sure the command line argument to the master:

--zk=zk://$ZK_HOST:2181/mesos/master
+

+ +

is the same as the one on the scheduler:

-mesos_master_address=zk://$ZK_HOST:2181/mesos/master
+
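The comparison of the two arguments can be done mechanically by stripping the flag names and comparing what remains. This is a sketch with a hypothetical function name; it compares the strings literally, so two ZooKeeper URLs that list the same hosts in a different order count as different.

```shell
# Succeeds when both arguments point at the same ZooKeeper path.
# $1: the master's --zk=... argument
# $2: the scheduler's -mesos_master_address=... argument
same_master_address() {
  [ "${1#--zk=}" = "${2#-mesos_master_address=}" ]
}
```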

+ +

Scheduler not running

+ +

Symptom

+ +

The scheduler process commits suicide regularly. This happens under error conditions, but also on purpose at regular intervals.

+ +

Solution

+ +

Aurora is meant to be run under supervision. You have to configure a supervisor like Monit or supervisord to run the scheduler and restart it whenever it fails or exits on purpose.

+ +

Aurora supports an active health checking protocol on its admin HTTP interface - if a GET /health +times out or returns anything other than 200 OK the scheduler process is unhealthy and should be +restarted.

+ +

For example, monit can be configured with

if failed port 8081 send "GET /health HTTP/1.0\r\n" expect "OK\n" with timeout 2 seconds for 10 cycles then restart
+

+ +

assuming you set -http_port=8081.
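For supervisors without built-in HTTP checks, the same rule (anything other than 200 OK, including a timeout, means restart) can be expressed as a small helper. This is a sketch: the function name is hypothetical, and the curl invocation in the comment assumes -http_port=8081 and a 2 second timeout as in the monit example.

```shell
# Maps the HTTP status of GET /health to a supervisor decision.
# Any status other than 200 (including an empty one or curl's 000 for a
# timeout or connection failure) results in "restart".
# Possible use:
#   health_action "$(curl -s -o /dev/null -m 2 -w '%{http_code}' http://localhost:8081/health)"
health_action() {
  if [ "$1" = "200" ]; then
    echo "ok"
  else
    echo "restart"
  fi
}
```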

+ +


+ + + Added: aurora/site/publish/documentation/0.17.0/operations/monitoring/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.17.0/operations/monitoring/index.html?rev=1783940&view=auto ============================================================================== --- aurora/site/publish/documentation/0.17.0/operations/monitoring/index.html (added) +++ aurora/site/publish/documentation/0.17.0/operations/monitoring/index.html Tue Feb 21 20:54:58 2017 @@ -0,0 +1,325 @@ + + + + + + Apache Aurora + + + + + + +

+ +


Monitoring your Aurora cluster

+ +

Before you start running important services in your Aurora cluster, it’s important to set up +monitoring and alerting of Aurora itself. Most of your monitoring can be against the scheduler, +since it will give you a global view of what’s going on.

+ +

Reading stats

+ +

The scheduler exposes a lot of instrumentation data via its HTTP interface. You can get a quick +peek at the first few of these in our vagrant image:

$ vagrant ssh -c 'curl -s localhost:8081/vars | head'
+async_tasks_completed 1004
+attribute_store_fetch_all_events 15
+attribute_store_fetch_all_events_per_sec 0.0
+attribute_store_fetch_all_nanos_per_event 0.0
+attribute_store_fetch_all_nanos_total 3048285
+attribute_store_fetch_all_nanos_total_per_sec 0.0
+attribute_store_fetch_one_events 3391
+attribute_store_fetch_one_events_per_sec 0.0
+attribute_store_fetch_one_nanos_per_event 0.0
+attribute_store_fetch_one_nanos_total 454690753
+

+ +

These values are served as Content-Type: text/plain, with each line containing a space-separated metric +name and value. Values may be integers, doubles, or strings (note: strings are static, others +may be dynamic).
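Since each line is just a metric name, a space, and a value, a single stat can be pulled out with standard tools. The sketch below uses a hypothetical function name and prints everything after the first space, so string values that themselves contain spaces are kept whole.

```shell
# Reads /vars text on stdin and prints the value of the named stat.
# Fails (non-zero exit status) if the stat is not present.
# $1: metric name
get_stat() {
  awk -v name="$1" '
    $1 == name { sub(/^[^ ]+ /, ""); print; found = 1; exit }
    END { exit (found ? 0 : 1) }'
}
```

Against the vagrant image this could be fed with curl -s localhost:8081/vars, as in the example above.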

+ +

If your monitoring infrastructure prefers JSON, the scheduler exports that as well:

$ vagrant ssh -c 'curl -s localhost:8081/vars.json | python -mjson.tool | head'
+{
+    "async_tasks_completed": 1009,
+    "attribute_store_fetch_all_events": 15,
+    "attribute_store_fetch_all_events_per_sec": 0.0,
+    "attribute_store_fetch_all_nanos_per_event": 0.0,
+    "attribute_store_fetch_all_nanos_total": 3048285,
+    "attribute_store_fetch_all_nanos_total_per_sec": 0.0,
+    "attribute_store_fetch_one_events": 3409,
+    "attribute_store_fetch_one_events_per_sec": 0.0,
+    "attribute_store_fetch_one_nanos_per_event": 0.0,
+

+ +

This will be the same data as above, served with Content-Type: application/json.

+ +

Viewing live stat samples on the scheduler

+ +

The scheduler uses the Twitter commons stats library, which keeps an internal time-series database +of exported variables - nearly everything in /vars is available for instant graphing. This is +useful for debugging, but is not a replacement for an external monitoring system.

+ +

You can view these graphs on a scheduler at /graphview. It supports some composition and +aggregation of values, which can be invaluable when triaging a problem. For example, if you have +the scheduler running in vagrant, check out these links: +simple graph +complex composition

+ +

Counters and gauges

+ +

Among numeric stats, there are two fundamental types of stats exported: counters and gauges. +Counters are guaranteed to be monotonically-increasing for the lifetime of a process, while gauges +may decrease in value. Aurora uses counters to represent things like the number of times an event +has occurred, and gauges to capture things like the current length of a queue. Counters are a +natural fit for accurate composition into rate ratios +(useful for sample-resistant latency calculation), while gauges are not.
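To illustrate why counters compose well into rates, the sketch below derives a per-second rate from two samples of the same counter. The function name and the sampling interval are assumptions; a lower second value is treated as a process restart and rejected rather than reported as a negative rate.

```shell
# Per-second rate between two samples of a counter.
# $1, $2: earlier value and its timestamp in seconds
# $3, $4: later value and its timestamp in seconds
counter_rate() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" 'BEGIN {
    if (t2 <= t1 || v2 < v1) exit 1   # no elapsed time, or the counter reset
    printf "%.2f\n", (v2 - v1) / (t2 - t1)
  }'
}
```

The same calculation on a gauge would be misleading, since a gauge can legitimately fall between samples.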

+ +
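
To illustrate why counters compose well into rate ratios, here is a minimal sketch that divides the increase of one counter by the increase of another between two samples. Because both deltas cover the same interval, the sampling period cancels out. The function name `rate_ratio` and the second sample's values are assumptions for illustration; the stat names and the first sample come from the JSON output shown earlier.

```python
def rate_ratio(prev, curr, numerator, denominator):
    """Ratio of the increases of two counters between two samples.

    Returns None when the ratio is undefined: either a counter went
    backwards (the process restarted) or the denominator did not move.
    """
    d_num = curr[numerator] - prev[numerator]
    d_den = curr[denominator] - prev[denominator]
    if d_num < 0 or d_den < 0:
        return None  # counter reset between samples
    if d_den == 0:
        return None  # no events in this window
    return d_num / d_den


# First sample uses values from the output above; the second is hypothetical.
prev = {"attribute_store_fetch_all_nanos_total": 3048285,
        "attribute_store_fetch_all_events": 15}
curr = {"attribute_store_fetch_all_nanos_total": 3548285,
        "attribute_store_fetch_all_events": 20}

nanos_per_event = rate_ratio(
    prev, curr,
    "attribute_store_fetch_all_nanos_total",
    "attribute_store_fetch_all_events",
)
```

The same calculation on two gauges would not be meaningful, since a gauge's difference between samples does not represent accumulated work.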

Alerting

Quickstart

If you are looking for just bare-minimum alerting to get something in place quickly, set up alerting on framework_registered and task_store_LOST. These will give you a decent picture of overall health.

A note on thresholds

One of the most difficult things in monitoring is choosing alert thresholds. With many of these stats, there is no value we can offer as a threshold that will be guaranteed to work for you. It will depend on the size of your cluster, number of jobs, churn of tasks in the cluster, etc. We recommend you start with a strict value after viewing a small amount of collected data, and then adjust thresholds as you see fit. Feel free to ask us if you would like to validate that your alerts and thresholds make sense.

Important stats

`jvm_uptime_secs`

Type: integer counter

The number of seconds the JVM process has been running. Comes from RuntimeMXBean#getUptime().

Detecting resets (decreasing values) on this stat will tell you that the scheduler is failing to stay alive.

Look at the scheduler logs to identify the reason the scheduler is exiting.
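
Reset detection on a counter such as this one can be sketched as a scan for any sample that is lower than its predecessor. The function name `find_resets` and the sample series are hypothetical; a real monitoring system would usually provide an equivalent built-in.

```python
def find_resets(samples):
    """Return the indices at which the value dropped below the previous sample."""
    return [i for i in range(1, len(samples)) if samples[i] < samples[i - 1]]


# Hypothetical jvm_uptime_secs samples taken once a minute.
uptime = [100, 160, 220, 15, 75, 135]
resets = find_resets(uptime)
```

Here the drop from 220 to 15 marks one restart, so alerting on the number of entries in `resets` over a window gives a restart count.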

`system_load_avg`

Type: double gauge

The current load average of the system for the last minute. Comes from OperatingSystemMXBean#getSystemLoadAverage().

A high sustained value suggests that the scheduler machine may be over-utilized.

Use standard unix tools like top and ps to track down the offending process(es).

`process_cpu_cores_utilized`

Type: double gauge

The current number of CPU cores in use by the JVM process. This should not exceed the number of logical CPU cores on the machine. Derived from OperatingSystemMXBean#getProcessCpuTime().

A high sustained value indicates that the scheduler is overworked. Due to current internal design limitations, if this value is sustained at 1, there is a good chance the scheduler is overloaded.

There are two main inputs that tend to drive this figure: task scheduling attempts and status updates from Mesos. Activity in the scheduler logs may give an indication of where time is being spent. Beyond that, it really takes good familiarity with the code to effectively triage this. We suggest engaging with an Aurora developer.
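
One plausible way to derive a "cores in use" figure from a cumulative process CPU time is to divide the CPU time consumed between two readings by the wall-clock time that elapsed. The sketch below assumes both readings are in nanoseconds; it is an illustration of the arithmetic, not a confirmed description of how the scheduler computes the stat, and the function name `cores_utilized` is hypothetical.

```python
def cores_utilized(cpu_ns_start, cpu_ns_end, wall_ns_start, wall_ns_end):
    """CPU time consumed per unit of wall-clock time, i.e. average cores in use."""
    elapsed = wall_ns_end - wall_ns_start
    if elapsed <= 0:
        raise ValueError("wall-clock interval must be positive")
    return (cpu_ns_end - cpu_ns_start) / elapsed


# Hypothetical readings: 1.5 s of CPU time consumed during 1 s of wall time.
cores = cores_utilized(0, 1_500_000_000, 0, 1_000_000_000)
```

A result near 1 means the process kept roughly one core busy for the whole interval, which is the situation the text above warns about.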

`task_store_LOST`

Type: integer gauge

The number of tasks stored in the scheduler that are in the LOST state, and have been rescheduled.

If this value is increasing at a high rate, it is a sign of trouble.

There are many sources of LOST tasks in Mesos: the scheduler, master, agent, and executor can all trigger this. The first step is to look in the scheduler logs for LOST to identify where the state changes are originating.
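
A simple way to quantify "increasing at a high rate" is to measure the growth of the gauge per minute across a window of timestamped samples. The following is a minimal sketch; the function name, the sample values, and the threshold `MAX_LOST_PER_MINUTE` are all assumptions that would need tuning for a real cluster.

```python
MAX_LOST_PER_MINUTE = 10.0  # hypothetical threshold, tune for your cluster


def increase_per_minute(samples):
    """Growth rate between the first and last (unix_seconds, value) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (v1 - v0) * 60.0 / (t1 - t0)


# Hypothetical task_store_LOST samples over two minutes.
lost_samples = [(0, 4), (60, 6), (120, 40)]
lost_rate = increase_per_minute(lost_samples)
should_alert = lost_rate > MAX_LOST_PER_MINUTE
```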

`scheduler_resource_offers`

Type: integer counter

The number of resource offers that the scheduler has received.

For a healthy scheduler, this value must be increasing over time.

Assuming the scheduler is up and otherwise healthy, you will want to check if the master thinks it is sending offers. You should also look at the master’s web interface to see if it has a large number of outstanding offers that it is waiting to have returned.
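
The "must be increasing" condition can be checked by confirming that at least one consecutive pair of samples in a window shows growth. Comparing consecutive pairs, rather than only the first and last sample, avoids a false alarm when the counter restarts from zero mid-window. The function name `is_stalled` and the sample values are hypothetical.

```python
def is_stalled(samples):
    """True if the counter never increased between consecutive samples."""
    if len(samples) < 2:
        return False  # not enough data to decide
    return all(b <= a for a, b in zip(samples, samples[1:]))


# Hypothetical scheduler_resource_offers windows.
flat = is_stalled([120, 120, 120])        # no offers received
healthy = is_stalled([120, 125, 131])     # offers arriving
after_restart = is_stalled([100, 5, 9])   # reset, then growth
```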

`framework_registered`

Type: binary integer counter

Will be 1 for the leading scheduler that is registered with the Mesos master, 0 for passive schedulers.

A sustained period without a 1 (or where sum() != 1) warrants investigation.

If there is no leading scheduler, look in the scheduler and master logs for why. If there are multiple schedulers claiming leadership, this suggests a split brain and warrants filing a critical bug.
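
The sum() != 1 condition can be sketched as a check over the latest framework_registered value collected from each scheduler. The host names, the function name `leadership_status`, and the returned labels are hypothetical.

```python
def leadership_status(registered_by_host):
    """Classify the cluster from each scheduler's framework_registered value."""
    total = sum(registered_by_host.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no_leader"
    return "split_brain"  # more than one scheduler claims leadership


# Hypothetical readings from three schedulers.
status = leadership_status({"scheduler-1": 1, "scheduler-2": 0, "scheduler-3": 0})
```

Since a brief gap during failover is expected, an alert would normally fire only when a non-"ok" status persists across several consecutive checks.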

`rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)`

Type: rate ratio of integer counters

This composes two counters to compute a windowed figure for the latency of replicated log writes.

A spike in this value suggests disk bandwidth contention.

Look in scheduler logs for any reported oddness with saving to the replicated log. Also use standard tools like vmstat and iotop to identify whether the disk has become slow or over-utilized. We suggest using a dedicated disk for the replicated log to mitigate this.

`timed_out_tasks`

Type: integer counter

Tracks the number of times the scheduler has given up while waiting (for -transient_task_state_timeout) to hear back about a task that is in a transient state (e.g. ASSIGNED, KILLING), and has moved it to LOST before rescheduling.

This value is currently known to increase occasionally when the scheduler fails over (AURORA-740). However, any large spike in this value warrants investigation.

The scheduler will log when it times out a task. You should trace the task ID of the timed out task into the master, agent, and/or executors to determine where the message was dropped.

`http_500_responses_events`

Type: integer counter

The total number of HTTP 500 status responses sent by the scheduler. Includes API and asset serving.

An increase warrants investigation.

Look in scheduler logs to identify why the scheduler returned a 500; there should be a stack trace.








