aurora-commits mailing list archives

From wfar...@apache.org
Subject git commit: Document common problems and solutions when creating a new cluster.
Date Mon, 20 Oct 2014 19:36:14 GMT
Repository: incubator-aurora
Updated Branches:
  refs/heads/master c6d0d78f1 -> 0ddc4bc4c


Document common problems and solutions when creating a new cluster.

Bugs closed: AURORA-840

Reviewed at https://reviews.apache.org/r/26832/


Project: http://git-wip-us.apache.org/repos/asf/incubator-aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-aurora/commit/0ddc4bc4
Tree: http://git-wip-us.apache.org/repos/asf/incubator-aurora/tree/0ddc4bc4
Diff: http://git-wip-us.apache.org/repos/asf/incubator-aurora/diff/0ddc4bc4

Branch: refs/heads/master
Commit: 0ddc4bc4c62dd0c4e5f2a16aa2109285b879296e
Parents: c6d0d78
Author: Bill Farner <wfarner@apache.org>
Authored: Mon Oct 20 12:35:54 2014 -0700
Committer: Bill Farner <wfarner@apache.org>
Committed: Mon Oct 20 12:35:54 2014 -0700

----------------------------------------------------------------------
 docs/deploying-aurora-scheduler.md | 125 ++++++++++++++++++++++++++------
 1 file changed, 104 insertions(+), 21 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-aurora/blob/0ddc4bc4/docs/deploying-aurora-scheduler.md
----------------------------------------------------------------------
diff --git a/docs/deploying-aurora-scheduler.md b/docs/deploying-aurora-scheduler.md
index 380577e..9b89335 100644
--- a/docs/deploying-aurora-scheduler.md
+++ b/docs/deploying-aurora-scheduler.md
@@ -1,12 +1,41 @@
-The Aurora scheduler is responsible for scheduling new jobs, rescheduling failed jobs, and killing
-old jobs.
+# Deploying the Aurora Scheduler
+
+When setting up your cluster, you will install the scheduler on a small number (usually 3 or 5) of
+machines.  This guide helps you get the scheduler set up and troubleshoot some common hurdles.
+
+- [Installing Aurora](#installing-aurora)
+  - [Creating the Distribution .zip File (Optional)](#creating-the-distribution-zip-file-optional)
+  - [Installing Aurora](#installing-aurora-1)
+- [Configuring Aurora](#configuring-aurora)
+  - [A Note on Configuration](#a-note-on-configuration)
+  - [Replicated Log Configuration](#replicated-log-configuration)
+  - [Initializing the Replicated Log](#initializing-the-replicated-log)
+  - [Storage Performance Considerations](#storage-performance-considerations)
+  - [Network considerations](#network-considerations)
+- [Running Aurora](#running-aurora)
+  - [Maintaining an Aurora Installation](#maintaining-an-aurora-installation)
+  - [Monitoring](#monitoring)
+  - [Running stateful services](#running-stateful-services)
+    - [Dedicated attribute](#dedicated-attribute)
+      - [Syntax](#syntax)
+      - [Example](#example)
+- [Common problems](#common-problems)
+  - [Replicated log not initialized](#replicated-log-not-initialized)
+    - [Symptoms](#symptoms)
+    - [Solution](#solution)
+  - [Scheduler not registered](#scheduler-not-registered)
+    - [Symptoms](#symptoms-1)
+    - [Solution](#solution-1)
+  - [Tasks are stuck in PENDING forever](#tasks-are-stuck-in-pending-forever)
+    - [Symptoms](#symptoms-2)
+    - [Solution](#solution-2)
 
-# Installing Aurora
-Aurora is a standalone Java server. As part of the build process it creates a bundle of all its
-dependencies, with the notable exceptions of the JVM and libmesos. Each target server should have
-a JVM (Java 7 or higher) and libmesos (0.18.0) installed.
+## Installing Aurora
+The Aurora scheduler is a standalone Java server. As part of the build process it creates a bundle
+of all its dependencies, with the notable exceptions of the JVM and libmesos. Each target server
+should have a JVM (Java 7 or higher) and libmesos (0.20.0) installed.
 
-## Creating the Distribution .zip File (Optional)
+### Creating the Distribution .zip File (Optional)
 To create a distribution for installation you will need build tools installed. On Ubuntu this can be
 done with `sudo apt-get install build-essential default-jdk`.
 
@@ -16,16 +45,16 @@ done with `sudo apt-get install build-essential default-jdk`.
 
 Copy the generated `dist/distributions/aurora-scheduler-*.zip` to each node that will run a scheduler.
 
-## Installing Aurora
+### Installing Aurora
 Extract the aurora-scheduler zip file. The example configurations assume it is extracted to
 `/usr/local/aurora-scheduler`.
 
     sudo unzip dist/distributions/aurora-scheduler-*.zip -d /usr/local
     sudo ln -nfs "$(ls -dt /usr/local/aurora-scheduler-* | head -1)" /usr/local/aurora-scheduler
 
-# Configuring Aurora
+## Configuring Aurora
 
-## A Note on Configuration
+### A Note on Configuration
 Like Mesos, Aurora uses command-line flags for runtime configuration. As such the Aurora
 "configuration file" is typically a `scheduler.sh` shell script of the form.
 
@@ -59,7 +88,7 @@ documentation run
 
     /usr/local/aurora-scheduler/bin/aurora-scheduler -help
 
-## Replicated Log Configuration
+### Replicated Log Configuration
 All Aurora state is persisted to a replicated log. This includes all jobs Aurora is running
 including where in the cluster they are being run and the configuration for running them, as
 well as other information such as metadata needed to reconnect to the Mesos master, resource
@@ -83,7 +112,7 @@ should be set to `2`, and in a cluster of 5 it should be set to `3`.
 
 *Incorrectly setting this flag will cause data corruption to occur!*
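
The two values quoted in the hunk context above (`2` for a cluster of 3, `3` for a cluster of 5) are consistent with a simple majority of the schedulers. A minimal sketch of that arithmetic, assuming the quorum is always the smallest majority; `quorum_size` is a hypothetical helper for illustration and not part of Aurora:

```python
def quorum_size(num_schedulers: int) -> int:
    """Return the smallest majority of num_schedulers replicas.

    Assumption: the quorum is a strict majority, which matches the two
    values quoted in the document (3 -> 2, 5 -> 3).
    """
    if num_schedulers < 1:
        raise ValueError("need at least one scheduler")
    return num_schedulers // 2 + 1


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(f"{n} scheduler(s): quorum size {quorum_size(n)}")
```

Computing the value from the scheduler count, rather than typing it by hand, may help avoid the misconfiguration the warning above describes.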
 
-## Initializing the Replicated Log
+### Initializing the Replicated Log
 Before you start Aurora you will also need to initialize the log on the first master.
 
     mesos-log initialize --path="$AURORA_HOME/scheduler/db"
@@ -92,11 +121,11 @@ Failing to do this will result the following message when you try to start the s
 
     Replica in EMPTY status received a broadcasted recover request
 
-## Storage Performance Considerations
+### Storage Performance Considerations
 
 See [this document](scheduler-storage.md) for scheduler storage performance considerations.
 
-## Network considerations
+### Network considerations
 The Aurora scheduler listens on 2 ports - an HTTP port used for client RPCs and a web UI,
 and a libprocess (HTTP+Protobuf) port used to communicate with the Mesos master and for the log
 replication protocol. These can be left unconfigured (the scheduler publishes all selected ports
@@ -112,7 +141,7 @@ to ZooKeeper) or explicitly set in the startup script as follows:
     export LIBPROCESS_PORT=8083
     # ...
 
-# Running Aurora
+## Running Aurora
 Configure a supervisor like [Monit](http://mmonit.com/monit/) or
 [supervisord](http://supervisord.org/) to run the created `scheduler.sh` file and restart it
 whenever it fails. Aurora expects to be restarted by an external process when it fails. Aurora
@@ -126,16 +155,16 @@ For example, monit can be configured with
 
 assuming you set `-http_port=8081`.
 
-## Maintaining an Aurora Installation
+### Maintaining an Aurora Installation
 
-## Monitoring
+### Monitoring
 Please see our dedicated [monitoring guide](monitoring.md) for in-depth discussion on monitoring.
 
-## Running stateful services
+### Running stateful services
 Aurora is best suited to run stateless applications, but it also accommodates stateful services
 like databases, or services that otherwise need to always run on the same machines.
 
-### Dedicated attribute
+#### Dedicated attribute
 The Mesos slave has the `--attributes` command line argument which can be used to mark a slave with
 static attributes (not to be confused with `--resources`, which are dynamic and accounted).
 
@@ -145,14 +174,14 @@ constraints are arbitrary and available for custom use.  There is one exception,
 `dedicated` attribute.  Aurora treats this specially, and only allows matching jobs to run on these
 machines, and will only schedule matching jobs on these machines.
 
-#### Syntax
+##### Syntax
 The dedicated attribute has semantic meaning. The format is `$role(/.*)?`. When a job is created,
 the scheduler requires that the `$role` component matches the `role` field in the job
 configuration, and will reject the job creation otherwise.  The remainder of the attribute is
 free-form. We've developed the idiom of formatting this attribute as `$role/$job`, but do not
 enforce this.
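
The `$role(/.*)?` format described above can be checked with a regular expression. This is a sketch under stated assumptions, not the scheduler's actual validation code: `dedicated_matches_role` is a hypothetical helper, and the sketch assumes the role must match exactly, optionally followed by `/` and free-form text.

```python
import re


def dedicated_matches_role(dedicated_value: str, job_role: str) -> bool:
    """Check a dedicated attribute value against the `$role(/.*)?` format.

    The role is escaped so that characters in it are matched literally;
    anything after an optional '/' is treated as free-form.
    """
    pattern = re.escape(job_role) + r"(/.*)?"
    return re.fullmatch(pattern, dedicated_value) is not None


if __name__ == "__main__":
    print(dedicated_matches_role("db_team/redis", "db_team"))
    print(dedicated_matches_role("web_team/redis", "db_team"))
```

Using `re.fullmatch` rather than `re.match` means a value such as `db_team2/redis` is not accepted for the role `db_team`.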
 
-#### Example
+##### Example
 Consider the following slave command line:
 
     mesos-slave --attributes="host:$HOST;rack:$RACK;dedicated:db_team/redis" ...
@@ -171,3 +200,57 @@ And this job configuration:
 The job configuration is indicating that it should only be scheduled on slaves with the attribute
 `dedicated:db_team/redis`.  Additionally, Aurora will prevent any tasks that do _not_ have that
 constraint from running on those slaves.
+
+
+## Common problems
+So you've started your first cluster and are running into some issues? We've collected some common
+stumbling blocks and solutions here to help get you moving.
+
+### Replicated log not initialized
+
+#### Symptoms
+- Scheduler RPCs and web interface claim `Storage is not READY`
+- Scheduler log repeatedly prints messages like
+
+  ```
+  I1016 16:12:27.234133 26081 replica.cpp:638] Replica in EMPTY status
+  received a broadcasted recover request
+  I1016 16:12:27.234256 26084 recover.cpp:188] Received a recover response
+  from a replica in EMPTY status
+  ```
+
+#### Solution
+When you create a new cluster, you need to inform a quorum of schedulers that they are safe to
+consider their database to be empty by [initializing](#initializing-the-replicated-log) the
+replicated log. This is done to prevent the scheduler from modifying the cluster state in the event
+of multiple simultaneous disk failures or, more likely, misconfiguration of the replicated log path.
+
+### Scheduler not registered
+
+#### Symptoms
+Scheduler log contains
+
+    Framework has not been registered within the tolerated delay.
+
+#### Solution
+Double-check that the scheduler is configured correctly to reach the master. If you are registering
+the master in ZooKeeper, make sure the command line argument to the master:
+
+    --zk=zk://$ZK_HOST:2181/mesos/master
+
+is the same as the one on the scheduler:
+
+    -mesos_master_address=zk://$ZK_HOST:2181/mesos/master
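
The comparison described above can be automated. The following is a minimal sketch, assuming both command lines are available as lists of `flag=value` strings; `flag_value` and `same_master_path` are hypothetical helpers, and the host name `zk1` in the usage example is a placeholder.

```python
from typing import List, Optional


def flag_value(args: List[str], flag: str) -> Optional[str]:
    """Return the value of the first `flag=value` argument, or None."""
    prefix = flag + "="
    for arg in args:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return None


def same_master_path(master_args: List[str], scheduler_args: List[str]) -> bool:
    """True if the master's --zk and the scheduler's -mesos_master_address agree."""
    master_zk = flag_value(master_args, "--zk")
    scheduler_zk = flag_value(scheduler_args, "-mesos_master_address")
    return master_zk is not None and master_zk == scheduler_zk


if __name__ == "__main__":
    master = ["--zk=zk://zk1:2181/mesos/master"]
    scheduler = ["-mesos_master_address=zk://zk1:2181/mesos/master"]
    print(same_master_path(master, scheduler))
```

The check is a plain string comparison, so two equivalent ZooKeeper URLs written differently (for example, the same hosts listed in a different order) would be reported as a mismatch.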
+
+### Tasks are stuck in `PENDING` forever
+
+#### Symptoms
+The scheduler is registered, and [receiving offers](docs/monitoring.md#scheduler_resource_offers),
+but tasks are perpetually shown as `PENDING - Constraint not satisfied: host`.
+
+#### Solution
+Check that your slaves are configured with `host` and `rack` attributes.  Aurora requires that
+slaves are tagged with these two common failure domains to ensure that it can safely place tasks
+such that jobs are resilient to failure.
+
+See our [vagrant example](examples/vagrant/upstart/mesos-slave.conf) for details.
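
The `host` and `rack` check from the last solution can be sketched as a small script. It assumes the `name:value;name:value` form shown in the `mesos-slave --attributes=...` example earlier in the diff and uses a simplified parse; `parse_attributes` and `missing_required` are hypothetical helpers, not part of Mesos or Aurora.

```python
from typing import Dict, List, Sequence


def parse_attributes(spec: str) -> Dict[str, str]:
    """Parse a `name:value;name:value` attribute string into a dict.

    Simplified: items without a ':' are ignored, and a repeated name
    keeps only its last value.
    """
    attrs: Dict[str, str] = {}
    for item in spec.split(";"):
        name, sep, value = item.partition(":")
        if sep:
            attrs[name.strip()] = value.strip()
    return attrs


def missing_required(spec: str, required: Sequence[str] = ("host", "rack")) -> List[str]:
    """Return the required attribute names that are absent or empty."""
    attrs = parse_attributes(spec)
    return [name for name in required if not attrs.get(name)]


if __name__ == "__main__":
    print(missing_required("host:h1;rack:r1;dedicated:db_team/redis"))
    print(missing_required("host:h1"))
```

An empty result means both attributes are present; any names returned are the ones to add to the slave's `--attributes` argument.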

