aurora-commits mailing list archives

From dles...@apache.org
Subject svn commit: r1632428 - in /incubator/aurora/site: publish/documentation/latest/clientcommands/ publish/documentation/latest/committers/ publish/documentation/latest/configurationreference/ publish/documentation/latest/configurationtutorial/ publish/doc...
Date Thu, 16 Oct 2014 20:00:27 GMT
Author: dlester
Date: Thu Oct 16 20:00:27 2014
New Revision: 1632428

URL: http://svn.apache.org/r1632428
Log:
Updates website documentation.

Added:
    incubator/aurora/site/publish/documentation/latest/monitoring/
    incubator/aurora/site/publish/documentation/latest/monitoring/index.html
    incubator/aurora/site/publish/documentation/latest/scheduler-storage/
    incubator/aurora/site/publish/documentation/latest/scheduler-storage/index.html
    incubator/aurora/site/source/documentation/latest/monitoring.md
    incubator/aurora/site/source/documentation/latest/scheduler-storage.md
Removed:
    incubator/aurora/site/publish/documentation/latest/clientcommands/
    incubator/aurora/site/publish/documentation/latest/configurationreference/
    incubator/aurora/site/publish/documentation/latest/configurationtutorial/
    incubator/aurora/site/publish/documentation/latest/resourceisolation/
    incubator/aurora/site/publish/documentation/latest/userguide/
Modified:
    incubator/aurora/site/publish/documentation/latest/committers/index.html
    incubator/aurora/site/publish/documentation/latest/deploying-aurora-scheduler/index.html
    incubator/aurora/site/source/documentation/latest/committers.md
    incubator/aurora/site/source/documentation/latest/deploying-aurora-scheduler.md

Modified: incubator/aurora/site/publish/documentation/latest/committers/index.html
URL: http://svn.apache.org/viewvc/incubator/aurora/site/publish/documentation/latest/committers/index.html?rev=1632428&r1=1632427&r2=1632428&view=diff
==============================================================================
--- incubator/aurora/site/publish/documentation/latest/committers/index.html (original)
+++ incubator/aurora/site/publish/documentation/latest/committers/index.html Thu Oct 16 20:00:27 2014
@@ -79,6 +79,33 @@ email forwarding address at</p>
 <p>The recommended setup is to configure all services (mailing lists, JIRA, ReviewBoard) to send
 emails to your @apache.org email address.</p>
 
+<h2 id="creating-a-gpg-key-for-releases">Creating a gpg key for releases</h2>
+
+<p>In order to create a release candidate you will need a gpg key published to an external key server
+and that key will need to be added to our KEYS file as well.</p>
+
+<ol>
+<li><p>Create a key:</p>
+<pre class="highlight text">       gpg --gen-key
+</pre></li>
+<li><p>Add your gpg key to the Apache Aurora KEYS file:</p>
+<pre class="highlight text">       git clone https://git-wip-us.apache.org/repos/asf/incubator-aurora.git
+       (gpg --list-sigs &lt;KEY ID&gt; &amp;&amp; gpg --armor --export &lt;KEY ID&gt;) &gt;&gt; KEYS
+       git add KEYS &amp;&amp; git commit -m &quot;Adding gpg key for &lt;APACHE ID&gt;&quot;
+       ./rbt post -o -g
+</pre></li>
+<li><p>Publish the key to an external key server:</p>
+<pre class="highlight text">       gpg --keyserver pgp.mit.edu --send-keys &lt;KEY ID&gt;
+</pre></li>
+<li><p>Apply the KEYS file changes to the Apache Aurora svn dist locations listed below:</p>
+<pre class="highlight text">       https://dist.apache.org/repos/dist/dev/incubator/aurora/KEYS
+       https://dist.apache.org/repos/dist/release/incubator/aurora/KEYS
+</pre></li>
+<li><p>Add your key to git config for use with the release scripts:</p>
+<pre class="highlight text">       git config --global user.signingkey &lt;KEY ID&gt;
+</pre></li>
+</ol>
+
 <h2 id="creating-a-release">Creating a release</h2>
 
 <p>The following will guide you through the steps to create a release candidate, vote, and finally an

Modified: incubator/aurora/site/publish/documentation/latest/deploying-aurora-scheduler/index.html
URL: http://svn.apache.org/viewvc/incubator/aurora/site/publish/documentation/latest/deploying-aurora-scheduler/index.html?rev=1632428&r1=1632427&r2=1632428&view=diff
==============================================================================
--- incubator/aurora/site/publish/documentation/latest/deploying-aurora-scheduler/index.html (original)
+++ incubator/aurora/site/publish/documentation/latest/deploying-aurora-scheduler/index.html Thu Oct 16 20:00:27 2014
@@ -168,6 +168,18 @@ should be set to <code>2</code>, and in 
 
 <p><em>Incorrectly setting this flag will cause data corruption to occur!</em></p>
 
+<h2 id="initializing-the-replicated-log">Initializing the Replicated Log</h2>
+
+<p>Before you start Aurora you will also need to initialize the log on the first master.</p>
+<pre class="highlight text">mesos-log initialize --path=&quot;$AURORA_HOME/scheduler/db&quot;
+</pre>
+<p>Failing to do this will result in the following message when you try to start the scheduler.</p>
+<pre class="highlight text">Replica in EMPTY status received a broadcasted recover request
+</pre>
+<h2 id="storage-performance-considerations">Storage Performance Considerations</h2>
+
+<p>See <a href="/documentation/latest/scheduler-storage/">this document</a> for scheduler storage performance considerations.</p>
+
 <h2 id="network-considerations">Network considerations</h2>
 
 <p>The Aurora scheduler listens on 2 ports - an HTTP port used for client RPCs and a web UI,
@@ -198,13 +210,54 @@ restarted.</p>
 </pre>
 <p>assuming you set <code>-http_port=8081</code>.</p>
 
-<h1 id="maintaining-an-aurora-installation">Maintaining an Aurora Installation</h1>
+<h2 id="maintaining-an-aurora-installation">Maintaining an Aurora Installation</h2>
 
 <h2 id="monitoring">Monitoring</h2>
 
-<p>Aurora exports performance metrics via its HTTP interface <code>/vars</code> and <code>/vars.json</code> contain lots of
-useful data to help debug performance and configuration problems. These are all made available via
-<a href="https://github.com/twitter/commons/tree/master/src/java/com/twitter/commons/http">twitter.common.http</a>.</p>
+<p>Please see our dedicated <a href="/documentation/latest/monitoring/">monitoring guide</a> for in-depth discussion on monitoring.</p>
+
+<h2 id="running-stateful-services">Running stateful services</h2>
+
+<p>Aurora is best suited to run stateless applications, but it also accommodates stateful services
+like databases, or services that otherwise need to always run on the same machines.</p>
+
+<h3 id="dedicated-attribute">Dedicated attribute</h3>
+
+<p>The Mesos slave has the <code>--attributes</code> command line argument which can be used to mark a slave with
+static attributes (not to be confused with <code>--resources</code>, which are dynamic and accounted).</p>
+
+<p>Aurora makes these attributes available for matching with scheduling
+<a href="configuration-reference.md#specifying-scheduling-constraints">constraints</a>.  Most of these
+constraints are arbitrary and available for custom use.  There is one exception, though: the
+<code>dedicated</code> attribute.  Aurora treats this specially: only matching jobs may run on these
+machines, and matching jobs are scheduled only on these machines.</p>
+
+<h4 id="syntax">Syntax</h4>
+
+<p>The dedicated attribute has semantic meaning. The format is <code>$role(/.*)?</code>. When a job is created,
+the scheduler requires that the <code>$role</code> component matches the <code>role</code> field in the job
+configuration, and will reject the job creation otherwise.  The remainder of the attribute is
+free-form. We&rsquo;ve developed the idiom of formatting this attribute as <code>$role/$job</code>, but do not
+enforce this.</p>
+
+<h4 id="example">Example</h4>
+
+<p>Consider the following slave command line:</p>
+<pre class="highlight text">mesos-slave --attributes=&quot;host:$HOST;rack:$RACK;dedicated:db_team/redis&quot; ...
+</pre>
+<p>And this job configuration:</p>
+<pre class="highlight text">Service(
+  name = &#39;redis&#39;,
+  role = &#39;db_team&#39;,
+  constraints = {
+    &#39;dedicated&#39;: &#39;db_team/redis&#39;
+  },
+  ...
+)
+</pre>
+<p>The job configuration indicates that it should only be scheduled on slaves with the attribute
+<code>dedicated:db_team/redis</code>.  Additionally, Aurora will prevent any tasks that do <em>not</em> have that
+constraint from running on those slaves.</p>
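The `$role(/.*)?` rule described above can be sketched in a few lines of Python. This is only an illustration of the matching rule, not the scheduler's code; the function name `dedicated_matches_role` is hypothetical.

```python
import re


def dedicated_matches_role(dedicated: str, role: str) -> bool:
    """Return True if `dedicated` has the form `$role(/.*)?` for the given role."""
    # The role must match literally; anything after an optional "/" is free-form.
    pattern = re.escape(role) + r"(/.*)?"
    return re.fullmatch(pattern, dedicated) is not None


print(dedicated_matches_role("db_team/redis", "db_team"))   # True
print(dedicated_matches_role("db_team", "db_team"))         # True
print(dedicated_matches_role("web_team/redis", "db_team"))  # False
```

Under this reading, a job with `role = 'db_team'` could use `db_team/redis` or any other `db_team/...` value, while a value that starts with a different role would be rejected.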
 
 	  </div>
       <div class="container">

Added: incubator/aurora/site/publish/documentation/latest/monitoring/index.html
URL: http://svn.apache.org/viewvc/incubator/aurora/site/publish/documentation/latest/monitoring/index.html?rev=1632428&view=auto
==============================================================================
--- incubator/aurora/site/publish/documentation/latest/monitoring/index.html (added)
+++ incubator/aurora/site/publish/documentation/latest/monitoring/index.html Thu Oct 16 20:00:27 2014
@@ -0,0 +1,352 @@
+<html>
+    <head>
+        <meta charset="utf-8">
+        <title>Apache Aurora</title>
+		    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+		    <meta name="description" content="">
+		    <meta name="author" content="">
+
+		    <link href="/assets/css/bootstrap.css" rel="stylesheet">
+		    <link href="/assets/css/bootstrap-responsive.min.css" rel="stylesheet">
+		    <link href="/assets/css/main.css" rel="stylesheet">
+				
+		    <!-- JS -->
+		    <script type="text/javascript" src="/assets/js/jquery-1.10.1.min.js"></script>
+		    <script type="text/javascript" src="/assets/js/bootstrap-dropdown.js"></script>
+		
+				<!-- Analytics -->
+				<script type="text/javascript">
+					  var _gaq = _gaq || [];
+					  _gaq.push(['_setAccount', 'UA-45879646-1']);
+					  _gaq.push(['_setDomainName', 'apache.org']);
+					  _gaq.push(['_trackPageview']);
+
+					  (function() {
+					    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+					    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+					    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+					  })();
+				</script>
+	</head>
+    <body>	
+      <div class="navbar navbar-static-top">
+  <div class="navbar-inner">
+    <div class="container">
+	    <a href="/" class="logo"><img src="/assets/img/aurora_logo.png" alt="Apache Aurora logo" /></a>
+      <ul class="nav">
+				<li><a href="/documentation/latest/">Documentation</a></li>
+        <li><a href="/downloads/">Download</a></li>
+        <li><a href="/community">Community</a></li>
+      </ul>
+    </div>
+  </div>
+</div>
+
+<div class="container">
+<!-- magical breadcrumbs -->
+<ul class="breadcrumb">
+  <li>
+    <div class="dropdown">
+      <a class="dropdown-toggle" data-toggle="dropdown" href="#">Apache Software Foundation <b class="caret"></b></a>
+      <ul class="dropdown-menu" role="menu">
+        <li><a href="http://www.apache.org">Apache Homepage</a></li>
+        <li><a href="http://www.apache.org/licenses/">Apache License</a></li>
+        <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>  
+        <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+        <li><a href="http://www.apache.org/security/">Security</a></li>
+      </ul>
+    </div>
+  </li>
+  <li><span class="divider">&bull;</span></li>
+  <li><a href="http://incubator.apache.org">Apache Incubator</a></li>
+  <li><span class="divider">&bull;</span></li>
+  <li><a href="http://aurora.incubator.apache.org">Apache Aurora</a></li>
+</ul>
+<!-- /breadcrumb -->
+	
+      <div class="container">
+        <h1 id="monitoring-your-aurora-cluster">Monitoring your Aurora cluster</h1>
+
+<p>Before you start running important services in your Aurora cluster, it&rsquo;s important to set up
+monitoring and alerting of Aurora itself.  Most of your monitoring can be against the scheduler,
+since it will give you a global view of what&rsquo;s going on.</p>
+
+<h2 id="reading-stats">Reading stats</h2>
+
+<p>The scheduler exposes a <em>lot</em> of instrumentation data via its HTTP interface. You can get a quick
+peek at the first few of these in our vagrant image:</p>
+<pre class="highlight text">$ vagrant ssh -c &#39;curl -s localhost:8081/vars | head&#39;
+async_tasks_completed 1004
+attribute_store_fetch_all_events 15
+attribute_store_fetch_all_events_per_sec 0.0
+attribute_store_fetch_all_nanos_per_event 0.0
+attribute_store_fetch_all_nanos_total 3048285
+attribute_store_fetch_all_nanos_total_per_sec 0.0
+attribute_store_fetch_one_events 3391
+attribute_store_fetch_one_events_per_sec 0.0
+attribute_store_fetch_one_nanos_per_event 0.0
+attribute_store_fetch_one_nanos_total 454690753
+</pre>
+<p>These values are served as <code>Content-Type: text/plain</code>, with each line containing a space-separated metric
+name and value. Values may be integers, doubles, or strings (note: strings are static, others
+may be dynamic).</p>
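As a rough sketch of consuming the plain-text format described above, the following Python splits each line at the first space and converts the value to an integer or double where possible. The function name `parse_vars` and the inline sample are assumptions for illustration; in practice the text would come from the scheduler's `/vars` endpoint.

```python
def parse_vars(text):
    """Parse `name value` lines into a dict, converting numbers where possible."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # The metric name ends at the first space; the rest is the value.
        name, _, raw = line.partition(" ")
        try:
            value = int(raw)
        except ValueError:
            try:
                value = float(raw)
            except ValueError:
                value = raw  # non-numeric values are kept as strings
        stats[name] = value
    return stats


sample = """async_tasks_completed 1004
attribute_store_fetch_all_events_per_sec 0.0
attribute_store_fetch_all_nanos_total 3048285"""

stats = parse_vars(sample)
print(stats["async_tasks_completed"])  # 1004
```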
+
+<p>If your monitoring infrastructure prefers JSON, the scheduler exports that as well:</p>
+<pre class="highlight text">$ vagrant ssh -c &#39;curl -s localhost:8081/vars.json | python -mjson.tool | head&#39;
+{
+    &quot;async_tasks_completed&quot;: 1009,
+    &quot;attribute_store_fetch_all_events&quot;: 15,
+    &quot;attribute_store_fetch_all_events_per_sec&quot;: 0.0,
+    &quot;attribute_store_fetch_all_nanos_per_event&quot;: 0.0,
+    &quot;attribute_store_fetch_all_nanos_total&quot;: 3048285,
+    &quot;attribute_store_fetch_all_nanos_total_per_sec&quot;: 0.0,
+    &quot;attribute_store_fetch_one_events&quot;: 3409,
+    &quot;attribute_store_fetch_one_events_per_sec&quot;: 0.0,
+    &quot;attribute_store_fetch_one_nanos_per_event&quot;: 0.0,
+</pre>
+<p>This will be the same data as above, served with <code>Content-Type: application/json</code>.</p>
+
+<h2 id="viewing-live-stat-samples-on-the-scheduler">Viewing live stat samples on the scheduler</h2>
+
+<p>The scheduler uses the Twitter commons stats library, which keeps an internal time-series database
+of exported variables - nearly everything in <code>/vars</code> is available for instant graphing.  This is
+useful for debugging, but is not a replacement for an external monitoring system.</p>
+
+<p>You can view these graphs on a scheduler at <code>/graphview</code>.  It supports some composition and
+aggregation of values, which can be invaluable when triaging a problem.  For example, if you have
+the scheduler running in vagrant, check out these links:
+<a href="http://192.168.33.7:8081/graphview?query=jvm_uptime_secs">simple graph</a> and
+<a href="http://192.168.33.7:8081/graphview?query=rate(scheduler_log_native_append_nanos_total)%2Frate(scheduler_log_native_append_events)%2F1e6">complex composition</a>.</p>
+
+<h3 id="counters-and-gauges">Counters and gauges</h3>
+
+<p>Among numeric stats, there are two fundamental types of stats exported: <em>counters</em> and <em>gauges</em>.
+Counters are guaranteed to be monotonically-increasing for the lifetime of a process, while gauges
+may decrease in value.  Aurora uses counters to represent things like the number of times an event
+has occurred, and gauges to capture things like the current length of a queue.  Counters are a
+natural fit for accurate composition into <a href="http://en.wikipedia.org/wiki/Rate_ratio">rate ratios</a>
+(useful for sample-resistant latency calculation), while gauges are not.</p>
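To make the rate-ratio idea concrete, here is a minimal Python sketch that composes two counters from a pair of samples. Because both counters are sampled over the same interval, the elapsed time cancels and the ratio reduces to the change in one counter divided by the change in the other. The function name `rate_ratio` and the sample numbers are made up for illustration.

```python
def rate_ratio(prev, curr, numerator, denominator):
    """Average numerator-per-denominator between two samples of counters.

    Returns None if the denominator counter did not advance, or if either
    counter went backwards (for example after a process restart).
    """
    d_num = curr[numerator] - prev[numerator]
    d_den = curr[denominator] - prev[denominator]
    if d_num < 0 or d_den <= 0:
        return None
    return d_num / d_den


# Two hypothetical samples of the scheduler's stats, taken some time apart.
prev = {"scheduler_log_native_append_nanos_total": 4_000_000,
        "scheduler_log_native_append_events": 10}
curr = {"scheduler_log_native_append_nanos_total": 10_000_000,
        "scheduler_log_native_append_events": 13}

nanos_per_append = rate_ratio(prev, curr,
                              "scheduler_log_native_append_nanos_total",
                              "scheduler_log_native_append_events")
print(nanos_per_append / 1e6)  # 2.0 (milliseconds per append)
```

The same calculation on a gauge would not be meaningful, since a gauge's difference between two samples says nothing about what happened in between.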
+
+<h1 id="alerting">Alerting</h1>
+
+<h2 id="quickstart">Quickstart</h2>
+
+<p>If you are looking for just bare-minimum alerting to get something in place quickly, set up alerting
+on <code>framework_registered</code> and <code>task_store_LOST</code>. These will give you a decent picture of overall
+health.</p>
+
+<h2 id="a-note-on-thresholds">A note on thresholds</h2>
+
+<p>One of the most difficult things in monitoring is choosing alert thresholds. With many of these
+stats, there is no value we can offer as a threshold that will be guaranteed to work for you. It
+will depend on the size of your cluster, number of jobs, churn of tasks in the cluster, etc. We
+recommend you start with a strict value after viewing a small amount of collected data, and then
+adjust thresholds as you see fit. Feel free to ask us if you would like to validate that your alerts
+and thresholds make sense.</p>
+
+<h4 id="jvm_uptime_secs"><code>jvm_uptime_secs</code></h4>
+
+<p>Type: integer counter</p>
+
+<h4 id="description">Description</h4>
+
+<p>The number of seconds the JVM process has been running. Comes from
+<a href="http://docs.oracle.com/javase/7/docs/api/java/lang/management/RuntimeMXBean.html#getUptime()">RuntimeMXBean#getUptime()</a>.</p>
+
+<h4 id="alerting">Alerting</h4>
+
+<p>Detecting resets (decreasing values) on this stat will tell you that the scheduler is failing to
+stay alive.</p>
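One way to detect such resets, assuming your monitoring system keeps a list of successive `jvm_uptime_secs` samples, is to look for any sample that is lower than its predecessor. The function name `find_resets` and the sample values below are hypothetical.

```python
def find_resets(samples):
    """Return the indices where a counter sample is lower than the one before it."""
    return [i for i in range(1, len(samples)) if samples[i] < samples[i - 1]]


# Hypothetical uptime samples taken once a minute; the scheduler restarted
# somewhere between the third and fourth sample.
uptime = [120, 180, 240, 15, 75]
print(find_resets(uptime))  # [3]
```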
+
+<h4 id="triage">Triage</h4>
+
+<p>Look at the scheduler logs to identify the reason the scheduler is exiting.</p>
+
+<h4 id="system_load_avg"><code>system_load_avg</code></h4>
+
+<p>Type: double gauge</p>
+
+<h4 id="description">Description</h4>
+
+<p>The current load average of the system for the last minute. Comes from
+<a href="http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html?is-external=true#getSystemLoadAverage()">OperatingSystemMXBean#getSystemLoadAverage()</a>.</p>
+
+<h4 id="alerting">Alerting</h4>
+
+<p>A high sustained value suggests that the scheduler machine may be over-utilized.</p>
+
+<h4 id="triage">Triage</h4>
+
+<p>Use standard unix tools like <code>top</code> and <code>ps</code> to track down the offending process(es).</p>
+
+<h4 id="process_cpu_cores_utilized"><code>process_cpu_cores_utilized</code></h4>
+
+<p>Type: double gauge</p>
+
+<h4 id="description">Description</h4>
+
+<p>The current number of CPU cores in use by the JVM process. This should not exceed the number of
+logical CPU cores on the machine. Derived from
+<a href="http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html">OperatingSystemMXBean#getProcessCpuTime()</a>.</p>
+
+<h4 id="alerting">Alerting</h4>
+
+<p>A high sustained value indicates that the scheduler is overworked. Due to current internal design
+limitations, if this value is sustained at <code>1</code>, there is a good chance the scheduler is under water.</p>
+
+<h4 id="triage">Triage</h4>
+
+<p>There are two main inputs that tend to drive this figure: task scheduling attempts and status
+updates from Mesos.  You may see activity in the scheduler logs to give an indication of where
+time is being spent.  Beyond that, it really takes good familiarity with the code to effectively
+triage this.  We suggest engaging with an Aurora developer.</p>
+
+<h4 id="task_store_lost"><code>task_store_LOST</code></h4>
+
+<p>Type: integer gauge</p>
+
+<h4 id="description">Description</h4>
+
+<p>The number of tasks stored in the scheduler that are in the <code>LOST</code> state, and have been rescheduled.</p>
+
+<h4 id="alerting">Alerting</h4>
+
+<p>If this value is increasing at a high rate, it is a sign of trouble.</p>
+
+<h4 id="triage">Triage</h4>
+
+<p>There are many sources of <code>LOST</code> tasks in Mesos: the scheduler, master, slave, and executor can all
+trigger this.  The first step is to look in the scheduler logs for <code>LOST</code> to identify where the
+state changes are originating.</p>
+
+<h4 id="scheduler_resource_offers"><code>scheduler_resource_offers</code></h4>
+
+<p>Type: integer counter</p>
+
+<h4 id="description">Description</h4>
+
+<p>The number of resource offers that the scheduler has received.</p>
+
+<h4 id="alerting">Alerting</h4>
+
+<p>For a healthy scheduler, this value must be increasing over time.</p>
+
+<h4 id="triage">Triage</h4>
+
+<p>Assuming the scheduler is up and otherwise healthy, you will want to check if the master thinks it
+is sending offers. You should also look at the master&rsquo;s web interface to see if it has a large
+number of outstanding offers that it is waiting to have returned.</p>
+
+<h4 id="framework_registered"><code>framework_registered</code></h4>
+
+<p>Type: binary integer counter</p>
+
+<h4 id="description">Description</h4>
+
+<p>Will be <code>1</code> for the leading scheduler that is registered with the Mesos master, <code>0</code> for passive
+schedulers.</p>
+
+<h4 id="alerting">Alerting</h4>
+
+<p>A sustained period without a <code>1</code> (or where <code>sum() != 1</code>) warrants investigation.</p>
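The `sum() != 1` condition can be sketched as follows, assuming you have already collected the current `framework_registered` value from every scheduler. The function name `leadership_problem` and the host names are hypothetical, and a real alert should require the condition to hold for a sustained period rather than a single sample.

```python
def leadership_problem(registered_by_scheduler):
    """Describe a leadership problem for one sample, or return None if healthy."""
    total = sum(registered_by_scheduler.values())
    if total == 0:
        return "no leading scheduler"
    if total > 1:
        return "multiple schedulers claim leadership"
    return None


# Hypothetical samples keyed by scheduler host.
healthy = {"sched1": 1, "sched2": 0, "sched3": 0}
leaderless = {"sched1": 0, "sched2": 0, "sched3": 0}

print(leadership_problem(healthy))     # None
print(leadership_problem(leaderless))  # no leading scheduler
```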
+
+<h4 id="triage">Triage</h4>
+
+<p>If there is no leading scheduler, look in the scheduler and master logs for why.  If there are
+multiple schedulers claiming leadership, this suggests a split brain and warrants filing a critical
+bug.</p>
+
+<h4 id="rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)"><code>rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)</code></h4>
+
+<p>Type: rate ratio of integer counters</p>
+
+<h4 id="description">Description</h4>
+
+<p>This composes two counters to compute a windowed figure for the latency of replicated log writes.</p>
+
+<h4 id="alerting">Alerting</h4>
+
+<p>A hike in this value suggests disk bandwidth contention.</p>
+
+<h4 id="triage">Triage</h4>
+
+<p>Look in scheduler logs for any reported oddness with saving to the replicated log. Also use
+standard tools like <code>vmstat</code> and <code>iotop</code> to identify whether the disk has become slow or
+over-utilized. We suggest using a dedicated disk for the replicated log to mitigate this.</p>
+
+<h4 id="timed_out_tasks"><code>timed_out_tasks</code></h4>
+
+<p>Type: integer counter</p>
+
+<h4 id="description">Description</h4>
+
+<p>Tracks the number of times the scheduler has given up while waiting
+(for <code>-transient_task_state_timeout</code>) to hear back about a task that is in a transient state
+(e.g. <code>ASSIGNED</code>, <code>KILLING</code>), and has moved it to <code>LOST</code> before rescheduling.</p>
+
+<h4 id="alerting">Alerting</h4>
+
+<p>This value is currently known to increase occasionally when the scheduler fails over
+(<a href="https://issues.apache.org/jira/browse/AURORA-740">AURORA-740</a>). However, any large spike in this
+value warrants investigation.</p>
+
+<h4 id="triage">Triage</h4>
+
+<p>The scheduler will log when it times out a task. You should trace the task ID of the timed out
+task into the master, slave, and/or executors to determine where the message was dropped.</p>
+
+<h4 id="http_500_responses_events"><code>http_500_responses_events</code></h4>
+
+<p>Type: integer counter</p>
+
+<h4 id="description">Description</h4>
+
+<p>The total number of HTTP 500 status responses sent by the scheduler. Includes API and asset serving.</p>
+
+<h4 id="alerting">Alerting</h4>
+
+<p>An increase warrants investigation.</p>
+
+<h4 id="triage">Triage</h4>
+
+<p>Look in scheduler logs to identify why the scheduler returned a 500; there should be a stack trace.</p>
+
+	  </div>
+      <div class="container">
+    <hr>
+    <footer class="footer">
+        <div class="row-fluid">
+            <div class="span2 text-left">
+                <h3>Links</h3>
+                <ul class="unstyled">
+                    <li><a href="/downloads/">Downloads</a></li>
+                    <li><a href="/developers/">Developers</a></li>                    
+                </ul>
+            </div>
+            <div class="span3 text-left">
+                <h3>Community</h3>
+                <ul class="unstyled">
+                    <li><a href="/community/">Mailing Lists</a></li>
+                    <li><a href="http://issues.apache.org/jira/browse/aurora">Issue Tracking</a></li>
+                    <li><a href="/docs/howtocontribute/">How To Contribute</a></li>
+                </ul>
+            </div>
+            <div class="span7 text-left">
+            	<h3>Apache Software Foundation</h3>
+
+							<div class="span8">
+                Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. Apache, Apache Thrift, and the Apache feather logo are trademarks of The Apache Software Foundation. Currently part of the <a href="http://incubator.apache.org">Apache Incubator</a>.
+							</div>
+							<div class=" pull-right">
+								<a href="http://incubator.apache.org" class="logo"><img src="/assets/img/apache_incubator_logo.png" alt="Apache Incubator" class="pull-right"/></a>
+							</div>
+            </div>
+
+        </div>
+
+    </footer>
+</div>
+
+	</body>
+</html>
+

Added: incubator/aurora/site/publish/documentation/latest/scheduler-storage/index.html
URL: http://svn.apache.org/viewvc/incubator/aurora/site/publish/documentation/latest/scheduler-storage/index.html?rev=1632428&view=auto
==============================================================================
--- incubator/aurora/site/publish/documentation/latest/scheduler-storage/index.html (added)
+++ incubator/aurora/site/publish/documentation/latest/scheduler-storage/index.html Thu Oct 16 20:00:27 2014
@@ -0,0 +1,154 @@
+<html>
+    <head>
+        <meta charset="utf-8">
+        <title>Apache Aurora</title>
+		    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+		    <meta name="description" content="">
+		    <meta name="author" content="">
+
+		    <link href="/assets/css/bootstrap.css" rel="stylesheet">
+		    <link href="/assets/css/bootstrap-responsive.min.css" rel="stylesheet">
+		    <link href="/assets/css/main.css" rel="stylesheet">
+				
+		    <!-- JS -->
+		    <script type="text/javascript" src="/assets/js/jquery-1.10.1.min.js"></script>
+		    <script type="text/javascript" src="/assets/js/bootstrap-dropdown.js"></script>
+		
+				<!-- Analytics -->
+				<script type="text/javascript">
+					  var _gaq = _gaq || [];
+					  _gaq.push(['_setAccount', 'UA-45879646-1']);
+					  _gaq.push(['_setDomainName', 'apache.org']);
+					  _gaq.push(['_trackPageview']);
+
+					  (function() {
+					    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+					    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+					    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+					  })();
+				</script>
+	</head>
+    <body>	
+      <div class="navbar navbar-static-top">
+  <div class="navbar-inner">
+    <div class="container">
+	    <a href="/" class="logo"><img src="/assets/img/aurora_logo.png" alt="Apache Aurora logo" /></a>
+      <ul class="nav">
+				<li><a href="/documentation/latest/">Documentation</a></li>
+        <li><a href="/downloads/">Download</a></li>
+        <li><a href="/community">Community</a></li>
+      </ul>
+    </div>
+  </div>
+</div>
+
+<div class="container">
+<!-- magical breadcrumbs -->
+<ul class="breadcrumb">
+  <li>
+    <div class="dropdown">
+      <a class="dropdown-toggle" data-toggle="dropdown" href="#">Apache Software Foundation <b class="caret"></b></a>
+      <ul class="dropdown-menu" role="menu">
+        <li><a href="http://www.apache.org">Apache Homepage</a></li>
+        <li><a href="http://www.apache.org/licenses/">Apache License</a></li>
+        <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>  
+        <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+        <li><a href="http://www.apache.org/security/">Security</a></li>
+      </ul>
+    </div>
+  </li>
+  <li><span class="divider">&bull;</span></li>
+  <li><a href="http://incubator.apache.org">Apache Incubator</a></li>
+  <li><span class="divider">&bull;</span></li>
+  <li><a href="http://aurora.incubator.apache.org">Apache Aurora</a></li>
+</ul>
+<!-- /breadcrumb -->
+	
+      <div class="container">
+        <h1 id="snapshot-performance">Snapshot Performance</h1>
+
+<p>Periodically the scheduler writes a full snapshot of its state to the replicated log. To do this
+it needs to hold a global storage write lock while it writes out this data. In large clusters
+this has been observed to take up to 40 seconds. Long pauses can cause issues in the system,
+including delays in scheduling new tasks.</p>
+
+<p>The scheduler has two optimizations to reduce the size of snapshots and thus improve snapshot
+performance: compression and deduplication. Most users will want to enable both compression
+and deduplication.</p>
+
+<h2 id="compression">Compression</h2>
+
+<p>To reduce the size of the snapshot the DEFLATE algorithm can be applied to the serialized bytes
+of the snapshot as they are written to the stream. This reduces the total number of bytes that
+need to be written to the replicated log at the cost of CPU and generally reduces the amount
+of time a snapshot takes.</p>
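The size reduction is easy to see with Python's `zlib` module, which implements DEFLATE (wrapped in a small zlib header and checksum). The payload below is a made-up stand-in for a serialized snapshot, not Aurora's actual format; it only shows why highly repetitive task records compress well.

```python
import zlib

# Made-up stand-in for a serialized snapshot: many near-identical task records.
snapshot = b"".join(
    b'{"role":"db_team","job":"redis","instance":%d}' % i for i in range(1000)
)

compressed = zlib.compress(snapshot)

# Decompression restores the original bytes exactly.
assert zlib.decompress(compressed) == snapshot
print(len(snapshot), len(compressed))
```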
+
+<h3 id="enabling-compression">Enabling Compression</h3>
+
+<p>Snapshot compression is enabled via the <code>-deflate_snapshots</code> flag. This is the default since
+Aurora 0.5.0. All released versions of Aurora can read both compressed and uncompressed snapshots,
+so there are no backwards compatibility concerns associated with changing this flag.</p>
+
+<h3 id="disabling-compression">Disabling compression</h3>
+
+<p>Disable compression by passing <code>-deflate_snapshots=false</code>.</p>
+
+<h2 id="deduplication">Deduplication</h2>
+
+<p>In Aurora 0.6.0 a new snapshot format was introduced. Rather than write one configuration blob
+per Mesos task this format stores each configuration blob once, and each Mesos task with a
+pointer to its blob. This format is not backwards compatible with earlier versions of Aurora.</p>
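The idea of storing each configuration blob once and giving each task a pointer to it can be sketched as below. This is an illustration of the concept only, under the assumption that configurations are hashable values; the names `deduplicate` and `expand` are hypothetical and the structure is not the scheduler's actual snapshot format.

```python
def deduplicate(tasks):
    """Split (task_id, config) pairs into distinct configs plus per-task pointers."""
    configs = []   # each distinct config, stored once
    index_of = {}  # config -> its position in `configs`
    refs = []      # (task_id, index into `configs`)
    for task_id, config in tasks:
        if config not in index_of:
            index_of[config] = len(configs)
            configs.append(config)
        refs.append((task_id, index_of[config]))
    return configs, refs


def expand(configs, refs):
    """Rebuild the original (task_id, config) pairs."""
    return [(task_id, configs[i]) for task_id, i in refs]


tasks = [("task-0", "redis-config"), ("task-1", "redis-config"), ("task-2", "web-config")]
configs, refs = deduplicate(tasks)
print(configs)  # ['redis-config', 'web-config']
print(refs)     # [('task-0', 0), ('task-1', 0), ('task-2', 1)]
```

For a job with many instances sharing one configuration, the blob is written once instead of once per task, which is where the savings come from.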
+
+<h3 id="enabling-deduplication">Enabling Deduplication</h3>
+
+<p>After upgrading Aurora to 0.6.0, enable deduplication with the <code>-deduplicate_snapshots</code> flag.
+After the first snapshot the cluster will be using the deduplicated format to write to the
+replicated log. Snapshots are created periodically by the scheduler (according to
+the <code>-dlog_snapshot_interval</code> flag). An administrator can also force a snapshot operation with
+<code>aurora_admin snapshot</code>.</p>
+
+<h3 id="disabling-deduplication">Disabling Deduplication</h3>
+
+<p>To disable deduplication, for example to roll back to an earlier version of Aurora, restart all of the cluster&rsquo;s
+schedulers with <code>-deduplicate_snapshots=false</code> and either wait for a snapshot or force one
+using <code>aurora_admin snapshot</code>.</p>
+
+	  </div>
+      <div class="container">
+    <hr>
+    <footer class="footer">
+        <div class="row-fluid">
+            <div class="span2 text-left">
+                <h3>Links</h3>
+                <ul class="unstyled">
+                    <li><a href="/downloads/">Downloads</a></li>
+                    <li><a href="/developers/">Developers</a></li>                    
+                </ul>
+            </div>
+            <div class="span3 text-left">
+                <h3>Community</h3>
+                <ul class="unstyled">
+                    <li><a href="/community/">Mailing Lists</a></li>
+                    <li><a href="http://issues.apache.org/jira/browse/aurora">Issue Tracking</a></li>
+                    <li><a href="/docs/howtocontribute/">How To Contribute</a></li>
+                </ul>
+            </div>
+            <div class="span7 text-left">
+            	<h3>Apache Software Foundation</h3>
+
+							<div class="span8">
+                Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>. Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>. Apache, Apache Thrift, and the Apache feather logo are trademarks of The Apache Software Foundation. Currently part of the <a href="http://incubator.apache.org">Apache Incubator</a>.
+							</div>
+							<div class=" pull-right">
+								<a href="http://incubator.apache.org" class="logo"><img src="/assets/img/apache_incubator_logo.png" alt="Apache Incubator" class="pull-right"/></a>
+							</div>
+            </div>
+
+        </div>
+
+    </footer>
+</div>
+
+	</body>
+</html>
+

Modified: incubator/aurora/site/source/documentation/latest/committers.md
URL: http://svn.apache.org/viewvc/incubator/aurora/site/source/documentation/latest/committers.md?rev=1632428&r1=1632427&r2=1632428&view=diff
==============================================================================
--- incubator/aurora/site/source/documentation/latest/committers.md (original)
+++ incubator/aurora/site/source/documentation/latest/committers.md Thu Oct 16 20:00:27 2014
@@ -13,6 +13,36 @@ The recommended setup is to configure al
 emails to your @apache.org email address.
 
 
+Creating a gpg key for releases
+-------------------------------
+In order to create a release candidate you will need a gpg key published to an external key server
+and that key will need to be added to our KEYS file as well.
+
+1. Create a key:
+
+               gpg --gen-key
+
+2. Add your gpg key to the Apache Aurora KEYS file:
+
+               git clone https://git-wip-us.apache.org/repos/asf/incubator-aurora.git
+               (gpg --list-sigs <KEY ID> && gpg --armor --export <KEY ID>) >> KEYS
+               git add KEYS && git commit -m "Adding gpg key for <APACHE ID>"
+               ./rbt post -o -g
+
+3. Publish the key to an external key server:
+
+               gpg --keyserver pgp.mit.edu --send-keys <KEY ID>
+
+4. Upload the updated KEYS file to the Apache Aurora svn dist locations listed below:
+
+               https://dist.apache.org/repos/dist/dev/incubator/aurora/KEYS
+               https://dist.apache.org/repos/dist/release/incubator/aurora/KEYS
+
+5. Add your key to git config for use with the release scripts:
+
+               git config --global user.signingkey <KEY ID>
+
+
 Creating a release
 ------------------
 The following will guide you through the steps to create a release candidate, vote, and finally an

Modified: incubator/aurora/site/source/documentation/latest/deploying-aurora-scheduler.md
URL: http://svn.apache.org/viewvc/incubator/aurora/site/source/documentation/latest/deploying-aurora-scheduler.md?rev=1632428&r1=1632427&r2=1632428&view=diff
==============================================================================
--- incubator/aurora/site/source/documentation/latest/deploying-aurora-scheduler.md (original)
+++ incubator/aurora/site/source/documentation/latest/deploying-aurora-scheduler.md Thu Oct 16 20:00:27 2014
@@ -1,14 +1,12 @@
 The Aurora scheduler is responsible for scheduling new jobs, rescheduling failed jobs, and killing
 old jobs.
 
-Installing Aurora
-=================
+# Installing Aurora
 Aurora is a standalone Java server. As part of the build process it creates a bundle of all its
 dependencies, with the notable exceptions of the JVM and libmesos. Each target server should have
 a JVM (Java 7 or higher) and libmesos (0.18.0) installed.
 
-Creating the Distribution .zip File (Optional)
-----------------------------------------------
+## Creating the Distribution .zip File (Optional)
 To create a distribution for installation you will need build tools installed. On Ubuntu this can be
 done with `sudo apt-get install build-essential default-jdk`.
 
@@ -18,19 +16,16 @@ done with `sudo apt-get install build-es
 
 Copy the generated `dist/distributions/aurora-scheduler-*.zip` to each node that will run a scheduler.
 
-Installing Aurora
------------------
+## Installing Aurora
 Extract the aurora-scheduler zip file. The example configurations assume it is extracted to
 `/usr/local/aurora-scheduler`.
 
     sudo unzip dist/distributions/aurora-scheduler-*.zip -d /usr/local
     sudo ln -nfs "$(ls -dt /usr/local/aurora-scheduler-* | head -1)" /usr/local/aurora-scheduler
 
-Configuring Aurora
-==================
+# Configuring Aurora
 
-A Note on Configuration
------------------------
+## A Note on Configuration
 Like Mesos, Aurora uses command-line flags for runtime configuration. As such the Aurora
 "configuration file" is typically a `scheduler.sh` shell script of the form.
 
@@ -64,8 +59,7 @@ documentation run
 
     /usr/local/aurora-scheduler/bin/aurora-scheduler -help
 
-Replicated Log Configuration
-----------------------------
+## Replicated Log Configuration
 All Aurora state is persisted to a replicated log. This includes all jobs Aurora is running
 including where in the cluster they are being run and the configuration for running them, as
 well as other information such as metadata needed to reconnect to the Mesos master, resource
@@ -89,8 +83,20 @@ should be set to `2`, and in a cluster o
 
 *Incorrectly setting this flag will cause data corruption to occur!*
 
-Network considerations
-----------------------
+## Initializing the Replicated Log
+Before you start Aurora you will also need to initialize the log on the first master.
+
+    mesos-log initialize --path="$AURORA_HOME/scheduler/db"
+
+Failing to do this will result in the following message when you try to start the scheduler.
+
+    Replica in EMPTY status received a broadcasted recover request
+
+## Storage Performance Considerations
+
+See [this document](/documentation/latest/scheduler-storage/) for scheduler storage performance considerations.
+
+## Network considerations
 The Aurora scheduler listens on 2 ports - an HTTP port used for client RPCs and a web UI,
 and a libprocess (HTTP+Protobuf) port used to communicate with the Mesos master and for the log
 replication protocol. These can be left unconfigured (the scheduler publishes all selected ports
@@ -106,8 +112,7 @@ to ZooKeeper) or explicitly set in the s
     export LIBPROCESS_PORT=8083
     # ...
 
-Running Aurora
-==============
+# Running Aurora
 Configure a supervisor like [Monit](http://mmonit.com/monit/) or
 [supervisord](http://supervisord.org/) to run the created `scheduler.sh` file and restart it
 whenever it fails. Aurora expects to be restarted by an external process when it fails. Aurora
@@ -121,11 +126,48 @@ For example, monit can be configured wit
 
 assuming you set `-http_port=8081`.
 
-Maintaining an Aurora Installation
-==================================
+# Maintaining an Aurora Installation
+
+## Monitoring
+Please see our dedicated [monitoring guide](/documentation/latest/monitoring/) for in-depth discussion on monitoring.
+
+## Running stateful services
+Aurora is best suited to run stateless applications, but it also accommodates stateful services
+like databases, or services that otherwise need to always run on the same machines.
+
+### Dedicated attribute
+The Mesos slave has the `--attributes` command line argument which can be used to mark a slave with
+static attributes (not to be confused with `--resources`, which are dynamic and accounted).
+
+Aurora makes these attributes available for matching with scheduling
+[constraints](configuration-reference.md#specifying-scheduling-constraints).  Most of these
+constraints are arbitrary and available for custom use.  There is one exception, though: the
+`dedicated` attribute.  Aurora treats this specially: only jobs with a matching constraint may run
+on these machines, and those jobs will be scheduled only on these machines.
+
+#### Syntax
+The dedicated attribute has semantic meaning. The format is `$role(/.*)?`. When a job is created,
+the scheduler requires that the `$role` component matches the `role` field in the job
+configuration, and will reject the job creation otherwise.  The remainder of the attribute is
+free-form. We've developed the idiom of formatting this attribute as `$role/$job`, but do not
+enforce this.
+
+#### Example
+Consider the following slave command line:
+
+    mesos-slave --attributes="host:$HOST;rack:$RACK;dedicated:db_team/redis" ...
+
+And this job configuration:
+
+    Service(
+      name = 'redis',
+      role = 'db_team',
+      constraints = {
+        'dedicated': 'db_team/redis'
+      },
+      ...
+    )
 
-Monitoring
-----------
-Aurora exports performance metrics via its HTTP interface `/vars` and `/vars.json` contain lots of
-useful data to help debug performance and configuration problems. These are all made available via
-[twitter.common.http](https://github.com/twitter/commons/tree/master/src/java/com/twitter/commons/http).
+The job configuration is indicating that it should only be scheduled on slaves with the attribute
+`dedicated:db_team/redis`.  Additionally, Aurora will prevent any tasks that do _not_ have that
+constraint from running on those slaves.

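Note on the dedicated attribute syntax in deploying-aurora-scheduler.md above: the `$role(/.*)?` rule can be illustrated with a few lines of Python. This is only a sketch under stated assumptions. `dedicated_matches_role` is a hypothetical helper name, not part of Aurora, and the scheduler's real validation may differ in detail.

```python
def dedicated_matches_role(dedicated_value, role):
    """Return True if a `dedicated` value has the form `$role(/.*)?`.

    Hypothetical helper for illustration only; not Aurora's implementation.
    """
    # Everything before the first '/' is the role component;
    # whatever follows is free-form and is not inspected.
    role_component, _, _ = dedicated_value.partition("/")
    return role_component == role
```

With the example from the document, `dedicated_matches_role("db_team/redis", "db_team")` holds, while a job whose role is `web_team` would not match the same attribute value.
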
Added: incubator/aurora/site/source/documentation/latest/monitoring.md
URL: http://svn.apache.org/viewvc/incubator/aurora/site/source/documentation/latest/monitoring.md?rev=1632428&view=auto
==============================================================================
--- incubator/aurora/site/source/documentation/latest/monitoring.md (added)
+++ incubator/aurora/site/source/documentation/latest/monitoring.md Thu Oct 16 20:00:27 2014
@@ -0,0 +1,206 @@
+# Monitoring your Aurora cluster
+
+Before you start running important services in your Aurora cluster, it's important to set up
+monitoring and alerting of Aurora itself.  Most of your monitoring can be against the scheduler,
+since it will give you a global view of what's going on.
+
+## Reading stats
+The scheduler exposes a *lot* of instrumentation data via its HTTP interface. You can get a quick
+peek at the first few of these in our vagrant image:
+
+    $ vagrant ssh -c 'curl -s localhost:8081/vars | head'
+    async_tasks_completed 1004
+    attribute_store_fetch_all_events 15
+    attribute_store_fetch_all_events_per_sec 0.0
+    attribute_store_fetch_all_nanos_per_event 0.0
+    attribute_store_fetch_all_nanos_total 3048285
+    attribute_store_fetch_all_nanos_total_per_sec 0.0
+    attribute_store_fetch_one_events 3391
+    attribute_store_fetch_one_events_per_sec 0.0
+    attribute_store_fetch_one_nanos_per_event 0.0
+    attribute_store_fetch_one_nanos_total 454690753
+
+These values are served as `Content-Type: text/plain`, with each line containing a space-separated metric
+name and value. Values may be integers, doubles, or strings (note: strings are static, others
+may be dynamic).
+
+If your monitoring infrastructure prefers JSON, the scheduler exports that as well:
+
+    $ vagrant ssh -c 'curl -s localhost:8081/vars.json | python -mjson.tool | head'
+    {
+        "async_tasks_completed": 1009,
+        "attribute_store_fetch_all_events": 15,
+        "attribute_store_fetch_all_events_per_sec": 0.0,
+        "attribute_store_fetch_all_nanos_per_event": 0.0,
+        "attribute_store_fetch_all_nanos_total": 3048285,
+        "attribute_store_fetch_all_nanos_total_per_sec": 0.0,
+        "attribute_store_fetch_one_events": 3409,
+        "attribute_store_fetch_one_events_per_sec": 0.0,
+        "attribute_store_fetch_one_nanos_per_event": 0.0,
+
+This will be the same data as above, served with `Content-Type: application/json`.
+
+## Viewing live stat samples on the scheduler
+The scheduler uses the Twitter commons stats library, which keeps an internal time-series database
+of exported variables - nearly everything in `/vars` is available for instant graphing.  This is
+useful for debugging, but is not a replacement for an external monitoring system.
+
+You can view these graphs on a scheduler at `/graphview`.  It supports some composition and
+aggregation of values, which can be invaluable when triaging a problem.  For example, if you have
+the scheduler running in vagrant, check out these links:
+[simple graph](http://192.168.33.7:8081/graphview?query=jvm_uptime_secs)
+[complex composition](http://192.168.33.7:8081/graphview?query=rate\(scheduler_log_native_append_nanos_total\)%2Frate\(scheduler_log_native_append_events\)%2F1e6)
+
+### Counters and gauges
+Among numeric stats, there are two fundamental types of stats exported: _counters_ and _gauges_.
+Counters are guaranteed to be monotonically-increasing for the lifetime of a process, while gauges
+may decrease in value.  Aurora uses counters to represent things like the number of times an event
+has occurred, and gauges to capture things like the current length of a queue.  Counters are a
+natural fit for accurate composition into [rate ratios](http://en.wikipedia.org/wiki/Rate_ratio)
+(useful for sample-resistant latency calculation), while gauges are not.
+
+# Alerting
+
+## Quickstart
+If you are looking for just bare-minimum alerting to get something in place quickly, set up alerting
+on `framework_registered` and `task_store_LOST`. These will give you a decent picture of overall
+health.
+
+## A note on thresholds
+One of the most difficult things in monitoring is choosing alert thresholds. With many of these
+stats, there is no value we can offer as a threshold that will be guaranteed to work for you. It
+will depend on the size of your cluster, number of jobs, churn of tasks in the cluster, etc. We
+recommend you start with a strict value after viewing a small amount of collected data, and then
+adjust thresholds as you see fit. Feel free to ask us if you would like to validate that your alerts
+and thresholds make sense.
+
+#### `jvm_uptime_secs`
+Type: integer counter
+
+#### Description
+The number of seconds the JVM process has been running. Comes from
+[RuntimeMXBean#getUptime()](http://docs.oracle.com/javase/7/docs/api/java/lang/management/RuntimeMXBean.html#getUptime\(\)).
+
+#### Alerting
+Detecting resets (decreasing values) on this stat will tell you that the scheduler is failing to
+stay alive.
+
+#### Triage
+Look at the scheduler logs to identify the reason the scheduler is exiting.
+
+#### `system_load_avg`
+Type: double gauge
+
+#### Description
+The current load average of the system for the last minute. Comes from
+[OperatingSystemMXBean#getSystemLoadAverage()](http://docs.oracle.com/javase/7/docs/api/java/lang/management/OperatingSystemMXBean.html?is-external=true#getSystemLoadAverage\(\)).
+
+#### Alerting
+A high sustained value suggests that the scheduler machine may be over-utilized.
+
+#### Triage
+Use standard unix tools like `top` and `ps` to track down the offending process(es).
+
+#### `process_cpu_cores_utilized`
+Type: double gauge
+
+#### Description
+The current number of CPU cores in use by the JVM process. This should not exceed the number of
+logical CPU cores on the machine. Derived from
+[OperatingSystemMXBean#getProcessCpuTime()](http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html).
+
+#### Alerting
+A high sustained value indicates that the scheduler is overworked. Due to current internal design
+limitations, if this value is sustained at `1`, there is a good chance the scheduler is under water.
+
+#### Triage
+There are two main inputs that tend to drive this figure: task scheduling attempts and status
+updates from Mesos.  You may see activity in the scheduler logs to give an indication of where
+time is being spent.  Beyond that, it really takes good familiarity with the code to effectively
+triage this.  We suggest engaging with an Aurora developer.
+
+#### `task_store_LOST`
+Type: integer gauge
+
+#### Description
+The number of tasks stored in the scheduler that are in the `LOST` state, and have been rescheduled.
+
+#### Alerting
+If this value is increasing at a high rate, it is a sign of trouble.
+
+#### Triage
+There are many sources of `LOST` tasks in Mesos: the scheduler, master, slave, and executor can all
+trigger this.  The first step is to look in the scheduler logs for `LOST` to identify where the
+state changes are originating.
+
+#### `scheduler_resource_offers`
+Type: integer counter
+
+#### Description
+The number of resource offers that the scheduler has received.
+
+#### Alerting
+For a healthy scheduler, this value must be increasing over time.
+
+#### Triage
+Assuming the scheduler is up and otherwise healthy, you will want to check if the master thinks it
+is sending offers. You should also look at the master's web interface to see if it has a large
+number of outstanding offers that it is waiting to have returned.
+
+#### `framework_registered`
+Type: binary integer counter
+
+#### Description
+Will be `1` for the leading scheduler that is registered with the Mesos master, `0` for passive
+schedulers.
+
+#### Alerting
+A sustained period without a `1` (or where the sum across all schedulers is not `1`) warrants investigation.
+
+#### Triage
+If there is no leading scheduler, look in the scheduler and master logs for why.  If there are
+multiple schedulers claiming leadership, this suggests a split brain and warrants filing a critical
+bug.
+
+#### `rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)`
+Type: rate ratio of integer counters
+
+#### Description
+This composes two counters to compute a windowed figure for the latency of replicated log writes.
+
+#### Alerting
+A hike in this value suggests disk bandwidth contention.
+
+#### Triage
+Look in scheduler logs for any reported oddness with saving to the replicated log. Also use
+standard tools like `vmstat` and `iotop` to identify whether the disk has become slow or
+over-utilized. We suggest using a dedicated disk for the replicated log to mitigate this.
+
+#### `timed_out_tasks`
+Type: integer counter
+
+#### Description
+Tracks the number of times the scheduler has given up while waiting
+(for `-transient_task_state_timeout`) to hear back about a task that is in a transient state
+(e.g. `ASSIGNED`, `KILLING`), and has moved to `LOST` before rescheduling.
+
+#### Alerting
+This value is currently known to increase occasionally when the scheduler fails over
+([AURORA-740](https://issues.apache.org/jira/browse/AURORA-740)). However, any large spike in this
+value warrants investigation.
+
+#### Triage
+The scheduler will log when it times out a task. You should trace the task ID of the timed out
+task into the master, slave, and/or executors to determine where the message was dropped.
+
+#### `http_500_responses_events`
+Type: integer counter
+
+#### Description
+The total number of HTTP 500 status responses sent by the scheduler. Includes API and asset serving.
+
+#### Alerting
+An increase warrants investigation.
+
+#### Triage
+Look in scheduler logs to identify why the scheduler returned a 500; there should be a stack trace.

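Note on monitoring.md above: the plain-text `/vars` format, the rate ratio of two counters, and reset detection on a counter can be sketched in Python 3 as follows. This is a minimal illustration, not Aurora or Twitter commons code. The function names are hypothetical, and the two samples are assumed to have been fetched from `/vars` at two different times by whatever HTTP client you already use.

```python
def parse_vars(text):
    """Parse the plain-text /vars format: one 'name value' pair per line."""
    stats = {}
    for line in text.splitlines():
        name, _, raw = line.strip().partition(" ")
        if not name:
            continue  # skip blank lines
        try:
            stats[name] = int(raw)
        except ValueError:
            try:
                stats[name] = float(raw)
            except ValueError:
                stats[name] = raw  # string-valued stat
    return stats


def rate_ratio(earlier, later, numerator, denominator):
    """Ratio of the increases of two counters between two samples.

    Both counters are sampled at the same two instants, so the elapsed
    time cancels out. Returns None if the denominator did not advance.
    """
    delta_den = later[denominator] - earlier[denominator]
    if delta_den <= 0:
        return None
    return (later[numerator] - earlier[numerator]) / delta_den


def counter_reset(earlier, later, name):
    """A counter that decreased between samples implies a process restart."""
    return later[name] < earlier[name]
```

For the replicated log latency stat, `rate_ratio(a, b, "scheduler_log_native_append_nanos_total", "scheduler_log_native_append_events")` gives nanoseconds per append over the window between samples `a` and `b`; dividing by `1e6` converts to milliseconds, as in the graphview link. `counter_reset(a, b, "jvm_uptime_secs")` corresponds to the reset check described for that stat.
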
Added: incubator/aurora/site/source/documentation/latest/scheduler-storage.md
URL: http://svn.apache.org/viewvc/incubator/aurora/site/source/documentation/latest/scheduler-storage.md?rev=1632428&view=auto
==============================================================================
--- incubator/aurora/site/source/documentation/latest/scheduler-storage.md (added)
+++ incubator/aurora/site/source/documentation/latest/scheduler-storage.md Thu Oct 16 20:00:27 2014
@@ -0,0 +1,47 @@
+# Snapshot Performance
+
+Periodically the scheduler writes a full snapshot of its state to the replicated log. To do this
+it needs to hold a global storage write lock while it writes out this data. In large clusters
+this has been observed to take up to 40 seconds. Long pauses can cause issues in the system,
+including delays in scheduling new tasks.
+
+The scheduler has two optimizations to reduce the size of snapshots and thus improve snapshot
+performance: compression and deduplication. Most users will want to enable both compression
+and deduplication.
+
+## Compression
+
+To reduce the size of the snapshot the DEFLATE algorithm can be applied to the serialized bytes
+of the snapshot as they are written to the stream. This reduces the total number of bytes that
+need to be written to the replicated log at the cost of CPU and generally reduces the amount
+of time a snapshot takes.
+
+### Enabling Compression
+
+Snapshot compression is enabled via the `-deflate_snapshots` flag. This is the default since
+Aurora 0.5.0. All released versions of Aurora can read both compressed and uncompressed snapshots,
+so there are no backwards compatibility concerns associated with changing this flag.
+
+### Disabling Compression
+
+Disable compression by passing `-deflate_snapshots=false`.
+
+## Deduplication
+
+In Aurora 0.6.0 a new snapshot format was introduced. Rather than write one configuration blob
+per Mesos task this format stores each configuration blob once, and each Mesos task with a
+pointer to its blob. This format is not backwards compatible with earlier versions of Aurora.
+
+### Enabling Deduplication
+
+After upgrading Aurora to 0.6.0, enable deduplication with the `-deduplicate_snapshots` flag.
+After the first snapshot the cluster will be using the deduplicated format to write to the
+replicated log. Snapshots are created periodically by the scheduler (according to
+the `-dlog_snapshot_interval` flag). An administrator can also force a snapshot operation with
+`aurora_admin snapshot`.
+
+### Disabling Deduplication
+
+To disable deduplication, for example to roll back to an earlier Aurora version, restart all of the cluster's
+schedulers with `-deduplicate_snapshots=false` and either wait for a snapshot or force one
+using `aurora_admin snapshot`.

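Note on the Deduplication section of scheduler-storage.md above: the idea of storing each configuration blob once and giving each task a pointer to its blob can be sketched in Python. This shows the general technique only. The layout, the function names, and the use of list indexes as pointers are assumptions for illustration and do not reflect Aurora's actual snapshot schema.

```python
def deduplicate(tasks):
    """Store each distinct config blob once; tasks keep an index into the blob list.

    `tasks` is a list of (task_id, config_blob) pairs.
    Illustrative layout only; not Aurora's snapshot format.
    """
    blobs = []       # each distinct blob, stored once
    index_of = {}    # blob -> position in `blobs`
    task_refs = []   # (task_id, blob_index) pairs
    for task_id, blob in tasks:
        if blob not in index_of:
            index_of[blob] = len(blobs)
            blobs.append(blob)
        task_refs.append((task_id, index_of[blob]))
    return blobs, task_refs


def reduplicate(blobs, task_refs):
    """Rebuild the original (task_id, config_blob) pairs."""
    return [(task_id, blobs[i]) for task_id, i in task_refs]
```

When many tasks belong to the same job and share one configuration, the blob list stays small while the per-task entries shrink to an identifier plus an index, which is where the snapshot size reduction comes from.
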

