aurora-commits mailing list archives

From dles...@apache.org
Subject svn commit: r1634176 - in /incubator/aurora/site: publish/documentation/latest/deploying-aurora-scheduler/ publish/documentation/latest/storage-config/ publish/documentation/latest/storage/ source/documentation/latest/
Date Sat, 25 Oct 2014 03:36:59 GMT
Author: dlester
Date: Sat Oct 25 03:36:58 2014
New Revision: 1634176

URL: http://svn.apache.org/r1634176
Log:
Updates Aurora docs to be in sync with git.

Added:
    incubator/aurora/site/publish/documentation/latest/storage-config/
    incubator/aurora/site/publish/documentation/latest/storage-config/index.html
    incubator/aurora/site/source/documentation/latest/storage-config.md
Modified:
    incubator/aurora/site/publish/documentation/latest/deploying-aurora-scheduler/index.html
    incubator/aurora/site/publish/documentation/latest/storage/index.html
    incubator/aurora/site/source/documentation/latest/deploying-aurora-scheduler.md
    incubator/aurora/site/source/documentation/latest/storage.md

Modified: incubator/aurora/site/publish/documentation/latest/deploying-aurora-scheduler/index.html
URL: http://svn.apache.org/viewvc/incubator/aurora/site/publish/documentation/latest/deploying-aurora-scheduler/index.html?rev=1634176&r1=1634175&r2=1634176&view=diff
==============================================================================
--- incubator/aurora/site/publish/documentation/latest/deploying-aurora-scheduler/index.html
(original)
+++ incubator/aurora/site/publish/documentation/latest/deploying-aurora-scheduler/index.html
Sat Oct 25 03:36:58 2014
@@ -118,7 +118,7 @@ machines.  This guide helps you get the 
 
 <p>The Aurora scheduler is a standalone Java server. As part of the build process it
creates a bundle
 of all its dependencies, with the notable exceptions of the JVM and libmesos. Each target
server
-should have a JVM (Java 7 or higher) and libmesos (0.20.0) installed.</p>
+should have a JVM (Java 7 or higher) and libmesos (0.20.1) installed.</p>
 
 <h3 id="creating-the-distribution-.zip-file-(optional)">Creating the Distribution .zip
File (Optional)</h3>
 
@@ -214,7 +214,10 @@ should be set to <code>2</code>, and in 
 
 <p><em>Incorrectly setting this flag will cause data corruption to occur!</em></p>
 
-<h3 id="initializing-the-replicated-log">Initializing the Replicated Log</h3>
+<p>See <a href="storage-config.md#scheduler-storage-configuration-flags">this
document</a> for more replicated
+log and storage configuration options.</p>
+
+<h2 id="initializing-the-replicated-log">Initializing the Replicated Log</h2>
 
 <p>Before you start Aurora you will also need to initialize the log on the first master.</p>
 <pre class="highlight text">mesos-log initialize --path=&quot;$AURORA_HOME/scheduler/db&quot;

Added: incubator/aurora/site/publish/documentation/latest/storage-config/index.html
URL: http://svn.apache.org/viewvc/incubator/aurora/site/publish/documentation/latest/storage-config/index.html?rev=1634176&view=auto
==============================================================================
--- incubator/aurora/site/publish/documentation/latest/storage-config/index.html (added)
+++ incubator/aurora/site/publish/documentation/latest/storage-config/index.html Sat Oct 25
03:36:58 2014
@@ -0,0 +1,271 @@
+<html>
+    <head>
+        <meta charset="utf-8">
+        <title>Apache Aurora</title>
+		    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+		    <meta name="description" content="">
+		    <meta name="author" content="">
+
+		    <link href="/assets/css/bootstrap.css" rel="stylesheet">
+		    <link href="/assets/css/bootstrap-responsive.min.css" rel="stylesheet">
+		    <link href="/assets/css/main.css" rel="stylesheet">
+				
+		    <!-- JS -->
+		    <script type="text/javascript" src="/assets/js/jquery-1.10.1.min.js"></script>
+		    <script type="text/javascript" src="/assets/js/bootstrap-dropdown.js"></script>
+		
+				<!-- Analytics -->
+				<script type="text/javascript">
+					  var _gaq = _gaq || [];
+					  _gaq.push(['_setAccount', 'UA-45879646-1']);
+					  _gaq.push(['_setDomainName', 'apache.org']);
+					  _gaq.push(['_trackPageview']);
+
+					  (function() {
+					    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async
= true;
+					    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www')
+ '.google-analytics.com/ga.js';
+					    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga,
s);
+					  })();
+				</script>
+	</head>
+    <body>	
+      <div class="navbar navbar-static-top">
+  <div class="navbar-inner">
+    <div class="container">
+	    <a href="/" class="logo"><img src="/assets/img/aurora_logo.png" alt="Apache
Aurora logo" /></a>
+      <ul class="nav">
+				<li><a href="/documentation/latest/">Documentation</a></li>
+        <li><a href="/downloads/">Download</a></li>
+        <li><a href="/community">Community</a></li>
+      </ul>
+    </div>
+  </div>
+</div>
+
+<div class="container">
+<!-- magical breadcrumbs -->
+<ul class="breadcrumb">
+  <li>
+    <div class="dropdown">
+      <a class="dropdown-toggle" data-toggle="dropdown" href="#">Apache Software Foundation
<b class="caret"></b></a>
+      <ul class="dropdown-menu" role="menu">
+        <li><a href="http://www.apache.org">Apache Homepage</a></li>
+        <li><a href="http://www.apache.org/licenses/">Apache License</a></li>
+        <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
 
+        <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+        <li><a href="http://www.apache.org/security/">Security</a></li>
+      </ul>
+    </div>
+  </li>
+  <li><span class="divider">&bull;</span></li>
+  <li><a href="http://incubator.apache.org">Apache Incubator</a></li>
+  <li><span class="divider">&bull;</span></li>
+  <li><a href="http://aurora.incubator.apache.org">Apache Aurora</a></li>
+</ul>
+<!-- /breadcrumb -->
+	
+      <div class="container">
+        <h1 id="storage-configuration-and-maintenance">Storage Configuration And Maintenance</h1>
+
+<ul>
+<li><a href="#overview">Overview</a></li>
+<li><a href="#scheduler-storage-configuration-flags">Scheduler storage configuration
flags</a>
+
+<ul>
+<li><a href="#mesos-replicated-log-configuration-flags">Mesos replicated log
configuration flags</a></li>
+<li><a href="#-native_log_quorum_size">-native_log_quorum_size</a></li>
+<li><a href="#-native_log_file_path">-native_log_file_path</a></li>
+<li><a href="#-native_log_zk_group_path">-native_log_zk_group_path</a></li>
+<li><a href="#backup-configuration-flags">Backup configuration flags</a></li>
+<li><a href="#-backup_interval">-backup_interval</a></li>
+<li><a href="#-backup_dir">-backup_dir</a></li>
+<li><a href="#-max_saved_backups">-max_saved_backups</a></li>
+</ul></li>
+<li><a href="#recovering-from-a-scheduler-backup">Recovering from a scheduler
backup</a>
+
+<ul>
+<li><a href="#summary">Summary</a></li>
+<li><a href="#preparation">Preparation</a></li>
+<li><a href="#cleanup-and-re-initialize-mesos-replicated-log">Cleanup and re-initialize
Mesos replicated log</a></li>
+<li><a href="#restore-from-backup">Restore from backup</a></li>
+<li><a href="#cleanup">Cleanup</a></li>
+</ul></li>
+</ul>
+
+<h2 id="overview">Overview</h2>
+
+<p>This document summarizes Aurora storage configuration and maintenance details and
is
+intended for use by anyone deploying and/or maintaining Aurora.</p>
+
+<p>For a high level overview of the Aurora storage architecture refer to <a href="/documentation/latest/storage/">this
document</a>.</p>
+
+<h2 id="scheduler-storage-configuration-flags">Scheduler storage configuration flags</h2>
+
+<p>Below is a summary of scheduler storage configuration flags that either don&rsquo;t
have default values
+or require attention before deploying in a production environment.</p>
+
+<h3 id="mesos-replicated-log-configuration-flags">Mesos replicated log configuration
flags</h3>
+
+<h4 id="-nativelogquorum_size">-native_log_quorum_size</h4>
+
+<p>Defines the Mesos replicated log quorum size. See
+<a href="deploying-aurora-scheduler.md#replicated-log-configuration">the replicated
log configuration document</a>
+on how to choose the right value.</p>
+
+<h4 id="-nativelogfile_path">-native_log_file_path</h4>
+
+<p>Location of the Mesos replicated log files. Consider allocating a dedicated disk
(preferably SSD)
+for Mesos replicated log files to ensure optimal storage performance.</p>
+
+<h4 id="-nativelogzkgrouppath">-native_log_zk_group_path</h4>
+
+<p>ZooKeeper path used for Mesos replicated log quorum discovery.</p>
+
+<p>See <a href="../src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java">code</a>
for
+other available Mesos replicated log configuration options and default values.</p>
+
+<h3 id="backup-configuration-flags">Backup configuration flags</h3>
+
+<p>Configuration options for the Aurora scheduler backup manager.</p>
+
+<h4 id="-backup_interval">-backup_interval</h4>
+
+<p>The interval on which the scheduler writes local storage backups.  The default is
every hour.</p>
+
+<h4 id="-backup_dir">-backup_dir</h4>
+
+<p>Directory to write backups to.</p>
+
+<h4 id="-maxsavedbackups">-max_saved_backups</h4>
+
+<p>Maximum number of backups to retain before deleting the oldest backup(s).</p>
+
+<h2 id="recovering-from-a-scheduler-backup">Recovering from a scheduler backup</h2>
+
+<ul>
+<li><a href="#summary">Summary</a></li>
+<li><a href="#preparation">Preparation</a></li>
+<li><a href="#cleanup-and-re-initialize-mesos-replicated-log">Cleanup and re-initialize Mesos replicated log</a></li>
+<li><a href="#restore-from-backup">Restore from backup</a></li>
+<li><a href="#cleanup">Cleanup</a></li>
+</ul>
+
+<p><strong>Be sure to read the entire page before attempting to restore from
a backup, as it may have
+unintended consequences.</strong></p>
+
+<h3 id="summary">Summary</h3>
+
+<p>The restoration procedure replaces the existing (possibly corrupted) Mesos replicated
log with an
+earlier, backed up, version and requires all schedulers to be taken down temporarily while
+restoring. Once completed, the scheduler state resets to what it was when the backup was
created.
+This means any jobs/tasks created or updated after the backup are unknown to the scheduler
and will
+be killed shortly after the cluster restarts. All other tasks continue operating as normal.</p>
+
+<p>Usually, it is a bad idea to restore a backup that is not extremely recent (i.e.
older than a few
+hours). This is because the scheduler will expect the cluster to look exactly as the backup
does,
+so any tasks that have been rescheduled since the backup was taken will be killed.</p>
+
+<h3 id="preparation">Preparation</h3>
+
+<p>Follow these steps to prepare the cluster for restoring from a backup:</p>
+
+<ul>
+<li><p>Stop all scheduler instances</p></li>
+<li><p>Consider blocking external traffic on a port defined in <code>-http_port</code>
for all schedulers to
+prevent users from interacting with the scheduler during the restoration process. This will
help
+troubleshooting by reducing the scheduler log noise and prevent users from making changes
that will
+be erased after the backup snapshot is restored</p></li>
+<li><p>The next steps put the scheduler into a partially disabled state in which it can still
+accept storage recovery requests but cannot schedule tasks or change task states. This may be
+accomplished by updating the following scheduler configuration options:</p>
+
+<ul>
+<li>Set <code>-mesos_master_address</code> to a non-existent zk address.
This will prevent scheduler from
+registering with Mesos. E.g.: <code>-mesos_master_address=zk://localhost:2181</code></li>
+<li>Set <code>-max_registration_delay</code> to a sufficiently long interval to prevent a
+registration timeout and the resulting scheduler suicide. E.g.: <code>-max_registration_delay=360min</code></li>
+<li>Make sure <code>-gc_executor_path</code> option is not set to prevent
accidental task GC. This is
+important as scheduler will attempt to reconcile the cluster state and will kill all tasks
when
+restarted with an empty Mesos replicated log.</li>
+</ul></li>
+<li><p>Restart all schedulers</p></li>
+</ul>
+
+<h3 id="cleanup-and-re-initialize-mesos-replicated-log">Cleanup and re-initialize Mesos
replicated log</h3>
+
+<p>Get rid of the corrupted files and re-initialize the Mesos replicated log:</p>
+
+<ul>
+<li>Stop schedulers</li>
+<li>Delete all files under <code>-native_log_file_path</code> on all schedulers</li>
+<li>Initialize Mesos replica&rsquo;s log file: <code>mesos-log initialize
&lt;-native_log_file_path&gt;</code></li>
+<li>Restart schedulers</li>
+</ul>
+
+<h3 id="restore-from-backup">Restore from backup</h3>
+
+<p>At this point the scheduler is ready to rehydrate from the backup:</p>
+
+<ul>
+<li><p>Identify the leading scheduler by:</p>
+
+<ul>
+<li>running <code>aurora_admin get_scheduler &lt;cluster&gt;</code>
- if scheduler is responsive</li>
+<li>examining scheduler logs</li>
+<li>or examining Zookeeper registration under the path defined by <code>-zk_endpoints</code>
+and <code>-serverset_path</code></li>
+</ul></li>
+<li><p>Locate the desired backup file, copy it to the leading scheduler and stage
recovery by running
+the following command on a leader
+<code>aurora_admin scheduler_stage_recovery &lt;cluster&gt; scheduler-backup-&lt;yyyy-MM-dd-HH-mm&gt;</code></p></li>
+<li><p>At this point, the recovery snapshot is staged and available for manual
verification/modification
+via <code>aurora_admin scheduler_print_recovery_tasks</code> and <code>scheduler_delete_recovery_tasks</code>
commands.
+See <code>aurora_admin help &lt;command&gt;</code> for usage details.</p></li>
+<li><p>Commit recovery. This instructs the scheduler to overwrite the existing Mesos replicated log with
+the provided backup snapshot and initiate a mandatory failover
+<code>aurora_admin scheduler_commit_recovery &lt;cluster&gt;</code></p></li>
+</ul>
+
+<h3 id="cleanup">Cleanup</h3>
+
+<p>Undo any modifications made during the <a href="#preparation">Preparation</a> sequence.</p>
+
+	  </div>
+      <div class="container">
+    <hr>
+    <footer class="footer">
+        <div class="row-fluid">
+            <div class="span2 text-left">
+                <h3>Links</h3>
+                <ul class="unstyled">
+                    <li><a href="/downloads/">Downloads</a></li>
+                    <li><a href="/developers/">Developers</a></li>
                   
+                </ul>
+            </div>
+            <div class="span3 text-left">
+                <h3>Community</h3>
+                <ul class="unstyled">
+                    <li><a href="/community/">Mailing Lists</a></li>
+                    <li><a href="http://issues.apache.org/jira/browse/aurora">Issue
Tracking</a></li>
+                    <li><a href="/docs/howtocontribute/">How To Contribute</a></li>
+                </ul>
+            </div>
+            <div class="span7 text-left">
+            	<h3>Apache Software Foundation</h3>
+
+							<div class="span8">
+                Copyright 2014 <a href="http://www.apache.org/">Apache Software Foundation</a>.
Licensed under the <a href="http://www.apache.org/licenses/">Apache License v2.0</a>.
Apache, Apache Thrift, and the Apache feather logo are trademarks of The Apache Software Foundation.
Currently part of the <a href="http://incubator.apache.org">Apache Incubator</a>.
+							</div>
+							<div class=" pull-right">
+								<a href="http://incubator.apache.org" class="logo"><img src="/assets/img/apache_incubator_logo.png"
alt="Apache Incubator" class="pull-right"/></a>
+							</div>
+            </div>
+
+        </div>
+
+    </footer>
+</div>
+
+	</body>
+</html>
+
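
The `-max_saved_backups` description in the page added above ("retain before deleting the oldest backup(s)") can be illustrated with a small Python sketch. This is not Aurora's backup manager code: the function name `backups_to_delete` is hypothetical, and the sketch assumes backup files are named `scheduler-backup-<yyyy-MM-dd-HH-mm>` as in the recovery command shown in the same page, so that sorting the names as strings orders them chronologically.

```python
def backups_to_delete(names, max_saved_backups):
    """Return the oldest backup names that exceed the retention limit.

    Hypothetical helper for illustration only; the real scheduler may
    apply its retention policy differently.
    """
    if max_saved_backups < 0:
        raise ValueError("max_saved_backups must be non-negative")
    # Zero-padded yyyy-MM-dd-HH-mm timestamps sort chronologically as strings.
    ordered = sorted(names)
    excess = len(ordered) - max_saved_backups
    return ordered[:excess] if excess > 0 else []


if __name__ == "__main__":
    existing = [
        "scheduler-backup-2014-10-25-02-00",
        "scheduler-backup-2014-10-25-00-00",
        "scheduler-backup-2014-10-25-01-00",
    ]
    # With a limit of 2, only the oldest file is a deletion candidate.
    print(backups_to_delete(existing, 2))
```

Returning the candidates rather than deleting them keeps the sketch free of side effects, so it can be checked before anything touches the directory configured in `-backup_dir`.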

Modified: incubator/aurora/site/publish/documentation/latest/storage/index.html
URL: http://svn.apache.org/viewvc/incubator/aurora/site/publish/documentation/latest/storage/index.html?rev=1634176&r1=1634175&r2=1634176&view=diff
==============================================================================
--- incubator/aurora/site/publish/documentation/latest/storage/index.html (original)
+++ incubator/aurora/site/publish/documentation/latest/storage/index.html Sat Oct 25 03:36:58
2014
@@ -69,7 +69,7 @@
 
 <ul>
 <li><a href="#overview">Overview</a></li>
-<li><a href="#reads-writes-modifications">Reads, writes, modifications&hellip;</a>
+<li><a href="#reads-writes-modifications">Reads, writes, modifications</a>
 
 <ul>
 <li><a href="#read-lifecycle">Read lifecycle</a></li>
@@ -108,12 +108,13 @@ is <a href="https://github.com/apache/th
 This helps establishing periodic recovery checkpoints and speeds up volatile storage recovery
on
 restart.</li>
 <li>Backup manager: as a precaution, snapshots are periodically written out into backup
files.
-This solves a disaster recovery problem in case of a complete loss or corruption of Mesos
log files.</li>
+This solves a <a href="storage-config.md#recovering-from-a-scheduler-backup">disaster
recovery problem</a>
+in case of a complete loss or corruption of Mesos log files.</li>
 </ul>
 
 <p><img alt="Storage hierarchy" src="../images/storage_hierarchy.png" /></p>
 
-<h2 id="reads,-writes,-modifications...">Reads, writes, modifications&hellip;</h2>
+<h2 id="reads,-writes,-modifications">Reads, writes, modifications</h2>
 
 <p>All services in Aurora access data via a set of predefined store interfaces (aka
stores) logically
 grouped by the type of data they serve. Every interface defines a specific set of operations
allowed

Modified: incubator/aurora/site/source/documentation/latest/deploying-aurora-scheduler.md
URL: http://svn.apache.org/viewvc/incubator/aurora/site/source/documentation/latest/deploying-aurora-scheduler.md?rev=1634176&r1=1634175&r2=1634176&view=diff
==============================================================================
--- incubator/aurora/site/source/documentation/latest/deploying-aurora-scheduler.md (original)
+++ incubator/aurora/site/source/documentation/latest/deploying-aurora-scheduler.md Sat Oct
25 03:36:58 2014
@@ -33,7 +33,7 @@ machines.  This guide helps you get the 
 ## Installing Aurora
 The Aurora scheduler is a standalone Java server. As part of the build process it creates
a bundle
 of all its dependencies, with the notable exceptions of the JVM and libmesos. Each target
server
-should have a JVM (Java 7 or higher) and libmesos (0.20.0) installed.
+should have a JVM (Java 7 or higher) and libmesos (0.20.1) installed.
 
 ### Creating the Distribution .zip File (Optional)
 To create a distribution for installation you will need build tools installed. On Ubuntu
this can be
@@ -112,7 +112,10 @@ should be set to `2`, and in a cluster o
 
 *Incorrectly setting this flag will cause data corruption to occur!*
 
-### Initializing the Replicated Log
+See [this document](storage-config.md#scheduler-storage-configuration-flags) for more replicated
+log and storage configuration options.
+
+## Initializing the Replicated Log
 Before you start Aurora you will also need to initialize the log on the first master.
 
     mesos-log initialize --path="$AURORA_HOME/scheduler/db"
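
The hunk above warns that an incorrect quorum size corrupts data, but its context is truncated in this message. As a rough Python sketch, assuming the usual majority rule for a replicated log (more than half of the replicas), the value could be derived as follows. The function name `majority_quorum` and the cluster sizes are illustrative assumptions, not taken from the Aurora docs.

```python
def majority_quorum(scheduler_count):
    """Smallest number of replicas that forms a strict majority.

    Assumption: the quorum should be a strict majority of the scheduler
    replicas; consult the deployment guide before choosing a real value.
    """
    if scheduler_count < 1:
        raise ValueError("scheduler_count must be at least 1")
    return scheduler_count // 2 + 1


if __name__ == "__main__":
    for count in (1, 3, 5):
        print(count, "schedulers -> quorum", majority_quorum(count))
```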

Added: incubator/aurora/site/source/documentation/latest/storage-config.md
URL: http://svn.apache.org/viewvc/incubator/aurora/site/source/documentation/latest/storage-config.md?rev=1634176&view=auto
==============================================================================
--- incubator/aurora/site/source/documentation/latest/storage-config.md (added)
+++ incubator/aurora/site/source/documentation/latest/storage-config.md Sat Oct 25 03:36:58
2014
@@ -0,0 +1,142 @@
+# Storage Configuration And Maintenance
+
+- [Overview](#overview)
+- [Scheduler storage configuration flags](#scheduler-storage-configuration-flags)
+  - [Mesos replicated log configuration flags](#mesos-replicated-log-configuration-flags)
+    - [-native_log_quorum_size](#-native_log_quorum_size)
+    - [-native_log_file_path](#-native_log_file_path)
+    - [-native_log_zk_group_path](#-native_log_zk_group_path)
+  - [Backup configuration flags](#backup-configuration-flags)
+    - [-backup_interval](#-backup_interval)
+    - [-backup_dir](#-backup_dir)
+    - [-max_saved_backups](#-max_saved_backups)
+- [Recovering from a scheduler backup](#recovering-from-a-scheduler-backup)
+  - [Summary](#summary)
+  - [Preparation](#preparation)
+  - [Cleanup and re-initialize Mesos replicated log](#cleanup-and-re-initialize-mesos-replicated-log)
+  - [Restore from backup](#restore-from-backup)
+  - [Cleanup](#cleanup)
+
+## Overview
+
+This document summarizes Aurora storage configuration and maintenance details and is
+intended for use by anyone deploying and/or maintaining Aurora.
+
+For a high level overview of the Aurora storage architecture refer to [this document](/documentation/latest/storage/).
+
+## Scheduler storage configuration flags
+
+Below is a summary of scheduler storage configuration flags that either don't have default
values
+or require attention before deploying in a production environment.
+
+### Mesos replicated log configuration flags
+
+#### -native_log_quorum_size
+Defines the Mesos replicated log quorum size. See
+[the replicated log configuration document](deploying-aurora-scheduler.md#replicated-log-configuration)
+on how to choose the right value.
+
+#### -native_log_file_path
+Location of the Mesos replicated log files. Consider allocating a dedicated disk (preferably
SSD)
+for Mesos replicated log files to ensure optimal storage performance.
+
+#### -native_log_zk_group_path
+ZooKeeper path used for Mesos replicated log quorum discovery.
+
+See [code](../src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java)
for
+other available Mesos replicated log configuration options and default values.
+
+### Backup configuration flags
+
+Configuration options for the Aurora scheduler backup manager.
+
+#### -backup_interval
+The interval on which the scheduler writes local storage backups.  The default is every hour.
+
+#### -backup_dir
+Directory to write backups to.
+
+#### -max_saved_backups
+Maximum number of backups to retain before deleting the oldest backup(s).
+
+## Recovering from a scheduler backup
+
+- [Summary](#summary)
+- [Preparation](#preparation)
+- [Cleanup and re-initialize Mesos replicated log](#cleanup-and-re-initialize-mesos-replicated-log)
+- [Restore from backup](#restore-from-backup)
+- [Cleanup](#cleanup)
+
+**Be sure to read the entire page before attempting to restore from a backup, as it may have
+unintended consequences.**
+
+### Summary
+
+The restoration procedure replaces the existing (possibly corrupted) Mesos replicated log
with an
+earlier, backed up, version and requires all schedulers to be taken down temporarily while
+restoring. Once completed, the scheduler state resets to what it was when the backup was
created.
+This means any jobs/tasks created or updated after the backup are unknown to the scheduler
and will
+be killed shortly after the cluster restarts. All other tasks continue operating as normal.
+
+Usually, it is a bad idea to restore a backup that is not extremely recent (i.e. older than
a few
+hours). This is because the scheduler will expect the cluster to look exactly as the backup
does,
+so any tasks that have been rescheduled since the backup was taken will be killed.
+
+### Preparation
+
+Follow these steps to prepare the cluster for restoring from a backup:
+
+* Stop all scheduler instances
+
+* Consider blocking external traffic on a port defined in `-http_port` for all schedulers
to
+prevent users from interacting with the scheduler during the restoration process. This will
help
+troubleshooting by reducing the scheduler log noise and prevent users from making changes
that will
+be erased after the backup snapshot is restored
+
+* The next steps put the scheduler into a partially disabled state in which it can still
+accept storage recovery requests but cannot schedule tasks or change task states. This may be
+accomplished by updating the following scheduler configuration options:
+  * Set `-mesos_master_address` to a non-existent zk address. This will prevent scheduler
from
+    registering with Mesos. E.g.: `-mesos_master_address=zk://localhost:2181`
+  * Set `-max_registration_delay` to a sufficiently long interval to prevent a registration
+    timeout and the resulting scheduler suicide. E.g.: `-max_registration_delay=360min`
+  * Make sure `-gc_executor_path` option is not set to prevent accidental task GC. This is
+    important as scheduler will attempt to reconcile the cluster state and will kill all
tasks when
+    restarted with an empty Mesos replicated log.
+
+* Restart all schedulers
+
+### Cleanup and re-initialize Mesos replicated log
+
+Get rid of the corrupted files and re-initialize the Mesos replicated log:
+
+* Stop schedulers
+* Delete all files under `-native_log_file_path` on all schedulers
+* Initialize Mesos replica's log file: `mesos-log initialize <-native_log_file_path>`
+* Restart schedulers
+
+### Restore from backup
+
+At this point the scheduler is ready to rehydrate from the backup:
+
+* Identify the leading scheduler by:
+  * running `aurora_admin get_scheduler <cluster>` - if scheduler is responsive
+  * examining scheduler logs
+  * or examining Zookeeper registration under the path defined by `-zk_endpoints`
+    and `-serverset_path`
+
+* Locate the desired backup file, copy it to the leading scheduler and stage recovery by
running
+the following command on a leader
+`aurora_admin scheduler_stage_recovery <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm>`
+
+* At this point, the recovery snapshot is staged and available for manual verification/modification
+via `aurora_admin scheduler_print_recovery_tasks` and `scheduler_delete_recovery_tasks` commands.
+See `aurora_admin help <command>` for usage details.
+
+* Commit recovery. This instructs the scheduler to overwrite the existing Mesos replicated log with
+the provided backup snapshot and initiate a mandatory failover
+`aurora_admin scheduler_commit_recovery <cluster>`
+
+### Cleanup
+Undo any modifications made during the [Preparation](#preparation) sequence.
+
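
The Summary in the file added above advises against restoring a backup older than a few hours, and the restore step names files as `scheduler-backup-<yyyy-MM-dd-HH-mm>`. A minimal Python sketch of picking a candidate under those two constraints might look like this. The helper names and the three-hour default for `max_age` are assumptions for illustration; the doc gives no exact threshold, and this is not part of `aurora_admin`.

```python
from datetime import datetime, timedelta

BACKUP_PREFIX = "scheduler-backup-"
TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M"  # mirrors yyyy-MM-dd-HH-mm from the doc


def parse_backup_time(name):
    """Return the timestamp encoded in a backup file name, or None if it does not match."""
    if not name.startswith(BACKUP_PREFIX):
        return None
    try:
        return datetime.strptime(name[len(BACKUP_PREFIX):], TIMESTAMP_FORMAT)
    except ValueError:
        return None


def newest_recent_backup(names, now, max_age=timedelta(hours=3)):
    """Pick the newest backup that is at most max_age old, or None.

    max_age is a hypothetical stand-in for "a few hours".
    """
    candidates = []
    for name in names:
        taken_at = parse_backup_time(name)
        # Skip unrelated files, stale backups, and timestamps in the future.
        if taken_at is not None and timedelta(0) <= now - taken_at <= max_age:
            candidates.append((taken_at, name))
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    files = [
        "scheduler-backup-2014-10-24-20-00",
        "scheduler-backup-2014-10-25-02-00",
        "notes.txt",
    ]
    print(newest_recent_backup(files, datetime(2014, 10, 25, 3, 36)))
```

The selected name would then be passed to `aurora_admin scheduler_stage_recovery <cluster> ...` as described in the restore step; a `None` result signals that no sufficiently recent backup exists and the operator should weigh the consequences before proceeding.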

Modified: incubator/aurora/site/source/documentation/latest/storage.md
URL: http://svn.apache.org/viewvc/incubator/aurora/site/source/documentation/latest/storage.md?rev=1634176&r1=1634175&r2=1634176&view=diff
==============================================================================
--- incubator/aurora/site/source/documentation/latest/storage.md (original)
+++ incubator/aurora/site/source/documentation/latest/storage.md Sat Oct 25 03:36:58 2014
@@ -1,7 +1,7 @@
 #Aurora Scheduler Storage
 
 - [Overview](#overview)
-- [Reads, writes, modifications...](#reads-writes-modifications)
+- [Reads, writes, modifications](#reads-writes-modifications)
   - [Read lifecycle](#read-lifecycle)
   - [Write lifecycle](#write-lifecycle)
 - [Atomicity, consistency and isolation](#atomicity-consistency-and-isolation)
@@ -33,11 +33,12 @@ is [thrift](https://github.com/apache/th
 This helps establishing periodic recovery checkpoints and speeds up volatile storage recovery
on
 restart.
 * Backup manager: as a precaution, snapshots are periodically written out into backup files.
-This solves a disaster recovery problem in case of a complete loss or corruption of Mesos
log files.
+This solves a [disaster recovery problem](storage-config.md#recovering-from-a-scheduler-backup)
+in case of a complete loss or corruption of Mesos log files.
 
 ![Storage hierarchy](images/storage_hierarchy.png)
 
-## Reads, writes, modifications...
+## Reads, writes, modifications
 
 All services in Aurora access data via a set of predefined store interfaces (aka stores)
logically
 grouped by the type of data they serve. Every interface defines a specific set of operations
allowed


