Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id E56CD200C31 for ; Tue, 21 Feb 2017 21:55:15 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id E411E160B68; Tue, 21 Feb 2017 20:55:15 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id E9027160B7B for ; Tue, 21 Feb 2017 21:55:13 +0100 (CET) Received: (qmail 83838 invoked by uid 500); 21 Feb 2017 20:55:13 -0000 Mailing-List: contact commits-help@aurora.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@aurora.apache.org Delivered-To: mailing list commits@aurora.apache.org Received: (qmail 83675 invoked by uid 99); 21 Feb 2017 20:55:12 -0000 Received: from Unknown (HELO svn01-us-west.apache.org) (209.188.14.144) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 21 Feb 2017 20:55:12 +0000 Received: from svn01-us-west.apache.org (localhost [127.0.0.1]) by svn01-us-west.apache.org (ASF Mail Server at svn01-us-west.apache.org) with ESMTP id 75CF03A3C56 for ; Tue, 21 Feb 2017 20:55:11 +0000 (UTC) Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: svn commit: r1783940 [8/20] - in /aurora/site: data/ publish/ publish/blog/ publish/blog/aurora-0-17-0-released/ publish/documentation/0.10.0/ publish/documentation/0.10.0/build-system/ publish/documentation/0.10.0/client-cluster-configuration/ publish... 
Date: Tue, 21 Feb 2017 20:55:06 -0000
To: commits@aurora.apache.org
From: serb@apache.org
X-Mailer: svnmailer-1.0.9
Message-Id: <20170221205511.75CF03A3C56@svn01-us-west.apache.org>
archived-at: Tue, 21 Feb 2017 20:55:16 -0000

Added: aurora/site/publish/documentation/0.17.0/operations/backup-restore/index.html
URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/0.17.0/operations/backup-restore/index.html?rev=1783940&view=auto
==============================================================================
--- aurora/site/publish/documentation/0.17.0/operations/backup-restore/index.html (added)
+++ aurora/site/publish/documentation/0.17.0/operations/backup-restore/index.html Tue Feb 21 20:54:58 2017
@@ -0,0 +1,230 @@
Apache Aurora

Recovering from a Scheduler Backup

Be sure to read the entire page before attempting to restore from a backup, as the restore may have unintended consequences.

Summary

The restoration procedure replaces the existing (possibly corrupted) Mesos replicated log with an earlier, backed-up version and requires all schedulers to be taken down temporarily while restoring. Once completed, the scheduler state resets to what it was when the backup was created. This means any jobs/tasks created or updated after the backup are unknown to the scheduler and will be killed shortly after the cluster restarts. All other tasks continue operating as normal.

Usually, it is a bad idea to restore a backup that is more than a few hours old. This is because the scheduler will expect the cluster to look exactly as the backup does, so any tasks that have been rescheduled since the backup was taken will be killed.

The instructions below have been verified in a Vagrant environment and, with minor syntax/path changes, should be applicable to any Aurora cluster.
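
As a sketch of the recency advice in this summary, a backup's age can be checked before it is staged. The helper below is hypothetical and not part of Aurora. It assumes backup names follow the scheduler-backup-<yyyy-MM-dd-HH-mm> pattern used later on this page, that the timestamp in the name is UTC, and that "a few hours" is a threshold you choose yourself (the 3-hour default is an arbitrary illustration).

```python
from datetime import datetime, timedelta, timezone

BACKUP_PREFIX = "scheduler-backup-"


def backup_timestamp(name: str) -> datetime:
    """Parse the timestamp from a name such as scheduler-backup-2017-02-21-20-55.

    Assumption: the timestamp in the name is expressed in UTC.
    """
    if not name.startswith(BACKUP_PREFIX):
        raise ValueError(f"not a scheduler backup name: {name!r}")
    stamp = name[len(BACKUP_PREFIX):]
    return datetime.strptime(stamp, "%Y-%m-%d-%H-%M").replace(tzinfo=timezone.utc)


def is_recent(name: str, now: datetime,
              max_age: timedelta = timedelta(hours=3)) -> bool:
    """Return True if the backup is no older than max_age relative to now."""
    return now - backup_timestamp(name) <= max_age
```

Passing `now` explicitly keeps the check deterministic; in practice it would be `datetime.now(timezone.utc)`.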

Preparation

Follow these steps to prepare the cluster for restoring from a backup:
  • Stop all scheduler instances
  • Consider blocking external traffic on the port defined in -http_port for all schedulers to prevent users from interacting with the scheduler during the restoration process. This helps troubleshooting by reducing scheduler log noise and prevents users from making changes that would be erased after the backup snapshot is restored.
  • Configure aurora_admin access to run all commands listed in the Restore from backup section locally on the leading scheduler:
    • Make sure the clusters.json file is configured to access the scheduler directly: set the scheduler_uri setting and remove zk. Since the leader can get re-elected during the restore steps, consider doing this on all scheduler replicas.
    • Depending on your particular security approach, you will need to either turn off scheduler authorization by removing the scheduler -http_authentication_mechanism flag or make sure direct scheduler access is properly authorized. E.g.: in the case of Kerberos you will need to make an /etc/hosts file change to match your local IP to the scheduler URL configured in keytabs.
  • The next steps put the scheduler into a partially disabled state where it can still accept storage recovery requests but cannot schedule tasks or change task states. This may be accomplished by updating the following scheduler configuration options:
    • Set -mesos_master_address to a non-existent zk address. This will prevent the scheduler from registering with Mesos. E.g.: -mesos_master_address=zk://localhost:1111/mesos/master
    • Set -max_registration_delay to a sufficiently long interval to prevent a registration timeout and, as a result, scheduler suicide. E.g.: -max_registration_delay=360mins
    • Make sure the -reconciliation_initial_delay option is set high enough (e.g.: 365days) to prevent accidental task GC. This is important because the scheduler will attempt to reconcile the cluster state and will kill all tasks when restarted with an empty Mesos replicated log.
  • Restart all schedulers
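
The configuration changes in the list above can be sketched as two small transformations. This is an illustration under assumptions, not an Aurora tool: it assumes a clusters.json entry is a JSON object with optional zk and scheduler_uri keys, and that scheduler flags are held as a list of -name=value strings. The function names are hypothetical; the flag values are the examples given in the list.

```python
import copy
from typing import Dict, List

# Overrides taken from the examples in the preparation list.
RECOVERY_FLAGS = {
    "-mesos_master_address": "zk://localhost:1111/mesos/master",
    "-max_registration_delay": "360mins",
    "-reconciliation_initial_delay": "365days",
}


def direct_cluster_entry(entry: Dict, scheduler_uri: str) -> Dict:
    """Return a copy of one clusters.json entry that targets a scheduler directly."""
    updated = copy.deepcopy(entry)
    updated.pop("zk", None)  # drop ZooKeeper-based discovery
    updated["scheduler_uri"] = scheduler_uri
    return updated


def with_recovery_flags(args: List[str]) -> List[str]:
    """Replace or append the recovery overrides in a list of -name=value flags."""
    kept = [a for a in args if a.split("=", 1)[0] not in RECOVERY_FLAGS]
    return kept + [f"{name}={value}" for name, value in RECOVERY_FLAGS.items()]
```

Both functions return new values instead of mutating their inputs, so the original settings remain available for the Cleanup step.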

Cleanup and re-initialize Mesos replicated log

Get rid of the corrupted files and re-initialize the Mesos replicated log:
  • Stop schedulers
  • Delete all files under -native_log_file_path on all schedulers
  • Initialize the Mesos replica’s log file: sudo mesos-log initialize --path=<-native_log_file_path>
  • Start schedulers
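
The delete and initialize steps above could be scripted roughly as follows. This is a minimal sketch with hypothetical helper names: it empties a directory while keeping the directory itself, and builds (but does not run) the mesos-log command shown in the list. The directory passed in is assumed to be the value of -native_log_file_path on that host.

```python
import os
import shutil
from typing import List


def clear_directory(path: str) -> None:
    """Delete everything under path while keeping the directory itself."""
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full) and not os.path.islink(full):
            shutil.rmtree(full)  # real subdirectory: remove recursively
        else:
            os.remove(full)      # regular file or symlink


def initialize_command(native_log_file_path: str) -> List[str]:
    """Build the mesos-log invocation from the step above as an argument list."""
    return ["sudo", "mesos-log", "initialize", f"--path={native_log_file_path}"]
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the schedulers are stopped.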

Restore from backup

At this point the scheduler is ready to rehydrate from the backup:
  • Identify the leading scheduler by:
    • examining the scheduler_lifecycle_LEADER_AWAITING_REGISTRATION metric at the scheduler /vars endpoint. The leader will report 1; all other replicas will report 0.
    • examining scheduler logs
    • or examining the Zookeeper registration under the path defined by -zk_endpoints and -serverset_path
  • Locate the desired backup file, copy it to the leading scheduler’s -backup_dir folder and stage recovery by running the following command on the leader: aurora_admin scheduler_stage_recovery --bypass-leader-redirect <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm>
  • At this point, the recovery snapshot is staged and available for manual verification/modification via the aurora_admin scheduler_print_recovery_tasks --bypass-leader-redirect and scheduler_delete_recovery_tasks --bypass-leader-redirect commands. See aurora_admin help <command> for usage details.
  • Commit recovery. This instructs the scheduler to overwrite the existing Mesos replicated log with the provided backup snapshot and initiate a mandatory failover: aurora_admin scheduler_commit_recovery --bypass-leader-redirect <cluster>
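
The leader check and the stage/commit sequence above can be sketched as follows. The helpers are hypothetical. The sketch assumes the /vars endpoint returns plain text with one "name value" pair per line, and that you have already fetched that text from each scheduler; the cluster name and backup name are placeholders you supply. The commands are only built here, not executed.

```python
from typing import Dict, List, Optional

LEADER_METRIC = "scheduler_lifecycle_LEADER_AWAITING_REGISTRATION"
BYPASS = "--bypass-leader-redirect"


def is_awaiting_leader(vars_text: str) -> bool:
    """Return True if the /vars text reports the leader metric as 1."""
    for line in vars_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == LEADER_METRIC:
            return parts[1] == "1"
    return False  # metric missing: treat the host as a non-leader


def find_leader(vars_by_host: Dict[str, str]) -> Optional[str]:
    """Return the single host marked as leader, or None if zero or several match."""
    leaders = [h for h, text in vars_by_host.items() if is_awaiting_leader(text)]
    return leaders[0] if len(leaders) == 1 else None


def recovery_plan(cluster: str, backup_id: str) -> List[List[str]]:
    """Return the stage and commit commands, in order, as argument lists."""
    return [
        ["aurora_admin", "scheduler_stage_recovery", BYPASS, cluster, backup_id],
        ["aurora_admin", "scheduler_commit_recovery", BYPASS, cluster],
    ]
```

The two commands are kept as separate list entries on purpose: the staged snapshot should be verified manually between staging and committing, so running them back to back would skip that check.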

Cleanup

Undo any modifications made during the Preparation sequence.