aurora-issues mailing list archives

From "John Sirois (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AURORA-1605) Update recovery docs to reflect changes
Date Wed, 03 Feb 2016 17:47:39 GMT

    [ https://issues.apache.org/jira/browse/AURORA-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130751#comment-15130751
] 

John Sirois commented on AURORA-1605:
-------------------------------------

I went through the docs using test_kerberos_end_to_end.sh and hit a few roadblocks / things
that do not jibe with the description in this ticket.  I'm sure I'm missing obvious things,
but if not, my experience is detailed below.

h5. Setup environment to test recovery
# I edited test_kerberos_end_to_end.sh to skip tear-down and then ran it to set up the kerberized
scheduler
# I ssh'd to vagrant and ran the steps through setup manually to get kinit'd as root for the
aurora_admin commands I'd need to run, i.e. roughly:
{noformat}
cd ~/krb5-1.13.1/build
make testrealm
SCHEDULER_HOSTNAME=aurora.local
kadmin.local -q "addprinc -randkey HTTP/$SCHEDULER_HOSTNAME"
rm -f testdir/HTTP-$SCHEDULER_HOSTNAME.keytab.keytab
kadmin.local -q "ktadd -keytab testdir/HTTP-$SCHEDULER_HOSTNAME.keytab HTTP/$SCHEDULER_HOSTNAME"
kadmin.local -q "addprinc -randkey root"
rm -f testdir/root.keytab
kadmin.local -q "ktadd -keytab testdir/root.keytab root"
kinit -k -t "testdir/root.keytab" root
{noformat}
# I ran {{aurora_admin scheduler_backup_now devcluster && aurora_admin scheduler_list_backups
devcluster}} to take a backup and list it

h5. Do a restore

I ran through the restore docs, with details below:

h6. Preparation

{noformat}
$ diff /etc/init/aurora-scheduler-kerberos.conf /etc/init/aurora-scheduler-kerberos.pre-recovery.conf

42,44c42
<   -mesos_master_address=zk://localhost:181/mesos/master \
<   -max_registration_delay=365days \
<   -reconciliation_initial_delay=365days \
---
>   -mesos_master_address=zk://localhost:2181/mesos/master \
{noformat}
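
The edit behind that diff can be scripted. Below is a minimal sketch, assuming the {{-mesos_master_address}} line in the conf looks exactly as in the diff above (port 2181, trailing backslash continuation). The function name {{prepare_recovery_conf}} is hypothetical, and it prints the modified conf to stdout rather than editing the file in place:

```shell
# Hypothetical helper (not part of Aurora): print a recovery-mode copy of an
# upstart conf. Assumes the flag line matches the diff above.
prepare_recovery_conf() {
  conf="$1"
  awk '
    /-mesos_master_address=zk:\/\/localhost:2181\/mesos\/master/ {
      # Swap in the invalid port from the diff (181 instead of 2181).
      sub(/localhost:2181/, "localhost:181")
      print
      # Add the two delay flags shown in the diff.
      print "  -max_registration_delay=365days \\"
      print "  -reconciliation_initial_delay=365days \\"
      next
    }
    { print }
  ' "$conf"
}
```

For example, {{prepare_recovery_conf /etc/init/aurora-scheduler-kerberos.pre-recovery.conf}} should print a conf that differs from the pre-recovery one only by the three lines in the diff.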

h6. Restore from backup

The leading scheduler could only be identified via logs:
{noformat}
sudo grep "Elected as leading scheduler" /var/log/upstart/aurora-scheduler-kerberos.log | tail -1
    I0203 16:57:05.336 [main, SchedulerLifecycle$5:238] Elected as leading scheduler!
{noformat}
or by examining the ZooKeeper nodes:
{noformat}
/usr/share/zookeeper/bin/zkCli.sh ls /aurora/scheduler                               
...
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[singleton_candidate_0000000027]
/usr/share/zookeeper/bin/zkCli.sh get /aurora/scheduler/singleton_candidate_0000000027
...
Connecting to localhost:2181

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
127.0.1.1
cZxid = 0x17b
ctime = Wed Feb 03 17:12:43 UTC 2016
mZxid = 0x17b
mtime = Wed Feb 03 17:12:43 UTC 2016
pZxid = 0x17b
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x152a7d138160090
dataLength = 9
numChildren = 0
{noformat}
At this point, though, all aurora_admin commands (e.g. {{aurora_admin get_scheduler}},
{{aurora_admin scheduler_list_backups}}) fail in the same way:
{noformat}
aurora_admin scheduler_stage_recovery -v --bypass-leader-redirect devcluster scheduler-backup-2016-02-03-16-32
DEBUG] Using auth module: <apache.aurora.kerberos.auth_module.KerberosAuthModule object
at 0x2b628a0b6290>
 INFO] Connecting to 192.168.33.7:2181
 INFO] Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=0, time_out=10000,
session_id=0, passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', read_only=None)
 INFO] Zookeeper connection established, state: CONNECTED
 INFO] Sending request(xid=1): GetChildren(path=u'/aurora/scheduler', watcher=<function
get_watch at 0x2b628a0d6488>)
 INFO] Received response(xid=1): [u'singleton_candidate_0000000022']
 INFO] Sending request(xid=2): GetChildren(path=u'/aurora/scheduler', watcher=None)
 INFO] Received response(xid=2): [u'singleton_candidate_0000000022']
 WARN] Could not connect to scheduler: No schedulers detected in devcluster!
{noformat}

As a result, the only way to complete the rest of the guide was to re-edit {{/etc/init/aurora-scheduler-kerberos.conf}}
and restore the correct {{-mesos_master_address}}.  After doing this and bouncing the scheduler
I could run aurora_admin commands and successfully complete the restore via the rest of the
guide.

So it seems to me the guide needs to suggest, at a high level:
# All schedulers are stopped (say 5 of them).
# All but one scheduler (4 in this example) are prepared as in "Preparation". The remaining
scheduler is prepared as in "Preparation" except for the bit about setting an invalid {{-mesos_master_address}},
and with added emphasis on the bit about port-blocking to prevent user activity.
This special scheduler will be used to run the recovery staging, review and commit.
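
A rough sketch of that proposed order of operations, written as a dry run that only prints the plan. The function name {{recovery_plan}} and the host names in the usage line are hypothetical; the first argument is the scheduler singled out for recovery:

```shell
# Hypothetical dry run (not part of Aurora): print the per-host plan.
# $1 is the recovery scheduler, the remaining arguments are the other schedulers.
recovery_plan() {
  recovery_host="$1"; shift
  # Step 1: every scheduler is stopped, the recovery one included.
  for host in "$recovery_host" "$@"; do
    echo "$host: stop scheduler"
  done
  # Step 2a: the other schedulers get the full "Preparation" treatment.
  for host in "$@"; do
    echo "$host: prepare as in Preparation (invalid -mesos_master_address)"
  done
  # Step 2b: the recovery scheduler keeps a valid master address.
  echo "$recovery_host: prepare as in Preparation, keep valid -mesos_master_address, block ports"
  echo "$recovery_host: stage, review and commit recovery"
}
```

Usage, with made-up host names: {{recovery_plan sched1 sched2 sched3 sched4 sched5}}.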

If I have this approximately right, I concur with [~StephanErb]'s second comment above:
the 1st "Identify the leading scheduler by" method, i.e. {{aurora_admin get_scheduler}},
will then always work, but it's beside the point since the preparation already singled out
a leader to run the recovery against.

This leads me to think the purpose of the "Identify the leading scheduler by" section is to
find the _last_ leading scheduler before recovery operations to then go to that machine and
find the latest backup file.  That file is then copied over to the recovery leading scheduler.
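
Finding the latest backup file on that machine can be sketched as below. This assumes backup files keep the {{scheduler-backup-YYYY-MM-DD-HH-MM}} naming seen in the {{scheduler_stage_recovery}} call above, so that name order matches chronological order. The function name {{latest_backup}} and its directory argument are hypothetical; the actual backup directory depends on how the scheduler is configured:

```shell
# Hypothetical helper (not part of Aurora): print the name of the newest
# scheduler backup in a directory, relying on zero-padded timestamps in names.
latest_backup() {
  dir="$1"
  latest=""
  for f in "$dir"/scheduler-backup-*; do
    [ -e "$f" ] || continue   # the glob matched nothing
    latest="$f"               # globs expand in sorted order, so the last match is the newest
  done
  # Prints nothing and returns non-zero when no backup was found.
  [ -n "$latest" ] && basename "$latest"
}
```

The printed name is what would then be copied to the recovery scheduler and passed to {{aurora_admin scheduler_stage_recovery}}.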

> Update recovery docs to reflect changes
> ---------------------------------------
>
>                 Key: AURORA-1605
>                 URL: https://issues.apache.org/jira/browse/AURORA-1605
>             Project: Aurora
>          Issue Type: Task
>          Components: Documentation
>            Reporter: Joshua Cohen
>            Priority: Minor
>
> We had to restore one of our clusters from backup recently, and it turns out there's
been some drift between the [documented process](https://github.com/apache/aurora/blob/f630bf705ac8a9de2b7b987858ada3b876f65abf/docs/storage-config.md#recovering-from-a-scheduler-backup)
and what's currently necessary.
> Specifically, we needed to disable the leader redirect filter and, I believe, mesos authentication.
> We should make sure the recovery docs are up to date with what's actually required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
