From: wfarner@apache.org
To: commits@aurora.apache.org
Reply-To: dev@aurora.apache.org
Date: Sat, 12 Dec 2015 01:46:49 -0000
Subject: svn commit: r1719617 [3/4] - in /aurora/site: publish/ publish/blog/ publish/blog/2015-upcoming-apache-aurora-meetups/ publish/blog/aurora-0-6-0-incubating-released/ publish/blog/aurora-0-7-0-incubating-released/ publish/blog/aurora-0-8-0-released/ pub...
Message-Id: <20151212014650.4B35C3A2309@svn01-us-west.apache.org>

Modified: aurora/site/publish/documentation/latest/hooks/index.html
URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/latest/hooks/index.html?rev=1719617&r1=1719616&r2=1719617&view=diff
==============================================================================
--- aurora/site/publish/documentation/latest/hooks/index.html (original)
+++ aurora/site/publish/documentation/latest/hooks/index.html Sat Dec 12 01:46:48 2015
@@ -21,12 +21,11 @@
-
-
+ +
@@ -81,7 +81,7 @@ return False. Designers/con consider whether or not to error-trap them. You can error trap at the highest level very generally and always pass the pre_ hook by returning True. For example:

-
def pre_create(...):
+
def pre_create(...):
   do_something()  # if do_something fails with an exception, the create_job is not attempted!
   return True
 
@@ -89,10 +89,11 @@ returning True. For example
 def pre_create(...):
   try:
     do_something()  # may cause exception
-  except Exception:  # generic error trap will catch it
+  except Exception:  # generic error trap will catch it
     pass  # and ignore the exception
   return True  # create_job will run in any case!
-
+
+

post_<method_name>: A post_ hook executes after its associated method successfully finishes running. If it fails, the already executed method is unaffected. A post_ hook’s error is trapped, and any later operations are unaffected.

err_<method_name>: Executes only when its associated method returns a status other than OK or throws an exception. If an err_ hook fails, the already executed method is unaffected. An err_ hook’s error is trapped, and any later operations are unaffected.

@@ -187,11 +188,12 @@ returning True. For example

By default, hooks are inactive. If you do not want to use hooks, you do not need to make any changes to your code. If you do want to use hooks, you will need to alter your .aurora config file to activate them both for the configuration as a whole and for individual Jobs. And, of course, you will need to define in your config file what happens when a particular hook executes.

-

.aurora Config File Settings

+

.aurora Config File Settings

You can define a top-level hooks variable in any .aurora config file. hooks is a list of all objects that define hooks used by Jobs defined in that config file. If you do not want to define any hooks for a configuration, hooks is optional.

-
hooks = [Object_with_defined_hooks1, Object_with_defined_hooks2]
-
+
hooks = [Object_with_defined_hooks1, Object_with_defined_hooks2]
+
+

Be careful when assembling a config file using include on multiple smaller config files. If there are multiple files that assign a value to hooks, only the last assignment made will stick. For example, if x.aurora has hooks = [a, b, c] and y.aurora has hooks = [d, e, f] and z.aurora has, in this order, include x.aurora and include y.aurora, the hooks value will be [d, e, f].
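The last-assignment-wins behaviour described above can be illustrated with a small, self-contained sketch. Plain `exec` over strings is only a hypothetical stand-in for Aurora's actual include machinery, but it models the same namespace semantics:

```python
# Hypothetical model of `include`: each included file is evaluated in the
# shared config namespace, so a later assignment to `hooks` silently
# replaces an earlier one.
x_aurora = "hooks = ['a', 'b', 'c']"   # stands in for x.aurora
y_aurora = "hooks = ['d', 'e', 'f']"   # stands in for y.aurora

namespace = {}
exec(x_aurora, namespace)  # include x.aurora
exec(y_aurora, namespace)  # include y.aurora
print(namespace['hooks'])  # → ['d', 'e', 'f']
```

If you need hooks from both files, combine the lists explicitly in the last file that assigns `hooks`.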

Also, for any Job that you want to use hooks with, its Job definition in the .aurora config file must set an enable_hooks flag to True (it defaults to False). By default, hooks are disabled and you must enable them for Jobs of your choice.

@@ -199,21 +201,24 @@ returning True. For example

To summarize, to use hooks for a particular job, you must both activate hooks for your config file as a whole, and for that job. Activating hooks only for individual jobs won’t work, nor will only activating hooks for your config file as a whole. You must also specify the hooks’ defining object in the hooks variable.

Recall that .aurora config files are written in Pystachio. So the following turns on hooks for production jobs at cluster1 and cluster2, but leaves them off for similar jobs with a defined user role. Of course, you also need to list the objects that define the hooks in your config file’s hooks variable.

-
jobs = [
-        Job(enable_hooks = True, cluster = c, env = 'prod') for c in ('cluster1', 'cluster2')
+
jobs = [
+        Job(enable_hooks = True, cluster = c, env = 'prod') for c in ('cluster1', 'cluster2')
        ]
 jobs.extend(
-   Job(cluster = c, env = 'prod', role = getpass.getuser()) for c in ('cluster1', 'cluster2'))
+   Job(cluster = c, env = 'prod', role = getpass.getuser()) for c in ('cluster1', 'cluster2'))
    # Hooks disabled for these jobs
-
+
+

Command Line

All Aurora Command Line commands now accept an .aurora config file as an optional parameter (some, of course, accept it as a required parameter). Whenever a command has a .aurora file parameter, any hooks specified and activated in the .aurora file can be used. For example:

-
aurora job restart cluster1/role/env/app myapp.aurora
-
+
aurora job restart cluster1/role/env/app myapp.aurora
+
+

The command activates any hooks specified and activated in myapp.aurora. For the restart command, that is the only thing the myapp.aurora parameter does. So, if the command was the following, since there is no .aurora config file to specify any hooks, no hooks on the restart command can run.

-
aurora job restart cluster1/role/env/app
-
+
aurora job restart cluster1/role/env/app
+
+

Hooks Protocol

Any object defined in the .aurora config file can define hook methods. You should define your hook methods within a class, and then use the class name as a value in the hooks list in your config file.

@@ -221,21 +226,23 @@ returning True. For example

Note that you can define other methods in the class that its hook methods can call; all the logic of a hook does not have to be in its definition.

The following example defines a class containing a pre_kill_job hook definition that calls another method defined in the class.

-
# Defines a method pre_kill_job
+
# Defines a method pre_kill_job
 class KillConfirmer(object):
   def confirm(self, msg):
-    return raw_input(msg).lower() == 'yes'
+    return raw_input(msg).lower() == 'yes'
 
   def pre_kill_job(self, job_key, shards=None):
-    shards = ('shards %s' % shards) if shards is not None else 'all shards'
-    return self.confirm('Are you sure you want to kill %s (%s)? (yes/no): '
+    shards = ('shards %s' % shards) if shards is not None else 'all shards'
+    return self.confirm('Are you sure you want to kill %s (%s)? (yes/no): '
                         % (job_key, shards))
-
+
+

pre_ Methods

pre_ methods have the signature:

-
pre_<API method name>(self, <associated method's signature>)
-
+
pre_<API method name>(self, <associated method's signature>)
+
+

pre_ methods have the same signature as their associated method, with the addition of self as the first parameter. See the chart above for the mapping of parameters to methods. When writing pre_ methods, you can use the * and ** syntax to designate that all unspecified parameters are passed in a list to the *ed variable and all named parameters with values are passed as name/value pairs to the **ed variable.

If this method returns False, the API command call aborts.
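A minimal sketch of such a pre_ hook; the class name, the guarded condition, and the `shards` keyword are all hypothetical, chosen only to show the `*`/`**` capture and the abort-on-False behaviour:

```python
# Hypothetical pre_ hook: unspecified positional args land in `args`,
# named args in `kw`; returning False aborts the associated API call.
class SafetyHooks(object):
    def pre_restart(self, *args, **kw):
        if kw.get('shards') == 'all':
            return False  # abort the restart
        return True
```

Here `SafetyHooks().pre_restart(shards='all')` returns False, so the restart would not be attempted.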

@@ -243,8 +250,9 @@ returning True. For example

err_ Methods

err_ methods have the signature:

-
err_<API method name>(self, exc, <associated method's signature>)
-
+
err_<API method name>(self, exc, <associated method's signature>)
+
+

err_ methods have the same signature as their associated method, with the addition of a first parameter self and a second parameter exc. exc is either a result with responseCode other than ResponseCode.OK or an Exception. See the chart above for the mapping of parameters to methods. When writing err_ methods, you can use the * and ** syntax to designate that all unspecified parameters are passed in a list to the *ed variable and all named parameters with values are passed as name/value pairs to the **ed variable.

err_ method return codes are ignored.
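A hedged sketch of the shape an err_ hook might take; the method name `err_kill_job` and the logging behaviour are illustrative, not taken from the Aurora codebase:

```python
# Hypothetical err_ hook: `exc` is either an Exception or a result object
# whose responseCode is not OK.
class ErrReporter(object):
    def err_kill_job(self, exc, *args, **kw):
        if isinstance(exc, Exception):
            print('kill_job raised: %r' % exc)
        else:
            print('kill_job returned non-OK result: %r' % exc)
        return False  # return codes of err_ hooks are ignored anyway
```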

@@ -252,8 +260,9 @@ returning True. For example

post_ Methods

post_ methods have the signature:

-
post_<API method name>(self, result, <associated method signature>)
-
+
post_<API method name>(self, result, <associated method signature>)
+
+

post_ method parameters are self, then result, followed by the same parameter signature as their associated method. result is the result of the associated method call. See the chart above for the mapping of parameters to methods. When writing post_ methods, you can use the * and ** syntax to designate that all unspecified arguments are passed in a list to the *ed parameter and all unspecified named arguments with values are passed as name/value pairs to the **ed parameter.

post_ method return codes are ignored.
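For symmetry, a hedged sketch of a post_ hook; `post_create_job` and its body are illustrative only. `result` is whatever the associated method returned:

```python
# Hypothetical post_ hook: runs after create_job succeeds; any exception
# raised here is trapped, and the return value is ignored.
class PostReporter(object):
    def post_create_job(self, result, *args, **kw):
        print('create_job finished with: %r' % result)
```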

@@ -261,8 +270,9 @@ returning True. For example

Generic Hooks

There are seven Aurora API Methods which any of the three hook types can attach to. Thus, there are 21 possible hook/method combinations for a single .aurora config file. Say that you define pre_ and post_ hooks for the restart method. That leaves 19 undefined hook/method combinations; err_restart and the 3 pre_, post_, and err_ hooks for each of the other 6 hookable methods. You can define what happens when any of these otherwise undefined 19 hooks execute via a generic hook, whose signature is:

-
generic_hook(self, hook_config, event, method_name, result_or_err, args*, kw**)
-
+
generic_hook(self, hook_config, event, method_name, result_or_err, args*, kw**)
+
+

where:

    @@ -280,37 +290,40 @@ returning True. For example

Example:

-
# Overrides the standard do-nothing generic_hook by adding a log writing operation.
+
# Overrides the standard do-nothing generic_hook by adding a log writing operation.
 from twitter.common import log
   class Logger(object):
-    '''Adds to the log every time a hookable API method is called'''
+    '''Adds to the log every time a hookable API method is called'''
    def generic_hook(self, hook_config, event, method_name, result_or_err, *args, **kw):
-      log.info('%s: %s_%s of %s'
+      log.info('%s: %s_%s of %s'
                % (self.__class__.__name__, event, method_name, hook_config.job_key))
-
+
+

Hooks Process Checklist

  1. In your .aurora config file, add a hooks variable. Note that you may want to define a .aurora file only for hook definitions and then include this file in multiple other config files that you want to use the same hooks.
-
hooks = []
-
+
hooks = []
+
+
  2. In the hooks variable, list all objects that define hooks used by Jobs defined in this config:
-
hooks = [Object_hook_definer1, Object_hook_definer2]
-
+
hooks = [Object_hook_definer1, Object_hook_definer2]
+
+
  3. For each job that uses hooks in this config file, add enable_hooks = True to the Job definition. Note that this is necessary even if you only want to use the generic hook.

  4. Write your pre_, post_, and err_ hook definitions as part of an object definition in your .aurora config file.

  5. If desired, write your generic_hook definition as part of an object definition in your .aurora config file. Remember, the object must be listed as a member of hooks.

  6. If your Aurora command line command does not otherwise take an .aurora config file argument, add the appropriate .aurora file as an argument in order to define and activate the configuration’s hooks.

+
- + \ No newline at end of file Modified: aurora/site/publish/documentation/latest/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/latest/index.html?rev=1719617&r1=1719616&r2=1719617&view=diff ============================================================================== --- aurora/site/publish/documentation/latest/index.html (original) +++ aurora/site/publish/documentation/latest/index.html Sat Dec 12 01:46:48 2015 @@ -21,12 +21,11 @@ -
-
+ + - + \ No newline at end of file Modified: aurora/site/publish/documentation/latest/monitoring/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/latest/monitoring/index.html?rev=1719617&r1=1719616&r2=1719617&view=diff ============================================================================== --- aurora/site/publish/documentation/latest/monitoring/index.html (original) +++ aurora/site/publish/documentation/latest/monitoring/index.html Sat Dec 12 01:46:48 2015 @@ -21,12 +21,11 @@ -
-
+ +
@@ -49,7 +49,7 @@ since it will give you a global view of

The scheduler exposes a lot of instrumentation data via its HTTP interface. You can get a quick peek at the first few of these in our vagrant image:

-
$ vagrant ssh -c 'curl -s localhost:8081/vars | head'
+
$ vagrant ssh -c 'curl -s localhost:8081/vars | head'
 async_tasks_completed 1004
 attribute_store_fetch_all_events 15
 attribute_store_fetch_all_events_per_sec 0.0
@@ -60,24 +60,26 @@ attribute_store_fetch_one_events 3391
 attribute_store_fetch_one_events_per_sec 0.0
 attribute_store_fetch_one_nanos_per_event 0.0
 attribute_store_fetch_one_nanos_total 454690753
-
+
+

These values are served as Content-Type: text/plain, with each line containing a space-separated metric name and value. Values may be integers, doubles, or strings (note: strings are static, others may be dynamic).
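As a sketch of how a monitoring consumer might ingest this payload, the following parses the space-separated lines into a dict; `parse_vars` is a hypothetical helper, not part of Aurora:

```python
# Parse the plain-text /vars payload: one "name value" pair per line,
# where values may be integers, doubles, or (static) strings.
sample = """async_tasks_completed 1004
attribute_store_fetch_all_events_per_sec 0.0
jvm_uptime_secs 12345"""

def parse_vars(text):
    metrics = {}
    for line in text.splitlines():
        name, _, value = line.partition(' ')
        for cast in (int, float):
            try:
                metrics[name] = cast(value)
                break
            except ValueError:
                continue
        else:
            metrics[name] = value  # fall back to the raw string
    return metrics

print(parse_vars(sample)['async_tasks_completed'])  # → 1004
```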

If your monitoring infrastructure prefers JSON, the scheduler exports that as well:

-
$ vagrant ssh -c 'curl -s localhost:8081/vars.json | python -mjson.tool | head'
+
$ vagrant ssh -c 'curl -s localhost:8081/vars.json | python -mjson.tool | head'
 {
-    "async_tasks_completed": 1009,
-    "attribute_store_fetch_all_events": 15,
-    "attribute_store_fetch_all_events_per_sec": 0.0,
-    "attribute_store_fetch_all_nanos_per_event": 0.0,
-    "attribute_store_fetch_all_nanos_total": 3048285,
-    "attribute_store_fetch_all_nanos_total_per_sec": 0.0,
-    "attribute_store_fetch_one_events": 3409,
-    "attribute_store_fetch_one_events_per_sec": 0.0,
-    "attribute_store_fetch_one_nanos_per_event": 0.0,
-
+ "async_tasks_completed": 1009, + "attribute_store_fetch_all_events": 15, + "attribute_store_fetch_all_events_per_sec": 0.0, + "attribute_store_fetch_all_nanos_per_event": 0.0, + "attribute_store_fetch_all_nanos_total": 3048285, + "attribute_store_fetch_all_nanos_total_per_sec": 0.0, + "attribute_store_fetch_one_events": 3409, + "attribute_store_fetch_one_events_per_sec": 0.0, + "attribute_store_fetch_one_nanos_per_event": 0.0, +
+

This will be the same data as above, served with Content-Type: application/json.

Viewing live stat samples on the scheduler

@@ -118,177 +120,125 @@ recommend you start with a strict value adjust thresholds as you see fit. Feel free to ask us if you would like to validate that your alerts and thresholds make sense.

-

jvm_uptime_secs

+

Important stats

-

Type: integer counter

+

jvm_uptime_secs

-

Description

+

Type: integer counter

The number of seconds the JVM process has been running. Comes from RuntimeMXBean#getUptime().

-

Alerting

-

Detecting resets (decreasing values) on this stat will tell you that the scheduler is failing to stay alive.

-

Triage

-

Look at the scheduler logs to identify the reason the scheduler is exiting.

-

system_load_avg

+

system_load_avg

Type: double gauge

-

Description

-

The current load average of the system for the last minute. Comes from OperatingSystemMXBean#getSystemLoadAverage().

-

Alerting

-

A high sustained value suggests that the scheduler machine may be over-utilized.

-

Triage

-

Use standard unix tools like top and ps to track down the offending process(es).

-

process_cpu_cores_utilized

+

process_cpu_cores_utilized

Type: double gauge

-

Description

-

The current number of CPU cores in use by the JVM process. This should not exceed the number of logical CPU cores on the machine. Derived from OperatingSystemMXBean#getProcessCpuTime().

-

Alerting

-

A high sustained value indicates that the scheduler is overworked. Due to current internal design limitations, if this value is sustained at 1, there is a good chance the scheduler is under water.

-

Triage

-

There are two main inputs that tend to drive this figure: task scheduling attempts and status updates from Mesos. You may see activity in the scheduler logs to give an indication of where time is being spent. Beyond that, it really takes good familiarity with the code to effectively triage this. We suggest engaging with an Aurora developer.

-

task_store_LOST

+

task_store_LOST

Type: integer gauge

-

Description

-

The number of tasks stored in the scheduler that are in the LOST state, and have been rescheduled.

-

Alerting

-

If this value is increasing at a high rate, it is a sign of trouble.

-

Triage

-

There are many sources of LOST tasks in Mesos: the scheduler, master, slave, and executor can all trigger this. The first step is to look in the scheduler logs for LOST to identify where the state changes are originating.

-

scheduler_resource_offers

+

scheduler_resource_offers

Type: integer counter

-

Description

-

The number of resource offers that the scheduler has received.

-

Alerting

-

For a healthy scheduler, this value must be increasing over time.

-
Triage
-

Assuming the scheduler is up and otherwise healthy, you will want to check if the master thinks it is sending offers. You should also look at the master’s web interface to see if it has a large number of outstanding offers that it is waiting to be returned.

-

framework_registered

+

framework_registered

Type: binary integer counter

-

Description

-

Will be 1 for the leading scheduler that is registered with the Mesos master, 0 for passive schedulers.

-

Alerting

-

A sustained period without a 1 (or where sum() != 1) warrants investigation.

-

Triage

-

If there is no leading scheduler, look in the scheduler and master logs for why. If there are multiple schedulers claiming leadership, this suggests a split brain and warrants filing a critical bug.

-

rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)

+

rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)

Type: rate ratio of integer counters

-

Description

-

This composes two counters to compute a windowed figure for the latency of replicated log writes.

-

Alerting

-

A hike in this value suggests disk bandwidth contention.

-

Triage

-

Look in scheduler logs for any reported oddness with saving to the replicated log. Also use standard tools like vmstat and iotop to identify whether the disk has become slow or over-utilized. We suggest using a dedicated disk for the replicated log to mitigate this.
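The windowed figure above is simply the delta of the nanos counter divided by the delta of the events counter between two samples. A sketch, with a hypothetical function name and hand-picked sample values:

```python
# Windowed rate ratio: average replicated-log append latency between two
# samples of the cumulative counters (nanos_total, events).
def append_latency_nanos(nanos_t0, events_t0, nanos_t1, events_t1):
    d_events = events_t1 - events_t0
    if d_events == 0:
        return 0.0  # no appends in the window
    return (nanos_t1 - nanos_t0) / float(d_events)

print(append_latency_nanos(0, 0, 1000000, 10))  # → 100000.0 ns per append
```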

-

timed_out_tasks

+

timed_out_tasks

Type: integer counter

-

Description

-

Tracks the number of times the scheduler has given up while waiting (for -transient_task_state_timeout) to hear back about a task that is in a transient state (e.g. ASSIGNED, KILLING), and has moved to LOST before rescheduling.

-

Alerting

-

This value is currently known to increase occasionally when the scheduler fails over (AURORA-740). However, any large spike in this value warrants investigation.

-

Triage

-

The scheduler will log when it times out a task. You should trace the task ID of the timed out task into the master, slave, and/or executors to determine where the message was dropped.

-

http_500_responses_events

+

http_500_responses_events

Type: integer counter

-

Description

-

The total number of HTTP 500 status responses sent by the scheduler. Includes API and asset serving.

-

Alerting

-

An increase warrants investigation.

-

Triage

-

Look in scheduler logs to identify why the scheduler returned a 500, there should be a stack trace.

+
- + \ No newline at end of file Modified: aurora/site/publish/documentation/latest/presentations/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/latest/presentations/index.html?rev=1719617&r1=1719616&r2=1719617&view=diff ============================================================================== --- aurora/site/publish/documentation/latest/presentations/index.html (original) +++ aurora/site/publish/documentation/latest/presentations/index.html Sat Dec 12 01:46:48 2015 @@ -21,12 +21,11 @@ -
-
+ +
@@ -90,11 +90,11 @@

March 25, 2014 at Aurora and Mesos Frameworks Meetup

+
- + \ No newline at end of file Modified: aurora/site/publish/documentation/latest/resources/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/latest/resources/index.html?rev=1719617&r1=1719616&r2=1719617&view=diff ============================================================================== --- aurora/site/publish/documentation/latest/resources/index.html (original) +++ aurora/site/publish/documentation/latest/resources/index.html Sat Dec 12 01:46:48 2015 @@ -21,12 +21,11 @@ -
-
+ +
@@ -206,11 +206,11 @@ that role.

production jobs may preempt tasks from any non-production job. A production task may only be preempted by tasks from production jobs in the same role with higher priority.

+
- +
@@ -103,24 +103,26 @@ considerations.

Server Configuration

At a minimum you need to set 4 command-line flags on the scheduler:

-
-http_authentication_mechanism=BASIC
+
-http_authentication_mechanism=BASIC
 -shiro_realm_modules=INI_AUTHNZ
 -shiro_ini_path=path/to/security.ini
-
+
+

And create a security.ini file like so:

-
[users]
+
[users]
 sally = apple, admin
 
 [roles]
 admin = *
-
+
+

The details of the security.ini file are explained below. Note that this file contains plaintext, unhashed passwords.

Client Configuration

To configure the client for HTTP Basic authentication, add an entry to ~/.netrc with your credentials

-
% cat ~/.netrc
+
% cat ~/.netrc
 # ...
 
 machine aurora.example.com
@@ -128,68 +130,78 @@ login sally
 password apple
 
 # ...
-
+
+

No changes are required to clusters.json.

-

HTTP SPNEGO Authentication (Kerberos)

+

HTTP SPNEGO Authentication (Kerberos)

Server Configuration

At a minimum you need to set 6 command-line flags on the scheduler:

-
-http_authentication_mechanism=NEGOTIATE
+
-http_authentication_mechanism=NEGOTIATE
 -shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ
 -kerberos_server_principal=HTTP/aurora.example.com@EXAMPLE.COM
 -kerberos_server_keytab=path/to/aurora.example.com.keytab
 -shiro_ini_path=path/to/security.ini
-
+
+

And create a security.ini file like so:

-
% cat path/to/security.ini
+
% cat path/to/security.ini
 [users]
 sally = _, admin
 
 [roles]
 admin = *
-
+
+

What’s going on here? First, Aurora must be configured to request Kerberos credentials when presented with an unauthenticated request. This is achieved by setting

-
-http_authentication_mechanism=NEGOTIATE
-
+
-http_authentication_mechanism=NEGOTIATE
+
+

Next, a Realm module must be configured to authenticate the current request using the Kerberos credentials that were requested. Aurora ships with a realm module that can do this

-
-shiro_realm_modules=KERBEROS5_AUTHN[,...]
-
+
-shiro_realm_modules=KERBEROS5_AUTHN[,...]
+
+

The Kerberos5Realm requires a keytab file and a server principal name. The principal name will usually be in the form HTTP/aurora.example.com@EXAMPLE.COM.

-
-kerberos_server_principal=HTTP/aurora.example.com@EXAMPLE.COM
+
-kerberos_server_principal=HTTP/aurora.example.com@EXAMPLE.COM
 -kerberos_server_keytab=path/to/aurora.example.com.keytab
-
+
+

The Kerberos5 realm module is authentication-only. For scheduler security to work you must also enable a realm module that provides an Authorizer implementation. For example, to do this using the IniShiroRealmModule:

-
-shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ
-
+
-shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ
+
+

You can then configure authorization using a security.ini file as described below (the password field is ignored). You must configure the realm module with the path to this file:

-
-shiro_ini_path=path/to/security.ini
-
+
-shiro_ini_path=path/to/security.ini
+
+

Client Configuration

To use Kerberos on the client-side you must build Kerberos-enabled client binaries. Do this with

-
./pants binary src/main/python/apache/aurora/client/cli:kaurora
-./pants binary src/main/python/apache/aurora/admin:kaurora_admin
-
+
./pants binary src/main/python/apache/aurora/kerberos:kaurora
+./pants binary src/main/python/apache/aurora/kerberos:kaurora_admin
+
+

You must also configure each cluster where you’ve enabled Kerberos on the scheduler to use Kerberos authentication. Do this by setting auth_mechanism to KERBEROS in clusters.json.

-
% cat ~/.aurora/clusters.json
+
% cat ~/.aurora/clusters.json
 {
-    "devcluster": {
-        "auth_mechanism": "KERBEROS",
+    "devcluster": {
+        "auth_mechanism": "KERBEROS",
         ...
     },
     ...
 }
-
+
+

Authorization

Given a means to authenticate the entity a client claims they are, we need to define what privileges they have.

@@ -202,16 +214,17 @@ likely the preferred approach. However are rapidly changing, or if your access control information already exists in another system.

You can enable INI-based configuration with following scheduler command line arguments:

-
-http_authentication_mechanism=BASIC
+
-http_authentication_mechanism=BASIC
 -shiro_ini_path=path/to/security.ini
-
+
+

note As the argument name reveals, this is using Shiro’s IniRealm behind the scenes.

The INI file will contain two sections - users and roles. Here’s an example for what might be in security.ini:

-
[users]
+
[users]
 sally = apple, admin
 jim = 123456, accounting
 becky = letmein, webapp
@@ -222,7 +235,8 @@ steve = password
 admin = *
 accounting = thrift.AuroraAdmin:setQuota
 webapp = thrift.AuroraSchedulerManager:*:webapp
-
+
+

The users section defines user credentials and the role(s) they are members of. These lines are of the format <user> = <password>[, <role>...]. As you probably noticed, the passwords are in plaintext and as a result read access to this file should be restricted.
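A tiny parser sketch for this line format; `parse_user_line` is hypothetical (Shiro's IniRealm does the real parsing), shown only to make the `<user> = <password>[, <role>...]` shape concrete:

```python
# Split "sally = apple, admin" into (user, password, roles).
def parse_user_line(line):
    user, _, rest = line.partition('=')
    parts = [p.strip() for p in rest.split(',')]
    return user.strip(), parts[0], parts[1:]

print(parse_user_line('sally = apple, admin'))  # → ('sally', 'apple', ['admin'])
```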

@@ -254,7 +268,7 @@ for more information.

Packaging a realm module

Package your custom Realm(s) with a Guice module that exposes a Set<Realm> multibinding.

-
package com.example;
+
package com.example;
 
 import com.google.inject.AbstractModule;
 import com.google.inject.multibindings.Multibinder;
@@ -272,11 +286,13 @@ for more information.

    // Realm implementation.
  }
}
-
+
+

To use your module in the scheduler, include it as a realm module based on its fully-qualified class name:

-
-shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ,com.example.MyRealmModule
-
+
-shiro_realm_modules=KERBEROS5_AUTHN,INI_AUTHNZ,com.example.MyRealmModule
+
+

Known Issues

While the APIs and SPIs we ship with are stable as of 0.8.0, we are aware of several incremental @@ -289,11 +305,11 @@ improvements. Please follow, vote, or se * AURORA-1293: Consider defining a JSON format in place of INI * AURORA-1179: Supported hashed passwords in security.ini * AURORA-1295: Support security for the ReadOnlyScheduler service

+
- + \ No newline at end of file Modified: aurora/site/publish/documentation/latest/sla/index.html URL: http://svn.apache.org/viewvc/aurora/site/publish/documentation/latest/sla/index.html?rev=1719617&r1=1719616&r2=1719617&view=diff ============================================================================== --- aurora/site/publish/documentation/latest/sla/index.html (original) +++ aurora/site/publish/documentation/latest/sla/index.html Sat Dec 12 01:46:48 2015 @@ -21,12 +21,11 @@ -
-
+ +
@@ -60,8 +60,9 @@ Agreements) metrics that defining a contractual relationship between the Aurora/Mesos platform and hosted services.

-

The Aurora SLA feature currently supports stat collection only for service (non-cron) -production jobs ("production = True" in your .aurora config).

+

The Aurora SLA feature is by default only enabled for service (non-cron) +production jobs ("production = True" in your .aurora config). It can be enabled for +non-production services via the scheduler command line flag -sla_non_prod_metrics.

Counters that track SLA measurements are computed periodically within the scheduler. The individual instance metrics are refreshed every minute (configurable via @@ -145,7 +146,7 @@ percentiles (50th,75th,90th,95th and 99t You can also get customized real-time stats from aurora client. See aurora sla -h for more details.

-

Median Time To Assigned (MTTA)

+

Median Time To Assigned (MTTA)

Median time a job spends waiting for its tasks to be assigned to a host. This is a combined metric that helps track the dependency of scheduling performance on the requested resources @@ -187,7 +188,7 @@ metric that helps track the dependency o that are still PENDING. This ensures straggler instances (e.g. with unreasonable resource constraints) do not affect metric curves.
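Both MTTA and MTTR are medians over per-task timings. The underlying computation can be sketched as follows; the scheduler computes these internally, so this is illustrative only:

```python
# Median of per-task wait times (e.g. seconds from PENDING to ASSIGNED).
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

print(median([2, 9, 4, 7, 5]))  # → 5
```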

-

Median Time To Running (MTTR)

+

Median Time To Running (MTTR)

Median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric reflecting on the overall time it takes for the Aurora/Mesos to start executing user content.

@@ -234,11 +235,11 @@ unreasonable resource constraints) do no
  • All metrics are calculated at a pre-defined interval (currently set at 1 minute). Scheduler restarts may result in missed collections.

Modified: aurora/site/publish/documentation/latest/storage-config/index.html

    Mesos replicated log configuration flags

-native_log_quorum_size

    Defines the Mesos replicated log quorum size. See the replicated log configuration document on how to choose the right value.

-native_log_file_path

    Location of the Mesos replicated log files. Consider allocating a dedicated disk (preferably SSD) for Mesos replicated log files to ensure optimal storage performance.

-native_log_zk_group_path

    ZooKeeper path used for Mesos replicated log quorum discovery.


    Configuration options for the Aurora scheduler backup manager.

-backup_interval

    The interval on which the scheduler writes local storage backups. The default is every hour.

-backup_dir

    Directory to write backups to.

-max_saved_backups

    Maximum number of backups to retain before deleting the oldest backup(s).

This is accomplished by updating the following scheduler settings:
  • Set -mesos_master_address to a non-existent zk address. This will prevent the scheduler from registering with Mesos. E.g.: -mesos_master_address=zk://localhost:2181
  • -max_registration_delay - set to a sufficiently long interval to prevent registration timeout and, as a result, scheduler suicide. E.g.: -max_registration_delay=360mins
  • Make sure the -reconciliation_initial_delay option is set high enough (e.g.: 365days) to prevent accidental task GC. This is important as the scheduler will attempt to reconcile the cluster state and will kill all tasks when restarted with an empty Mesos replicated log.
  • Restart all schedulers

    • Stop schedulers
    • Delete all files under -native_log_file_path on all schedulers
    • Initialize Mesos replica’s log file: mesos-log initialize --path=<-native_log_file_path>
    • Restart schedulers

    Cleanup

Undo any modifications done during the Preparation sequence.

Modified: aurora/site/publish/documentation/latest/storage/index.html

    Any time a scheduler restarts, it restores its volatile state from the most recent position recorded in the replicated log by restoring the snapshot and replaying individual log entries on top to fully recover the state up to the last write.
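The snapshot-plus-replay recovery described above can be modeled in a few lines. This toy store is illustrative only and shares nothing with Aurora's actual replicated log implementation:

```python
import copy

class ReplicatedLogStore:
    def __init__(self):
        self.state = {}          # volatile, in-memory view
        self.log = []            # durable, ordered write log
        self.snapshot = ({}, 0)  # (state copy, log position at snapshot time)

    def write(self, key, value):
        self.log.append((key, value))   # durable write first
        self.state[key] = value         # then the volatile view

    def take_snapshot(self):
        self.snapshot = (copy.deepcopy(self.state), len(self.log))

    def recover(self):
        """What a restarted scheduler does: restore the snapshot, then
        replay every log entry written after it."""
        state, pos = self.snapshot
        state = copy.deepcopy(state)
        for key, value in self.log[pos:]:
            state[key] = value
        return state

store = ReplicatedLogStore()
store.write('task/1', 'RUNNING')
store.take_snapshot()
store.write('task/2', 'PENDING')   # written after the snapshot
print(store.recover())             # -> {'task/1': 'RUNNING', 'task/2': 'PENDING'}
```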

Modified: aurora/site/publish/documentation/latest/test-resource-generation/index.html

The Aurora source repository and distributions contain several binary files to qualify the backwards-compatibility of thermos with checkpoint data. Since thermos persists state to disk, to be read by the thermos observer, it is important that we have tests that prevent regressions affecting the ability to parse previously-written data.

    Generating test files

This is accomplished by writing and running a job configuration that exercises the feature, and copying the checkpoint file from the sandbox directory; by default this is /var/run/thermos/checkpoints/<aurora task id>.
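The regression idea behind these golden files can be sketched as follows. The JSON-lines format here is a stand-in for illustration only; thermos's real checkpoint encoding is a binary format not shown in this excerpt:

```python
import json
import os
import tempfile

def write_checkpoint(path, records):
    """Stand-in for an old writer: one JSON record per line."""
    with open(path, 'w') as f:
        for rec in records:
            f.write(json.dumps(rec) + '\n')

def read_checkpoint(path):
    """Stand-in for the current reader under test."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# "Golden" file standing in for a binary checkpoint committed to the repo.
golden = [{'seq': 0, 'state': 'ACTIVE'}, {'seq': 1, 'state': 'SUCCESS'}]
path = os.path.join(tempfile.mkdtemp(), 'checkpoint')
write_checkpoint(path, golden)

# The regression test: previously-written data must still parse.
assert read_checkpoint(path) == golden
```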

Modified: aurora/site/publish/documentation/latest/thrift-deprecation/index.html
See this document for more.

Modified: aurora/site/publish/documentation/latest/tutorial/index.html
import sys
import time

def main(argv):
  SLEEP_DELAY = 10
  # Python ninjas - ignore this blatant bug.
  for i in xrang(100):
    print("Hello world! The time is now: %s. Sleeping for %d secs" % (
      time.asctime(), SLEEP_DELAY))
    sys.stdout.flush()
    time.sleep(SLEEP_DELAY)

if __name__ == "__main__":
  main(sys.argv)

    Aurora Configuration

Once we have our script/program, we need to create a configuration file that tells Aurora how to manage and launch our Job. Save the code below in the file hello_world.aurora.

pkg_path = '/vagrant/hello_world.py'

# we use a trick here to make the configuration change with
# the contents of the file, for simplicity.  in a normal setting, packages would be
# versioned, and the version number would be changed in the configuration.
import hashlib
with open(pkg_path, 'rb') as f:
  pkg_checksum = hashlib.md5(f.read()).hexdigest()

# copy hello_world.py into the local sandbox
install = Process(
  name = 'fetch_package',
  cmdline = 'cp %s . && echo %s && chmod +x hello_world.py' % (pkg_path, pkg_checksum))

# run the script
hello_world = Process(
  name = 'hello_world',
  cmdline = 'python hello_world.py')

# describe the task
hello_world_task = SequentialTask(
  processes = [install, hello_world],
  resources = Resources(cpu = 1, ram = 1*MB, disk=8*MB))

jobs = [
  Service(cluster = 'devcluster',
          environment = 'devel',
          role = 'www-data',
          name = 'hello_world',
          task = hello_world_task)
]
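To see why the md5 trick in the configuration above forces an update, note that the checksum echoed into fetch_package's cmdline changes whenever the script's contents change, so Aurora sees a changed configuration:

```python
import hashlib

def pkg_checksum(contents):
    """Same computation as the .aurora config above, over raw bytes."""
    return hashlib.md5(contents).hexdigest()

v1 = pkg_checksum(b"print('hello')\n")
v2 = pkg_checksum(b"print('hello, world')\n")
print(v1 != v2)  # -> True: an edited script yields a different cmdline
```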

    For more about Aurora configuration files, see the Configuration Tutorial and the Aurora + Thermos Reference (preferably after finishing this tutorial).

What’s Going On In That Configuration File?

    More than you might think.


    /etc/aurora/clusters.json within the Aurora scheduler has the available cluster names. For Vagrant, from the top-level of your Aurora repository clone, do:

$ vagrant ssh

    Followed by:

vagrant@precise64:~$ cat /etc/aurora/clusters.json

    You’ll see something like:

[{
  "name": "devcluster",
  "zk": "192.168.33.7",
  "scheduler_zk_path": "/aurora/scheduler",
  "auth_mechanism": "UNAUTHENTICATED"
}]
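A minimal sketch of reading this file from a client's point of view, assuming only the JSON shape shown above (the literal below copies the example):

```python
import json

# In a real deployment this string would come from /etc/aurora/clusters.json.
CLUSTERS_JSON = '''
[{
  "name": "devcluster",
  "zk": "192.168.33.7",
  "scheduler_zk_path": "/aurora/scheduler",
  "auth_mechanism": "UNAUTHENTICATED"
}]
'''

# Index the cluster entries by name, as a client would when resolving
# the cluster part of a job key.
clusters = {c['name']: c for c in json.loads(CLUSTERS_JSON)}
print(sorted(clusters))              # -> ['devcluster']
print(clusters['devcluster']['zk'])  # -> 192.168.33.7
```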

    Use a name value for your job key’s cluster value.

Role names are user accounts existing on the slave machines. If you don’t know what accounts are available, contact your sysadmin.

    The Aurora Client command that actually runs our Job is aurora job create. It creates a Job as specified by its job key and configuration file arguments and runs it.

aurora job create <cluster>/<role>/<environment>/<jobname> <config_file>

    Or for our example:

aurora job create devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora

    This returns:

$ vagrant ssh
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

vagrant@precise64:~$ aurora job create devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora
 INFO] Response from scheduler: OK (message: 1 new tasks pending for job
  www-data/devel/hello_world)
 INFO] Job url: http://precise64:8081/scheduler/www-data/devel/hello_world

    Watching the Job Run

Now that our job is running, let’s see what it’s doing. Access the job’s scheduler UI page and inspect the failed task’s stderr output. It looks like we made a typo in our Python script. We wanted xrange, not xrang. Edit the hello_world.py script to use the correct function and we will try again.

aurora job update devcluster/www-data/devel/hello_world /vagrant/hello_world.aurora

    This time, the task comes up, we inspect the page, and see that the hello_world process is running.


    Cleanup

    Now that we’re done, we kill the job using the Aurora client:

vagrant@precise64:~$ aurora job killall devcluster/www-data/devel/hello_world
 INFO] Killing tasks for job: devcluster/www-data/devel/hello_world
 INFO] Response from scheduler: OK (message: Tasks killed.)
 INFO] Job url: http://precise64:8081/scheduler/www-data/devel/hello_world
vagrant@precise64:~$

    The job page now shows the hello_world tasks as completed.

    Killed Task page

  • Explore the Aurora Client - use aurora -h, and read the Aurora Client Commands document.
Modified: aurora/site/publish/documentation/latest/user-guide/index.html

    The Executor implements a protocol for rudimentary control of a task via HTTP. Tasks subscribe for this protocol by declaring a port named health. Take for example this configuration snippet:

nginx = Process(
  name = 'nginx',
  cmdline = './run_nginx.sh -port {{thermos.ports[http]}}')

    When this Process is included in a job, the job will be allocated a port, and the command line will be replaced with something like:

./run_nginx.sh -port 42816

Where 42816 happens to be the allocated port. Typically, the Executor monitors Processes within a task only by liveness of the forked process. However, when a health port is allocated, it will also send periodic HTTP health checks. A task requesting a health port must handle the following …
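A task's side of the health protocol might look like the sketch below. Assumption not spelled out in this excerpt: the check is modeled here as an HTTP GET of /health answered with "ok"; consult the full user guide for the exact endpoints a task must implement.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer the executor's periodic health probe.
        if self.path == '/health':
            body, status = b'ok', 200
        else:
            body, status = b'not found', 404
        self.send_response(status)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Port 0 asks the OS for any free port, standing in for the allocated
# 'health' port the scheduler would hand the task.
server = HTTPServer(('127.0.0.1', 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

reply = urllib.request.urlopen(
    'http://127.0.0.1:%d/health' % server.server_port).read()
print(reply)  # -> b'ok'
server.shutdown()
```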

    Part of the output from creating a new Job is a URL for the Job’s scheduler UI page.

    For example:

  vagrant@precise64:~$ aurora job create devcluster/www-data/prod/hello \
  /vagrant/examples/jobs/hello_world.aurora
  INFO] Creating job hello
  INFO] Response from scheduler: OK (message: 1 new tasks pending for job www-data/prod/hello)
  INFO] Job url: http://precise64:8081/scheduler/www-data/prod/hello

    The “Job url” goes to the Job’s scheduler UI page. To go to the overall scheduler UI page, stop at the “scheduler” part of the URL, in this case, http://precise64:8081/scheduler

    You can also reach the scheduler UI page via the Client command aurora job open:

  aurora job open [<cluster>[/<role>[/<env>/<job_name>]]]

    If only the cluster is specified, it goes directly to that cluster’s scheduler main page. If the role is specified, it goes to the top-level role page. If the full job key is specified, it goes directly to the job page where you can inspect individual tasks.
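The path-resolution rules in the previous paragraph can be sketched as a small helper. This is illustrative, not the client's actual code, and ui_path is an invented name:

```python
def ui_path(job_key):
    """Map a (possibly partial) job key to a scheduler UI path.

    Per the rules above: cluster alone -> main scheduler page,
    cluster/role -> top-level role page, full key -> job page. The
    cluster part only selects which scheduler to talk to, so it is
    dropped from the path itself.
    """
    cluster, *rest = job_key.split('/')
    return '/scheduler' + ''.join('/' + part for part in rest)

print(ui_path('devcluster'))                             # -> /scheduler
print(ui_path('devcluster/www-data'))                    # -> /scheduler/www-data
print(ui_path('devcluster/www-data/devel/hello_world'))  # -> /scheduler/www-data/devel/hello_world
```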


    See client commands.
