mesos-commits mailing list archives

From git-site-r...@apache.org
Subject mesos-site git commit: Updated the website built from mesos SHA: 8447d19.
Date Tue, 15 Aug 2017 18:37:45 GMT
Repository: mesos-site
Updated Branches:
  refs/heads/asf-site 7099eafee -> 004423831


Updated the website built from mesos SHA: 8447d19.


Project: http://git-wip-us.apache.org/repos/asf/mesos-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos-site/commit/00442383
Tree: http://git-wip-us.apache.org/repos/asf/mesos-site/tree/00442383
Diff: http://git-wip-us.apache.org/repos/asf/mesos-site/diff/00442383

Branch: refs/heads/asf-site
Commit: 0044238312753454e0d9d8589a95f7c088c9c983
Parents: 7099eaf
Author: jenkins <builds@apache.org>
Authored: Tue Aug 15 18:37:44 2017 +0000
Committer: jenkins <builds@apache.org>
Committed: Tue Aug 15 18:37:44 2017 +0000

----------------------------------------------------------------------
 content/documentation/health-checks/index.html  | 425 +++++++++++++++----
 .../latest/health-checks/index.html             | 425 +++++++++++++++----
 2 files changed, 698 insertions(+), 152 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos-site/blob/00442383/content/documentation/health-checks/index.html
----------------------------------------------------------------------
diff --git a/content/documentation/health-checks/index.html b/content/documentation/health-checks/index.html
index dc9e79f..df5117c 100644
--- a/content/documentation/health-checks/index.html
+++ b/content/documentation/health-checks/index.html
@@ -2,7 +2,7 @@
 <html>
   <head>
     <meta charset="utf-8">
-    <title>Apache Mesos - Task Health Checking</title>
+    <title>Apache Mesos - Task Health Checking and Generalized Checks</title>
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
     <meta property="og:locale" content="en_US"/>
@@ -113,7 +113,7 @@
     <p>See our <a href="/community/">community</a> page for more details.</p>
   </div>
   <div class="col-md-8">
-    <h1>Task Health Checking</h1>
+    <h1>Task Health Checking and Generalized Checks</h1>
 
 <p>Sometimes applications crash, misbehave, or become unresponsive. To detect and
 recover from such situations, some frameworks (e.g.,
@@ -126,7 +126,7 @@ executor to respond to the ping. Although this technique is extremely useful,
 there are several disadvantages in the way it is usually implemented:</p>
 
 <ul>
-<li>Each Mesos framework uses its own API and protocol.</li>
+<li>Each Apache Mesos framework uses its own API and protocol.</li>
 <li>Framework developers have to reimplement common functionality.</li>
 <li>Health checks originating from a scheduler generate extra network traffic if
 the task and the scheduler run on different nodes (which is usually the case);
@@ -138,11 +138,16 @@ health checks for every task can cause scheduler performance problems.</li>
 </ul>
 
 
-<p>To address the aforementioned problems, Mesos 1.2.0 introduces
-<a href="#mesos-native-health-checks">the Mesos-native health check design</a>, defines
-common API for <a href="#command-health-checks">command</a>, <a href="#http-health-checks">HTTP(S)</a>,
-and <a href="#tcp-health-checks">TCP</a> health checks, and provides reference
-implementations for all built-in executors.</p>
+<p>To address the aforementioned problems, Mesos 1.2.0 introduced
+<a href="#mesos-native-checking">the Mesos-native health check design</a>, defined
+a common API for <a href="#command-health-checks">command</a>,
+<a href="#http-health-checks">HTTP(S)</a>, and <a href="#tcp-health-checks">TCP</a> health checks,
+and provided reference implementations for all built-in executors.</p>
+
+<p>Mesos 1.4.0 introduced <a href="#anatomy-of-a-check">a generalized check</a>, which
+delegates interpretation of a check result to the framework. This might be
+useful, for instance, to track tasks' internal state transitions reliably
+without Mesos taking action on them.</p>
 
 <p><strong>NOTE:</strong> Some functionality related to health checking was available prior to
 1.2.0 release, however it was considered experimental.</p>
@@ -152,29 +157,294 @@ using an equivalent of a <code>waitpid()</code> system call. This technique allo
 detecting and reporting process crashes, but is insufficient for cases when the
 process is still running but is not responsive.</p>
 
-<p>This document describes supported health check types, touches on relevant
-implementation details, and mentions limitations and caveats.</p>
+<p>This document describes supported check and health check types, touches on
+relevant implementation details, and mentions limitations and caveats.</p>
 
-<p><a name="mesos-native-health-checks"></a></p>
+<p><a name="mesos-native-checking"></a></p>
 
-<h2>Mesos-native Health Checks</h2>
+<h2>Mesos-native Task Checking</h2>
 
 <p>In contrast to the state-of-the-art &ldquo;scheduler health check&rdquo; pattern mentioned
-above, Mesos-native health checks run on the agent node: it is the executor
+above, Mesos-native checks run on the agent node: it is the executor
 which performs checks and not the scheduler. This improves scalability but means
 that detecting network faults or task availability from the outside world
 becomes a separate concern. For instance, if the task is running on a
-partitioned agent, it will still be health checked and&mdash;if those health checks
-fail&mdash;will be terminated. Needless to say that due to the network partition,
+partitioned agent, it will still be (health) checked and&mdash;if the health checks
+fail&mdash;might be terminated. Needless to say, because of the network partition,
 all this will happen without the framework scheduler being notified.</p>
 
-<p>Task status updates are leveraged to transfer the health check status to the
-Mesos master and further to the framework&rsquo;s scheduler ensuring the
-&ldquo;at-least-once&rdquo; delivery guarantee. The boolean <code>healthy</code> field is used to
-convey health status, which <a href="#current-limitations">may be insufficient</a> in
-certain cases. This means a task that has failed health checks will be <code>RUNNING</code>
-with <code>healthy</code> set to <code>false</code>. Currently, the <code>healthy</code> field is only set for
-<code>TASK_RUNNING</code> status updates.</p>
+<p>Mesos checks and health checks are described in
+<a href="https://github.com/apache/mesos/blob/cdb90b91ce8ce02d6163e5e2ee5b46fb797b1dee/include/mesos/mesos.proto#L403-L485"><code>CheckInfo</code></a>
+and <a href="https://github.com/apache/mesos/blob/cdb90b91ce8ce02d6163e5e2ee5b46fb797b1dee/include/mesos/mesos.proto#L488-L589"><code>HealthCheck</code></a>
+protobufs respectively. Currently, only tasks can be (health) checked, not
+arbitrary processes or executors, i.e., only the <code>TaskInfo</code> protobuf has the
+optional <code>CheckInfo</code> and <code>HealthCheck</code> fields. However, it is worth noting that
+all built-in executors map a task to a process.</p>
+
+<p>Task status updates are leveraged to transfer the check and health check status
+to the Mesos master and further to the framework&rsquo;s scheduler ensuring the
+&ldquo;at-least-once&rdquo; delivery guarantee. To minimize performance overhead, those task
+status updates are triggered if a certain condition is met, e.g., the value or
+presence of a specific field in the check status changes.</p>
+
+<p>When a built-in executor sends a task status update because the check or health
+check status has changed, it sets <code>TaskStatus.reason</code> to
+<code>REASON_TASK_CHECK_STATUS_UPDATED</code> or <code>REASON_TASK_HEALTH_CHECK_STATUS_UPDATED</code>
+respectively. While sending such an update, the executor avoids shadowing other
+data that might have been injected previously, e.g., a check update includes the
+last known update from a health check.</p>
+
+<p>It is the responsibility of the executor to interpret <code>CheckInfo</code> and
+<code>HealthCheck</code> and perform checks appropriately. All built-in executors
+support health checking their tasks and all except the docker executor support
+generalized checks (see <a href="#under-the-hood">implementation details</a> and
+<a href="#current-limitations">limitations</a>).</p>
+
+<p><strong>NOTE:</strong> It is up to the executor how&mdash;and whether at all&mdash;to honor the
+<code>CheckInfo</code> and <code>HealthCheck</code> fields in <code>TaskInfo</code>. Implementations may vary
+significantly depending on what entity <code>TaskInfo</code> represents. On this page only
+the reference implementation for built-in executors is considered.</p>
+
+<p>Custom executors can use <a href="#under-the-hood">the checker library</a>, the reference
+implementation for health checking that all built-in executors rely on.</p>
+
+<h3>On the Differences Between Checks and Health Checks</h3>
+
+<p>When humans read data from a sensor, they may interpret these data and act on
+them. For example, if they check air temperature, they usually interpret
+temperature readings and say whether it’s cold or warm outside; they may also
+act on the interpretation and decide to apply sunscreen or put on an extra
+jacket.</p>
+
+<p>Similar reasoning can be applied to checking task’s state in Mesos:</p>
+
+<ol>
+<li>Perform a check.</li>
+<li>Optionally interpret the result and, for example, declare the task either
+healthy or unhealthy.</li>
+<li>Optionally act on the interpretation by killing an unhealthy task.</li>
+</ol>
+
+
+<p>Mesos health checks do all of the above, 1+2+3: they run the check, declare the
+task healthy or not, and kill it after <code>consecutive_failures</code> have occurred.
+Though efficient and scalable, this strategy is inflexible for the needs of
+frameworks which may want to run an arbitrary check without Mesos interpreting
+the result in any way, for example, to transmit the task’s internal state
+transitions and make global decisions.</p>
+
+<p>Conceptually, a health check is a check with an interpretation and a kill
+policy. A check and a health check differ in how they are specified and
+implemented:</p>
+
+<ul>
+<li>Built-in executors do not (and custom executors shall not) interpret the
+result of a check. If they do, it should be a health check.</li>
+<li>There is no concept of a check failure, hence grace period and consecutive
+failures options are only available for health checks. Note that a check can
+still time out (a health check interprets timeouts as failures); in this case
+an empty result is sent to the scheduler.</li>
+<li>Health checks do not propagate the result of the underlying check to the
+scheduler, only its interpretation: healthy or unhealthy. Note that this may
+change in the future.</li>
+<li>Health check updates are deduplicated based on the interpretation and not the
+result of the underlying check, i.e., given that only HTTP <code>4**</code> status codes
+are considered failures, if the first HTTP check returns <code>200</code> and the second
+<code>202</code>, only one status update after the first success is sent, while a check
+would generate two status updates in this case.</li>
+</ul>
+
+
+<p><strong>NOTE:</strong> Docker executor currently supports health checks but not checks.</p>
+
+<p><strong>NOTE:</strong> Slight changes in protobuf message naming and structure are due to
+backward compatibility reasons; in the future the <code>HealthCheck</code> message will be
+based on <code>CheckInfo</code>.</p>
+
+<p><a name="anatomy-of-a-check"></a></p>
+
+<h2>Anatomy of a Check</h2>
+
+<p>A <code>CheckStatusInfo</code> message is added to the task status update to convey the
+check status. Currently, check status info is only added for <code>TASK_RUNNING</code>
+status updates.</p>
+
+<p>Built-in executors leverage task status updates to deliver check updates to the
+scheduler. To minimize performance overhead, a check-related task status update
+is triggered if and only if the value or presence of any field in
+<code>CheckStatusInfo</code> changes. As the <code>CheckStatusInfo</code> message matures, in the
+future we might deduplicate only on specific fields in <code>CheckStatusInfo</code> to make
+sure that as few updates as possible are sent. Note that custom executors may
+use a different strategy.</p>
+
+<p>To support third party tooling that might not have access to the original
+<code>TaskInfo</code> specification, <code>TaskStatus.check_status</code> generated by built-in
+executors adheres to the following conventions:</p>
+
+<ul>
+<li>If the original <code>TaskInfo</code> has not specified a check,
+<code>TaskStatus.check_status</code> is not present.</li>
+<li>If the check has been specified, <code>TaskStatus.check_status.type</code> indicates the
+check&rsquo;s type.</li>
+<li>If the check result is not available for some reason (a check has not run yet
+or a check has timed out), the corresponding result is empty, e.g.,
+<code>TaskStatus.check_status.command</code> is present and empty.</li>
+</ul>
+
+
+<p><strong>NOTE:</strong> Frameworks that use custom executors are highly advised to follow the
+same principles built-in executors use for consistency.</p>
+
+<p><a name="command-checks"></a></p>
+
+<h3>Command Checks</h3>
+
+<p>Command checks are described by the <code>CommandInfo</code> protobuf wrapped in the
+<code>CheckInfo.Command</code> message; some fields are ignored though: <code>CommandInfo.user</code>
+and <code>CommandInfo.uris</code>. A command check specifies an arbitrary command that is
+used to check a particular condition of the task. The result of the check is the
+exit code of the command.</p>
+
+<p><strong>NOTE:</strong> Docker executor does not currently support checks. For all other
+tasks, including Docker containers launched in the
+<a href="/documentation/latest/./mesos-containerizer/">mesos containerizer</a>, the command will be executed from
+the task&rsquo;s mount namespace.</p>
+
+<p>To specify a command check, set <code>type</code> to <code>CheckInfo::COMMAND</code> and populate
+<code>CheckInfo.Command.CommandInfo</code>, for example:</p>
+
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+CheckInfo check;
+check.set_type(CheckInfo::COMMAND);
+check.mutable_command()-&gt;mutable_command()-&gt;set_value(
+    "ls /checkfile &gt; /dev/null");
+
+task.mutable_check()-&gt;CopyFrom(check);
+</code></pre>
+
+<p><a name="http-checks"></a></p>
+
+<h3>HTTP Checks</h3>
+
+<p>HTTP checks are described by the <code>CheckInfo.Http</code> protobuf with <code>port</code> and
+<code>path</code> fields. A <code>GET</code> request is sent to <code>http://&lt;host&gt;:port/path</code> using the
+<code>curl</code> command. Note that <code>&lt;host&gt;</code> is currently not configurable and is set
+automatically to <code>127.0.0.1</code> (see <a href="#current-limitations">limitations</a>), hence
+the checked task must listen on the loopback interface along with any other
+routeable interface it might be listening on. Field <code>port</code> must specify an
+actual port the task is listening on, not a mapped one. The result of the check
+is the HTTP status code of the response.</p>
+
+<p>If necessary, executors enter the task&rsquo;s network namespace prior to launching
+the <code>curl</code> command.</p>
+
+<p><strong>NOTE:</strong> HTTPS checks are currently not supported.</p>
+
+<p>To specify an HTTP check, set <code>type</code> to <code>CheckInfo::HTTP</code> and populate
+<code>CheckInfo.Http</code>, for example:</p>
+
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+CheckInfo check;
+check.set_type(CheckInfo::HTTP);
+check.mutable_http()-&gt;set_port(8080);
+check.mutable_http()-&gt;set_path("/health");
+
+task.mutable_check()-&gt;CopyFrom(check);
+</code></pre>
+
+<p><a name="tcp-checks"></a></p>
+
+<h3>TCP Checks</h3>
+
+<p>TCP checks are described by the <code>CheckInfo.Tcp</code> protobuf, which has a single
+<code>port</code> field, which must specify an actual port the task is listening on, not a
+mapped one. The task is probed using Mesos' <code>mesos-tcp-connect</code> command, which
+tries to establish a TCP connection to <code>&lt;host&gt;:port</code>. Note that <code>&lt;host&gt;</code> is
+currently not configurable and is set automatically to <code>127.0.0.1</code>
+(see <a href="#current-limitations">limitations</a>), hence the checked task must listen on
+the loopback interface along with any other routeable interface it might be
+listening on.
+The result of the check is a boolean value indicating whether a TCP connection
+succeeded.</p>
+
+<p>If necessary, executors enter the task&rsquo;s network namespace prior to launching
+the <code>mesos-tcp-connect</code> command.</p>
+
+<p>To specify a TCP check, set <code>type</code> to <code>CheckInfo::TCP</code> and populate
+<code>CheckInfo.Tcp</code>, for example:</p>
+
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+CheckInfo check;
+check.set_type(CheckInfo::TCP);
+check.mutable_tcp()-&gt;set_port(8080);
+
+task.mutable_check()-&gt;CopyFrom(check);
+</code></pre>
+
+<h3>Common options</h3>
+
+<p>The <code>CheckInfo</code> protobuf contains common options which regulate how a check must
+be performed by an executor:</p>
+
+<ul>
+<li><code>delay_seconds</code> is the amount of time to wait before starting to check the
+task.</li>
+<li><code>interval_seconds</code> is the interval between check attempts.</li>
+<li><code>timeout_seconds</code> is the amount of time to wait for the check to complete.
+After this timeout, the check attempt is aborted and an empty check update,
+i.e., the absence of the check result, is reported.</li>
+</ul>
+
+
+<p><strong>NOTE:</strong> Since each time a check is performed a helper command is launched
+(see <a href="#current-limitations">limitations</a>), setting <code>timeout_seconds</code> to a small
+value, e.g., <code>&lt;5s</code>, may lead to intermittent failures.</p>
+
+<p><strong>NOTE:</strong> Launching a check is not a free operation. To avoid unpredictable
+spikes in agent&rsquo;s load, e.g., when most of the tasks run their checks
+simultaneously, avoid setting <code>interval_seconds</code> to zero.</p>
+
+<p>As an example, the code below specifies a task which is a Docker container with
+a simple HTTP server listening on port <code>8080</code> and an HTTP check that should be
+performed every <code>5</code> seconds, starting <code>15</code> seconds after the task launch, and
+must complete within <code>1</code> second.</p>
+
+<pre><code class="{.cpp}">TaskInfo task = createTask(...);
+
+// Use Netcat to emulate an HTTP server.
+const string command =
+    "nc -lk -p 8080 -e echo -e \"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\"";
+task.mutable_command()-&gt;set_value(command);
+
+Image image;
+image.set_type(Image::DOCKER);
+image.mutable_docker()-&gt;set_name("alpine");
+
+ContainerInfo* container = task.mutable_container();
+container-&gt;set_type(ContainerInfo::MESOS);
+container-&gt;mutable_mesos()-&gt;mutable_image()-&gt;CopyFrom(image);
+
+// Set `delay_seconds` here because it takes
+// some time to launch Netcat to serve requests.
+CheckInfo check;
+check.set_type(CheckInfo::HTTP);
+check.mutable_http()-&gt;set_port(8080);
+check.set_delay_seconds(15);
+check.set_interval_seconds(5);
+check.set_timeout_seconds(1);
+
+task.mutable_check()-&gt;CopyFrom(check);
+</code></pre>
+
+<h2>Anatomy of a Health Check</h2>
+
+<p>The boolean <code>healthy</code> field is used to convey health status, which
+<a href="#current-limitations">may be insufficient</a> in certain cases. This means a task
+that has failed health checks will be <code>RUNNING</code> with <code>healthy</code> set to <code>false</code>.
+Currently, the <code>healthy</code> field is only set for <code>TASK_RUNNING</code> status updates.</p>
 
 <p>When a task turns unhealthy, a task status update message with the <code>healthy</code>
 field set to <code>false</code> is sent to the Mesos master and then forwarded to a
@@ -185,35 +455,14 @@ consecutive failures defined in the <code>consecutive_failures</code> field of t
 <p><strong>NOTE:</strong> While a scheduler currently cannot cancel a task kill due to failing
 health checks, it may issue a <code>killTask</code> command itself. This may be helpful to
 emulate a &ldquo;global&rdquo; policy for handling tasks with failing health checks (see
-<a href="#current-limitations">limitations</a>).</p>
+<a href="#current-limitations">limitations</a>). Alternatively, the scheduler might use
+<a href="#anatomy-of-a-check">generalized checks</a> instead.</p>
 
 <p>Built-in executors forward all unhealthy status updates, as well as the first
 healthy update when a task turns healthy, i.e., when the task has started or
 after one or more unhealthy updates have occurred. Note that custom executors
 may use a different strategy.</p>
 
-<p>Custom executors can use <a href="#under-the-hood">the health checker library</a>, the
-reference implementation for health checking all built-in executors rely on.</p>
-
-<h2>Anatomy of a Health Check</h2>
-
-<p>Mesos health checks are described in the
-<a href="https://github.com/apache/mesos/blob/1.1.0/include/mesos/mesos.proto#L345"><code>HealthCheck</code></a>
-protobuf. Currently, only tasks can be health checked, not arbitrary processes
-or executors, i.e., only the <code>TaskInfo</code> protobuf has the optional <code>HealthCheck</code>
-field. However, it is worth noting that all built-in executors map a task to a
-process.</p>
-
-<p>It is an executor&rsquo;s responsibility to health check its tasks, because only
-executor knows how to interpret <code>TaskInfo</code>. All built-in executors support
-health checking their tasks (see <a href="#under-the-hood">implementation details</a>
-and <a href="#current-limitations">limitations</a>).</p>
-
-<p><strong>NOTE:</strong> It is up to the executor how&mdash;and whether at all&mdash;to honor
-the <code>HealthCheck</code> field in <code>TaskInfo</code>. Implementations may vary significantly
-depending on what entity <code>TaskInfo</code> represents. In this section only the
-reference implementation for built-in executors is considered.</p>
-
 <p><a name="command-health-checks"></a></p>
 
 <h3>Command Health Checks</h3>
@@ -232,7 +481,9 @@ command will be executed from the task&rsquo;s mount namespace.</p>
 <p>To specify a command health check, set <code>type</code> to <code>HealthCheck::COMMAND</code> and
 populate <code>CommandInfo</code>, for example:</p>
 
-<pre><code class="{.cpp}">HealthCheck healthCheck;
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+HealthCheck healthCheck;
 healthCheck.set_type(HealthCheck::COMMAND);
 healthCheck.mutable_command()-&gt;set_value("ls /checkfile &gt; /dev/null");
 
@@ -246,13 +497,16 @@ task.mutable_health_check()-&gt;CopyFrom(healthCheck);
 <p>HTTP(S) health checks are described by the <code>HealthCheck.HTTPCheckInfo</code> protobuf
 with <code>scheme</code>, <code>port</code>, <code>path</code>, and <code>statuses</code> fields. A <code>GET</code> request is sent to
 <code>scheme://&lt;host&gt;:port/path</code> using the <code>curl</code> command. Note that <code>&lt;host&gt;</code> is
-currently not configurable and is resolved automatically to <code>127.0.0.1</code> (see
-<a href="#current-limitations">limitations</a>). The <code>scheme</code> field supports <code>"http"</code> and
-<code>"https"</code> values only. Field <code>port</code> must specify an actual port the task is
-listening on, not a mapped one.</p>
+currently not configurable and is set automatically to <code>127.0.0.1</code> (see
+<a href="#current-limitations">limitations</a>), hence the health checked task must listen
+on the loopback interface along with any other routeable interface it might be
+listening on. The <code>scheme</code> field supports <code>"http"</code> and <code>"https"</code> values only.
+Field <code>port</code> must specify an actual port the task is listening on, not a mapped
+one.</p>
 
 <p>Built-in executors treat status codes between <code>200</code> and <code>399</code> as success; custom
-executors may employ a different strategy, e.g., leveraging the <code>statuses</code> field.</p>
+executors may employ a different strategy, e.g., leveraging the <code>statuses</code>
+field.</p>
 
 <p><strong>NOTE:</strong> Setting <code>HealthCheck.HTTPCheckInfo.statuses</code> has no effect on the
 built-in executors.</p>
@@ -263,7 +517,9 @@ the <code>curl</code> command.</p>
 <p>To specify an HTTP health check, set <code>type</code> to <code>HealthCheck::HTTP</code> and populate
 <code>HTTPCheckInfo</code>, for example:</p>
 
-<pre><code class="{.cpp}">HealthCheck healthCheck;
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+HealthCheck healthCheck;
 healthCheck.set_type(HealthCheck::HTTP);
 healthCheck.mutable_http()-&gt;set_port(8080);
 healthCheck.mutable_http()-&gt;set_scheme("http");
@@ -280,8 +536,11 @@ task.mutable_health_check()-&gt;CopyFrom(healthCheck);
 which has a single <code>port</code> field, which must specify an actual port the task is
 listening on, not a mapped one. The task is probed using Mesos'
 <code>mesos-tcp-connect</code> command, which tries to establish a TCP connection to
-<code>&lt;host&gt;:port</code>. Note that <code>&lt;host&gt;</code> is currently not configurable and is resolved
-automatically to <code>127.0.0.1</code> (see <a href="#current-limitations">limitations</a>).</p>
+<code>&lt;host&gt;:port</code>. Note that <code>&lt;host&gt;</code> is currently not configurable and is set
+automatically to <code>127.0.0.1</code> (see <a href="#current-limitations">limitations</a>), hence
+the health checked task must listen on the loopback interface along with any
+other routable interface it might be
+listening on.</p>
 
 <p>The health check is considered successful if the connection can be established.</p>
 
@@ -291,7 +550,9 @@ the <code>mesos-tcp-connect</code> command.</p>
 <p>To specify a TCP health check, set <code>type</code> to <code>HealthCheck::TCP</code> and populate
 <code>TCPCheckInfo</code>, for example:</p>
 
-<pre><code class="{.cpp}">HealthCheck healthCheck;
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+HealthCheck healthCheck;
 healthCheck.set_type(HealthCheck::TCP);
 healthCheck.mutable_tcp()-&gt;set_port(8080);
 
@@ -301,7 +562,7 @@ task.mutable_health_check()-&gt;CopyFrom(healthCheck);
 <h3>Common options</h3>
 
 <p>The <code>HealthCheck</code> protobuf contains common options which regulate how a health
-check must be interpreted by an executor:</p>
+check must be performed and interpreted by an executor:</p>
 
 <ul>
 <li><code>delay_seconds</code> is the amount of time to wait until starting health checking
@@ -326,8 +587,8 @@ to a small value, e.g., <code>&lt;5s</code>, may lead to intermittent failures.<
 
 <p>As an example, the code below specifies a task which is a Docker container with
 a simple HTTP server listening on port <code>8080</code> and an HTTP health check that
-should be performed every second starting from the task launch and allows
-consecutive failures during first <code>15</code> seconds and response time under <code>1</code>
+should be performed every <code>5</code> seconds starting from the task launch and allows
+consecutive failures during the first <code>15</code> seconds and response time under <code>1</code>
 second.</p>
 
 <pre><code class="{.cpp}">TaskInfo task = createTask(...);
@@ -351,7 +612,7 @@ HealthCheck healthCheck;
 healthCheck.set_type(HealthCheck::HTTP);
 healthCheck.mutable_http()-&gt;set_port(8080);
 healthCheck.set_delay_seconds(0);
-healthCheck.set_interval_seconds(1);
+healthCheck.set_interval_seconds(5);
 healthCheck.set_timeout_seconds(1);
 healthCheck.set_grace_period_seconds(15);
 
@@ -362,41 +623,53 @@ task.mutable_health_check()-&gt;CopyFrom(healthCheck);
 
 <h2>Under the Hood</h2>
 
-<p>All built-in executors rely on the health checker library, which lives in
+<p>All built-in executors rely on the checker library, which lives in
 <a href="https://github.com/apache/mesos/tree/master/src/checks">&ldquo;src/checks&rdquo;</a>.
-An executor creates an instance of the <code>HealthChecker</code> per task and passes the
-health check definition together with extra parameters. In return, the library
-notifies the executor of changes in the task&rsquo;s health status.</p>
+An executor creates an instance of the <code>Checker</code> or <code>HealthChecker</code> class per
+task and passes the check or health check definition together with extra
+parameters. In return, the library notifies the executor of changes in the
+task&rsquo;s check or health status. For health checks, the definition is converted
+to the check definition before performing the check, and the check result is
+interpreted according to the health check definition.</p>
 
 <p>The library depends on <code>curl</code> for HTTP(S) checks and <code>mesos-tcp-connect</code> for
 TCP checks (the latter is a simple command bundled with Mesos).</p>
 
 <p>One of the most non-trivial things the library takes care of is entering the
 appropriate task&rsquo;s namespaces (<code>mnt</code>, <code>net</code>) on Linux agents. To perform a
-command health check, the checker must be in the same mount namespace as the
-checked process; this is achieved by either calling <code>docker run</code> for the health
-check command in case of <a href="/documentation/latest/./docker-containerizer/">docker containerizer</a> or
-by explicitly calling <code>setns()</code> for <code>mnt</code> namespace in case of
-<a href="/documentation/latest/./mesos-containerizer/">mesos containerizer</a> (see
-<a href="/documentation/latest/./containerizers/">containerization in Mesos</a>). To perform an HTTP(S) or TCP
-health check, the most reliable solution is to share the same network namespace
+command check, the checker must be in the same mount namespace as the checked
+process; this is achieved by either calling <code>docker run</code> for the check command
+in case of <a href="/documentation/latest/./docker-containerizer/">docker containerizer</a> or by explicitly
+calling <code>setns()</code> for <code>mnt</code> namespace in case of <a href="/documentation/latest/./mesos-containerizer/">mesos containerizer</a>
+(see <a href="/documentation/latest/./containerizers/">containerization in Mesos</a>). To perform an HTTP(S) or
+TCP check, the most reliable solution is to share the same network namespace
 with the checked process; in case of docker containerizer <code>setns()</code> for <code>net</code>
 namespace is explicitly called, while mesos containerizer guarantees an executor
 and its tasks are in the same network namespace.</p>
 
-<p><strong>NOTE:</strong> Custom executors may or may not use this library. Please check the
+<p><strong>NOTE:</strong> Custom executors may or may not use this library. Please consult the
 respective framework&rsquo;s documentation.</p>
 
-<p>Regardless of executor, all resources used to health check a task are accounted
-towards task&rsquo;s resource allocation. Hence it is a good idea to add some extra
-resources, e.g., 0.05 cpu and 32MB mem, to the task definition if a Mesos-native
-health check is specified.</p>
+<p>Regardless of executor, all checks and health checks consume resources from the
+task&rsquo;s resource allocation. Hence it is a good idea to add some extra resources,
+e.g., 0.05 cpu and 32MB mem, to the task definition if a Mesos-native check
+and/or health check is specified.</p>
 
 <p><a name="current-limitations"></a></p>
 
-<h2>Current Limitations</h2>
+<h2>Current Limitations and Caveats</h2>
 
 <ul>
+<li>Docker executor does not support generalized checks (see
+<a href="https://issues.apache.org/jira/browse/MESOS-7250">MESOS-7250</a>).</li>
+<li>HTTPS checks are not supported, though HTTPS health checks are (see
+<a href="https://issues.apache.org/jira/browse/MESOS-7356">MESOS-7356</a>).</li>
+<li>Due to the short-polling nature of a check, some task state transitions may be
+missed. For example, if the task transitions are <code>Init [111]</code> &rarr;
+<code>Join [418]</code> &rarr; <code>Ready [200]</code>, the observed HTTP status codes in check
+statuses may be <code>111</code> &rarr; <code>200</code>.</li>
+<li>Due to its short-polling nature, a check whose state oscillates repeatedly
+may lead to scalability issues due to a high volume of task status updates.</li>
 <li>When a task becomes unhealthy, it is deemed to be killed after
 <code>HealthCheck.consecutive_failures</code> failures. This decision is taken locally by
 an executor, there is no way for a scheduler to intervene and react

http://git-wip-us.apache.org/repos/asf/mesos-site/blob/00442383/content/documentation/latest/health-checks/index.html
----------------------------------------------------------------------
diff --git a/content/documentation/latest/health-checks/index.html b/content/documentation/latest/health-checks/index.html
index 46001fc..28f2bcc 100644
--- a/content/documentation/latest/health-checks/index.html
+++ b/content/documentation/latest/health-checks/index.html
@@ -2,7 +2,7 @@
 <html>
   <head>
     <meta charset="utf-8">
-    <title>Apache Mesos - Task Health Checking</title>
+    <title>Apache Mesos - Task Health Checking and Generalized Checks</title>
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
     <meta property="og:locale" content="en_US"/>
@@ -113,7 +113,7 @@
     <p>See our <a href="/community/">community</a> page for more details.</p>
   </div>
   <div class="col-md-8">
-    <h1>Task Health Checking</h1>
+    <h1>Task Health Checking and Generalized Checks</h1>
 
 <p>Sometimes applications crash, misbehave, or become unresponsive. To detect and
 recover from such situations, some frameworks (e.g.,
@@ -126,7 +126,7 @@ executor to respond to the ping. Although this technique is extremely useful,
 there are several disadvantages in the way it is usually implemented:</p>
 
 <ul>
-<li>Each Mesos framework uses its own API and protocol.</li>
+<li>Each Apache Mesos framework uses its own API and protocol.</li>
 <li>Framework developers have to reimplement common functionality.</li>
 <li>Health checks originating from a scheduler generate extra network traffic if
 the task and the scheduler run on different nodes (which is usually the case);
@@ -138,11 +138,16 @@ health checks for every task can cause scheduler performance problems.</li>
 </ul>
 
 
-<p>To address the aforementioned problems, Mesos 1.2.0 introduces
-<a href="#mesos-native-health-checks">the Mesos-native health check design</a>, defines
-common API for <a href="#command-health-checks">command</a>, <a href="#http-health-checks">HTTP(S)</a>,
-and <a href="#tcp-health-checks">TCP</a> health checks, and provides reference
-implementations for all built-in executors.</p>
+<p>To address the aforementioned problems, Mesos 1.2.0 introduced
+<a href="#mesos-native-checking">the Mesos-native health check design</a>, defined
+a common API for <a href="#command-health-checks">command</a>,
+<a href="#http-health-checks">HTTP(S)</a>, and <a href="#tcp-health-checks">TCP</a> health checks,
+and provided reference implementations for all built-in executors.</p>
+
+<p>Mesos 1.4.0 introduced <a href="#anatomy-of-a-check">a generalized check</a>, which
+delegates interpretation of a check result to the framework. This might be
+useful, for instance, to track tasks' internal state transitions reliably
+without Mesos taking action on them.</p>
 
 <p><strong>NOTE:</strong> Some functionality related to health checking was available prior to
 1.2.0 release, however it was considered experimental.</p>
@@ -152,29 +157,294 @@ using an equivalent of a <code>waitpid()</code> system call. This technique allo
 detecting and reporting process crashes, but is insufficient for cases when the
 process is still running but is not responsive.</p>
 
-<p>This document describes supported health check types, touches on relevant
-implementation details, and mentions limitations and caveats.</p>
+<p>This document describes supported check and health check types, touches on
+relevant implementation details, and mentions limitations and caveats.</p>
 
-<p><a name="mesos-native-health-checks"></a></p>
+<p><a name="mesos-native-checking"></a></p>
 
-<h2>Mesos-native Health Checks</h2>
+<h2>Mesos-native Task Checking</h2>
 
 <p>In contrast to the state-of-the-art &ldquo;scheduler health check&rdquo; pattern mentioned
-above, Mesos-native health checks run on the agent node: it is the executor
+above, Mesos-native checks run on the agent node: it is the executor
 which performs checks and not the scheduler. This improves scalability but means
 that detecting network faults or task availability from the outside world
 becomes a separate concern. For instance, if the task is running on a
-partitioned agent, it will still be health checked and&mdash;if those health checks
-fail&mdash;will be terminated. Needless to say that due to the network partition,
+partitioned agent, it will still be (health) checked and&mdash;if the health checks
+fail&mdash;might be terminated. Needless to say, due to the network partition,
 all this will happen without the framework scheduler being notified.</p>
 
-<p>Task status updates are leveraged to transfer the health check status to the
-Mesos master and further to the framework&rsquo;s scheduler ensuring the
-&ldquo;at-least-once&rdquo; delivery guarantee. The boolean <code>healthy</code> field is used to
-convey health status, which <a href="#current-limitations">may be insufficient</a> in
-certain cases. This means a task that has failed health checks will be <code>RUNNING</code>
-with <code>healthy</code> set to <code>false</code>. Currently, the <code>healthy</code> field is only set for
-<code>TASK_RUNNING</code> status updates.</p>
+<p>Mesos checks and health checks are described in
+<a href="https://github.com/apache/mesos/blob/cdb90b91ce8ce02d6163e5e2ee5b46fb797b1dee/include/mesos/mesos.proto#L403-L485"><code>CheckInfo</code></a>
+and <a href="https://github.com/apache/mesos/blob/cdb90b91ce8ce02d6163e5e2ee5b46fb797b1dee/include/mesos/mesos.proto#L488-L589"><code>HealthCheck</code></a>
+protobufs respectively. Currently, only tasks can be (health) checked, not
+arbitrary processes or executors, i.e., only the <code>TaskInfo</code> protobuf has the
+optional <code>CheckInfo</code> and <code>HealthCheck</code> fields. However, it is worth noting that
+all built-in executors map a task to a process.</p>
+
+<p>Task status updates are leveraged to transfer the check and health check status
+to the Mesos master and further to the framework&rsquo;s scheduler ensuring the
+&ldquo;at-least-once&rdquo; delivery guarantee. To minimize performance overhead, those task
+status updates are triggered if a certain condition is met, e.g., the value or
+presence of a specific field in the check status changes.</p>
+
+<p>When a built-in executor sends a task status update because the check or health
+check status has changed, it sets <code>TaskStatus.reason</code> to
+<code>REASON_TASK_CHECK_STATUS_UPDATED</code> or <code>REASON_TASK_HEALTH_CHECK_STATUS_UPDATED</code>
+respectively. While sending such an update, the executor avoids shadowing other
+data that might have been injected previously, e.g., a check update includes the
+last known update from a health check.</p>
+
+<p>It is the responsibility of the executor to interpret <code>CheckInfo</code> and
+<code>HealthCheck</code> and perform checks appropriately. All built-in executors
+support health checking their tasks and all except the docker executor support
+generalized checks (see <a href="#under-the-hood">implementation details</a> and
+<a href="#current-limitations">limitations</a>).</p>
+
+<p><strong>NOTE:</strong> It is up to the executor how&mdash;and whether at all&mdash;to honor the
+<code>CheckInfo</code> and <code>HealthCheck</code> fields in <code>TaskInfo</code>. Implementations may vary
+significantly depending on what entity <code>TaskInfo</code> represents. On this page only
+the reference implementation for built-in executors is considered.</p>
+
+<p>Custom executors can use <a href="#under-the-hood">the checker library</a>, the reference
+implementation for health checking that all built-in executors rely on.</p>
+
+<h3>On the Differences Between Checks and Health Checks</h3>
+
+<p>When humans read data from a sensor, they may interpret these data and act on
+them. For example, if they check air temperature, they usually interpret
+temperature readings and say whether it’s cold or warm outside; they may also
+act on the interpretation and decide to apply sunscreen or put on an extra
+jacket.</p>
+
+<p>Similar reasoning can be applied to checking a task’s state in Mesos:</p>
+
+<ol>
+<li>Perform a check.</li>
+<li>Optionally interpret the result and, for example, declare the task either
+healthy or unhealthy.</li>
+<li>Optionally act on the interpretation by killing an unhealthy task.</li>
+</ol>
+
+
+<p>Mesos health checks do all of the above, 1+2+3: they run the check, declare the
+task healthy or not, and kill it after <code>consecutive_failures</code> have occurred.
+Though efficient and scalable, this strategy is inflexible for the needs of
+frameworks which may want to run an arbitrary check without Mesos interpreting
+the result in any way, for example, to transmit the task’s internal state
+transitions and make global decisions.</p>
+
+<p>Conceptually, a health check is a check with an interpretation and a kill
+policy. A check and a health check differ in how they are specified and
+implemented:</p>
+
+<ul>
+<li>Built-in executors do not (and custom executors shall not) interpret the
+result of a check. If they do, it should be a health check.</li>
+<li>There is no concept of a check failure, hence grace period and consecutive
+failures options are only available for health checks. Note that a check can
+still time out (a health check interprets timeouts as failures); in this case
+an empty result is sent to the scheduler.</li>
+<li>Health checks do not propagate the result of the underlying check to the
+scheduler, only its interpretation: healthy or unhealthy. Note that this may
+change in the future.</li>
+<li>Health check updates are deduplicated based on the interpretation and not the
+result of the underlying check, i.e., given that only HTTP <code>4**</code> status codes
+are considered failures, if the first HTTP check returns <code>200</code> and the second
+<code>202</code>, only one status update after the first success is sent, while a check
+would generate two status updates in this case.</li>
+</ul>
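The deduplication difference described in the last list item can be sketched in plain C++. This is a minimal illustration and not Mesos code: `isHealthy`, `countCheckUpdates`, `countHealthUpdates`, and `example` are hypothetical names, the 200 to 399 success range is the one the built-in executors apply to HTTP health checks, and the initial empty check status is ignored for simplicity.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of Mesos): built-in executors treat
// HTTP status codes from 200 to 399 as success for health checks.
bool isHealthy(int httpStatus)
{
  return httpStatus >= 200 && httpStatus <= 399;
}

// A generalized check reports the raw result, so an update is sent
// whenever the observed status code differs from the previous one.
int countCheckUpdates(const std::vector<int>& statuses)
{
  int updates = 0;
  bool hasLast = false;
  int last = 0;

  for (int status : statuses) {
    if (!hasLast || status != last) {
      ++updates;
    }
    hasLast = true;
    last = status;
  }
  return updates;
}

// A health check reports only the interpretation: every unhealthy result
// is forwarded, but consecutive healthy results produce a single update.
int countHealthUpdates(const std::vector<int>& statuses)
{
  int updates = 0;
  bool lastWasHealthy = false;

  for (int status : statuses) {
    const bool healthy = isHealthy(status);
    if (!healthy || !lastWasHealthy) {
      ++updates;
    }
    lastWasHealthy = healthy;
  }
  return updates;
}

// The case from the text: responses 200 and then 202.
void example()
{
  assert(countCheckUpdates({200, 202}) == 2);   // Two distinct results.
  assert(countHealthUpdates({200, 202}) == 1);  // Both are "healthy".
}
```

Under these assumptions, the pair of responses `200` and `202` yields two check updates but only one health update, which matches the behaviour described in the list.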
+
+
+<p><strong>NOTE:</strong> Docker executor currently supports health checks but not checks.</p>
+
+<p><strong>NOTE:</strong> Slight changes in protobuf message naming and structure are due to
+backward compatibility reasons; in the future the <code>HealthCheck</code> message will be
+based on <code>CheckInfo</code>.</p>
+
+<p><a name="anatomy-of-a-check"></a></p>
+
+<h2>Anatomy of a Check</h2>
+
+<p>A <code>CheckStatusInfo</code> message is added to the task status update to convey the
+check status. Currently, check status info is only added for <code>TASK_RUNNING</code>
+status updates.</p>
+
+<p>Built-in executors leverage task status updates to deliver check updates to the
+scheduler. To minimize performance overhead, a check-related task status update
+is triggered if and only if the value or presence of any field in
+<code>CheckStatusInfo</code> changes. As the <code>CheckStatusInfo</code> message matures, in the
+future we might deduplicate only on specific fields in <code>CheckStatusInfo</code> to make
+sure that as few updates as possible are sent. Note that custom executors may
+use a different strategy.</p>
+
+<p>To support third party tooling that might not have access to the original
+<code>TaskInfo</code> specification, <code>TaskStatus.check_status</code> generated by built-in
+executors adheres to the following conventions:</p>
+
+<ul>
+<li>If the original <code>TaskInfo</code> has not specified a check,
+<code>TaskStatus.check_status</code> is not present.</li>
+<li>If the check has been specified, <code>TaskStatus.check_status.type</code> indicates the
+check&rsquo;s type.</li>
+<li>If the check result is not available for some reason (a check has not run yet
+or a check has timed out), the corresponding result is empty, e.g.,
+<code>TaskStatus.check_status.command</code> is present and empty.</li>
+</ul>
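A consumer that sees only task status updates could apply these conventions roughly as follows. This is a sketch under stated assumptions: `CommandResult`, `CheckStatus`, `describe`, and `example` are simplified, hypothetical stand-ins and not the Mesos protobuf classes, and only a command check is modeled.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for the check status carried in a task status
// update; these are NOT the Mesos protobuf classes.
struct CommandResult
{
  bool hasExitCode = false;  // False while no result is available.
  int exitCode = 0;
};

struct CheckStatus
{
  // Only a command check is modeled in this sketch.
  CommandResult command;
};

// `status` is null when the original task did not specify a check.
std::string describe(const CheckStatus* status)
{
  if (status == nullptr) {
    return "no check specified";
  }

  if (!status->command.hasExitCode) {
    // Present but empty: the check has not run yet or has timed out.
    return "command check: result not available";
  }

  return "command check: exit code " +
         std::to_string(status->command.exitCode);
}

// Walks through the three cases listed above.
void example()
{
  assert(describe(nullptr) == "no check specified");

  CheckStatus pending;
  assert(describe(&pending) == "command check: result not available");

  CheckStatus finished;
  finished.command.hasExitCode = true;
  finished.command.exitCode = 1;
  assert(describe(&finished) == "command check: exit code 1");
}
```

The point of the sketch is that "no check", "check without a result", and "check with a result" stay distinguishable without access to the original `TaskInfo`.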
+
+
+<p><strong>NOTE:</strong> Frameworks that use custom executors are highly advised to follow the
+same principles built-in executors use for consistency.</p>
+
+<p><a name="command-checks"></a></p>
+
+<h3>Command Checks</h3>
+
+<p>Command checks are described by the <code>CommandInfo</code> protobuf wrapped in the
+<code>CheckInfo.Command</code> message; some fields are ignored though: <code>CommandInfo.user</code>
+and <code>CommandInfo.uris</code>. A command check specifies an arbitrary command that is
+used to check a particular condition of the task. The result of the check is the
+exit code of the command.</p>
+
+<p><strong>NOTE:</strong> Docker executor does not currently support checks. For all other
+tasks, including Docker containers launched in the
+<a href="/documentation/latest/./mesos-containerizer/">mesos containerizer</a>, the command will be executed from
+the task&rsquo;s mount namespace.</p>
+
+<p>To specify a command check, set <code>type</code> to <code>CheckInfo::COMMAND</code> and populate
+<code>CheckInfo.Command.CommandInfo</code>, for example:</p>
+
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+CheckInfo check;
+check.set_type(CheckInfo::COMMAND);
+check.mutable_command()-&gt;mutable_command()-&gt;set_value(
+    "ls /checkfile &gt; /dev/null");
+
+task.mutable_check()-&gt;CopyFrom(check);
+</code></pre>
+
+<p><a name="http-checks"></a></p>
+
+<h3>HTTP Checks</h3>
+
+<p>HTTP checks are described by the <code>CheckInfo.Http</code> protobuf with <code>port</code> and
+<code>path</code> fields. A <code>GET</code> request is sent to <code>http://&lt;host&gt;:port/path</code> using the
+<code>curl</code> command. Note that <code>&lt;host&gt;</code> is currently not configurable and is set
+automatically to <code>127.0.0.1</code> (see <a href="#current-limitations">limitations</a>), hence
+the checked task must listen on the loopback interface along with any other
+routable interface it might be listening on. Field <code>port</code> must specify an
+actual port the task is listening on, not a mapped one. The result of the check
+is the HTTP status code of the response.</p>
+
+<p>If necessary, executors enter the task&rsquo;s network namespace prior to launching
+the <code>curl</code> command.</p>
+
+<p><strong>NOTE:</strong> HTTPS checks are currently not supported.</p>
+
+<p>To specify an HTTP check, set <code>type</code> to <code>CheckInfo::HTTP</code> and populate
+<code>CheckInfo.Http</code>, for example:</p>
+
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+CheckInfo check;
+check.set_type(CheckInfo::HTTP);
+check.mutable_http()-&gt;set_port(8080);
+check.mutable_http()-&gt;set_path("/health");
+
+task.mutable_check()-&gt;CopyFrom(check);
+</code></pre>
+
+<p><a name="tcp-checks"></a></p>
+
+<h3>TCP Checks</h3>
+
+<p>TCP checks are described by the <code>CheckInfo.Tcp</code> protobuf, which has a single
+<code>port</code> field, which must specify an actual port the task is listening on, not a
+mapped one. The task is probed using Mesos' <code>mesos-tcp-connect</code> command, which
+tries to establish a TCP connection to <code>&lt;host&gt;:port</code>. Note that <code>&lt;host&gt;</code> is
+currently not configurable and is set automatically to <code>127.0.0.1</code>
+(see <a href="#current-limitations">limitations</a>), hence the checked task must listen on
+the loopback interface along with any other routable interface it might be
+listening on. The result of the check is a boolean value indicating whether a
+TCP connection succeeded.</p>
+
+<p>If necessary, executors enter the task&rsquo;s network namespace prior to launching
+the <code>mesos-tcp-connect</code> command.</p>
+
+<p>To specify a TCP check, set <code>type</code> to <code>CheckInfo::TCP</code> and populate
+<code>CheckInfo.Tcp</code>, for example:</p>
+
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+CheckInfo check;
+check.set_type(CheckInfo::TCP);
+check.mutable_tcp()-&gt;set_port(8080);
+
+task.mutable_check()-&gt;CopyFrom(check);
+</code></pre>
+
+<h3>Common options</h3>
+
+<p>The <code>CheckInfo</code> protobuf contains common options which regulate how a check must
+be performed by an executor:</p>
+
+<ul>
+<li><code>delay_seconds</code> is the amount of time to wait before starting to check the
+task.</li>
+<li><code>interval_seconds</code> is the interval between check attempts.</li>
+<li><code>timeout_seconds</code> is the amount of time to wait for the check to complete.
+After this timeout, the check attempt is aborted and an empty check update,
+i.e., the absence of the check result, is reported.</li>
+</ul>
+
+
+<p><strong>NOTE:</strong> Since each time a check is performed a helper command is launched
+(see <a href="#current-limitations">limitations</a>), setting <code>timeout_seconds</code> to a small
+value, e.g., <code>&lt;5s</code>, may lead to intermittent failures.</p>
+
+<p><strong>NOTE:</strong> Launching a check is not a free operation. To avoid unpredictable
+spikes in the agent&rsquo;s load, e.g., when most of the tasks run their checks
+simultaneously, avoid setting <code>interval_seconds</code> to zero.</p>
+
+<p>As an example, the code below specifies a task which is a Docker container with
+a simple HTTP server listening on port <code>8080</code> and an HTTP check that should be
+performed every <code>5</code> seconds, starting <code>15</code> seconds after the task launch, with
+each response expected within <code>1</code> second.</p>
+
+<pre><code class="{.cpp}">TaskInfo task = createTask(...);
+
+// Use Netcat to emulate an HTTP server.
+const string command =
+    "nc -lk -p 8080 -e echo -e \"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\"";
+task.mutable_command()-&gt;set_value(command);
+
+Image image;
+image.set_type(Image::DOCKER);
+image.mutable_docker()-&gt;set_name("alpine");
+
+ContainerInfo* container = task.mutable_container();
+container-&gt;set_type(ContainerInfo::MESOS);
+container-&gt;mutable_mesos()-&gt;mutable_image()-&gt;CopyFrom(image);
+
+// Set `delay_seconds` here because it takes
+// some time to launch Netcat to serve requests.
+CheckInfo check;
+check.set_type(CheckInfo::HTTP);
+check.mutable_http()-&gt;set_port(8080);
+check.set_delay_seconds(15);
+check.set_interval_seconds(5);
+check.set_timeout_seconds(1);
+
+task.mutable_check()-&gt;CopyFrom(check);
+</code></pre>
+
+<h2>Anatomy of a Health Check</h2>
+
+<p>The boolean <code>healthy</code> field is used to convey health status, which
+<a href="#current-limitations">may be insufficient</a> in certain cases. This means a task
+that has failed health checks will be <code>RUNNING</code> with <code>healthy</code> set to <code>false</code>.
+Currently, the <code>healthy</code> field is only set for <code>TASK_RUNNING</code> status updates.</p>
 
 <p>When a task turns unhealthy, a task status update message with the <code>healthy</code>
 field set to <code>false</code> is sent to the Mesos master and then forwarded to a
@@ -185,35 +455,14 @@ consecutive failures defined in the <code>consecutive_failures</code> field of t
 <p><strong>NOTE:</strong> While a scheduler currently cannot cancel a task kill due to failing
 health checks, it may issue a <code>killTask</code> command itself. This may be helpful to
 emulate a &ldquo;global&rdquo; policy for handling tasks with failing health checks (see
-<a href="#current-limitations">limitations</a>).</p>
+<a href="#current-limitations">limitations</a>). Alternatively, the scheduler might use
+<a href="#anatomy-of-a-check">generalized checks</a> instead.</p>
 
 <p>Built-in executors forward all unhealthy status updates, as well as the first
 healthy update when a task turns healthy, i.e., when the task has started or
 after one or more unhealthy updates have occurred. Note that custom executors
 may use a different strategy.</p>
 
-<p>Custom executors can use <a href="#under-the-hood">the health checker library</a>, the
-reference implementation for health checking all built-in executors rely on.</p>
-
-<h2>Anatomy of a Health Check</h2>
-
-<p>Mesos health checks are described in the
-<a href="https://github.com/apache/mesos/blob/1.1.0/include/mesos/mesos.proto#L345"><code>HealthCheck</code></a>
-protobuf. Currently, only tasks can be health checked, not arbitrary processes
-or executors, i.e., only the <code>TaskInfo</code> protobuf has the optional <code>HealthCheck</code>
-field. However, it is worth noting that all built-in executors map a task to a
-process.</p>
-
-<p>It is an executor&rsquo;s responsibility to health check its tasks, because only
-executor knows how to interpret <code>TaskInfo</code>. All built-in executors support
-health checking their tasks (see <a href="#under-the-hood">implementation details</a>
-and <a href="#current-limitations">limitations</a>).</p>
-
-<p><strong>NOTE:</strong> It is up to the executor how&mdash;and whether at all&mdash;to honor
-the <code>HealthCheck</code> field in <code>TaskInfo</code>. Implementations may vary significantly
-depending on what entity <code>TaskInfo</code> represents. In this section only the
-reference implementation for built-in executors is considered.</p>
-
 <p><a name="command-health-checks"></a></p>
 
 <h3>Command Health Checks</h3>
@@ -232,7 +481,9 @@ command will be executed from the task&rsquo;s mount namespace.</p>
 <p>To specify a command health check, set <code>type</code> to <code>HealthCheck::COMMAND</code> and
 populate <code>CommandInfo</code>, for example:</p>
 
-<pre><code class="{.cpp}">HealthCheck healthCheck;
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+HealthCheck healthCheck;
 healthCheck.set_type(HealthCheck::COMMAND);
 healthCheck.mutable_command()-&gt;set_value("ls /checkfile &gt; /dev/null");
 
@@ -246,13 +497,16 @@ task.mutable_health_check()-&gt;CopyFrom(healthCheck);
 <p>HTTP(S) health checks are described by the <code>HealthCheck.HTTPCheckInfo</code> protobuf
 with <code>scheme</code>, <code>port</code>, <code>path</code>, and <code>statuses</code> fields. A <code>GET</code> request is sent to
 <code>scheme://&lt;host&gt;:port/path</code> using the <code>curl</code> command. Note that <code>&lt;host&gt;</code> is
-currently not configurable and is resolved automatically to <code>127.0.0.1</code> (see
-<a href="#current-limitations">limitations</a>). The <code>scheme</code> field supports <code>"http"</code> and
-<code>"https"</code> values only. Field <code>port</code> must specify an actual port the task is
-listening on, not a mapped one.</p>
+currently not configurable and is set automatically to <code>127.0.0.1</code> (see
+<a href="#current-limitations">limitations</a>), hence the health checked task must listen
+on the loopback interface along with any other routable interface it might be
+listening on. The <code>scheme</code> field supports <code>"http"</code> and <code>"https"</code> values only.
+Field <code>port</code> must specify an actual port the task is listening on, not a mapped
+one.</p>
 
 <p>Built-in executors treat status codes between <code>200</code> and <code>399</code> as success; custom
-executors may employ a different strategy, e.g., leveraging the <code>statuses</code> field.</p>
+executors may employ a different strategy, e.g., leveraging the <code>statuses</code>
+field.</p>
 
 <p><strong>NOTE:</strong> Setting <code>HealthCheck.HTTPCheckInfo.statuses</code> has no effect on the
 built-in executors.</p>
@@ -263,7 +517,9 @@ the <code>curl</code> command.</p>
 <p>To specify an HTTP health check, set <code>type</code> to <code>HealthCheck::HTTP</code> and populate
 <code>HTTPCheckInfo</code>, for example:</p>
 
-<pre><code class="{.cpp}">HealthCheck healthCheck;
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+HealthCheck healthCheck;
 healthCheck.set_type(HealthCheck::HTTP);
 healthCheck.mutable_http()-&gt;set_port(8080);
 healthCheck.mutable_http()-&gt;set_scheme("http");
@@ -280,8 +536,11 @@ task.mutable_health_check()-&gt;CopyFrom(healthCheck);
 which has a single <code>port</code> field, which must specify an actual port the task is
 listening on, not a mapped one. The task is probed using Mesos'
 <code>mesos-tcp-connect</code> command, which tries to establish a TCP connection to
-<code>&lt;host&gt;:port</code>. Note that <code>&lt;host&gt;</code> is currently not configurable and is resolved
-automatically to <code>127.0.0.1</code> (see <a href="#current-limitations">limitations</a>).</p>
+<code>&lt;host&gt;:port</code>. Note that <code>&lt;host&gt;</code> is currently not configurable and is set
+automatically to <code>127.0.0.1</code> (see <a href="#current-limitations">limitations</a>), hence
+the health checked task must listen on the loopback interface along with any
+other routable interface it might be listening on.</p>
 
 <p>The health check is considered successful if the connection can be established.</p>
 
@@ -291,7 +550,9 @@ the <code>mesos-tcp-connect</code> command.</p>
 <p>To specify a TCP health check, set <code>type</code> to <code>HealthCheck::TCP</code> and populate
 <code>TCPCheckInfo</code>, for example:</p>
 
-<pre><code class="{.cpp}">HealthCheck healthCheck;
+<pre><code class="{.cpp}">TaskInfo task = [...];
+
+HealthCheck healthCheck;
 healthCheck.set_type(HealthCheck::TCP);
 healthCheck.mutable_tcp()-&gt;set_port(8080);
 
@@ -301,7 +562,7 @@ task.mutable_health_check()-&gt;CopyFrom(healthCheck);
 <h3>Common options</h3>
 
 <p>The <code>HealthCheck</code> protobuf contains common options which regulate how a health
-check must be interpreted by an executor:</p>
+check must be performed and interpreted by an executor:</p>
 
 <ul>
 <li><code>delay_seconds</code> is the amount of time to wait until starting health checking
@@ -326,8 +587,8 @@ to a small value, e.g., <code>&lt;5s</code>, may lead to intermittent failures.<
 
 <p>As an example, the code below specifies a task which is a Docker container with
 a simple HTTP server listening on port <code>8080</code> and an HTTP health check that
-should be performed every second starting from the task launch and allows
-consecutive failures during first <code>15</code> seconds and response time under <code>1</code>
+should be performed every <code>5</code> seconds starting from the task launch and allows
+consecutive failures during the first <code>15</code> seconds and response time under <code>1</code>
 second.</p>
 
 <pre><code class="{.cpp}">TaskInfo task = createTask(...);
@@ -351,7 +612,7 @@ HealthCheck healthCheck;
 healthCheck.set_type(HealthCheck::HTTP);
 healthCheck.mutable_http()-&gt;set_port(8080);
 healthCheck.set_delay_seconds(0);
-healthCheck.set_interval_seconds(1);
+healthCheck.set_interval_seconds(5);
 healthCheck.set_timeout_seconds(1);
 healthCheck.set_grace_period_seconds(15);
 
@@ -362,41 +623,53 @@ task.mutable_health_check()-&gt;CopyFrom(healthCheck);
 
 <h2>Under the Hood</h2>
 
-<p>All built-in executors rely on the health checker library, which lives in
+<p>All built-in executors rely on the checker library, which lives in
 <a href="https://github.com/apache/mesos/tree/master/src/checks">&ldquo;src/checks&rdquo;</a>.
-An executor creates an instance of the <code>HealthChecker</code> per task and passes the
-health check definition together with extra parameters. In return, the library
-notifies the executor of changes in the task&rsquo;s health status.</p>
+An executor creates an instance of the <code>Checker</code> or <code>HealthChecker</code> class per
+task and passes the check or health check definition together with extra
+parameters. In return, the library notifies the executor of changes in the
+task&rsquo;s check or health status. For health checks, the definition is converted
+to the check definition before performing the check, and the check result is
+interpreted according to the health check definition.</p>
 
 <p>The library depends on <code>curl</code> for HTTP(S) checks and <code>mesos-tcp-connect</code> for
 TCP checks (the latter is a simple command bundled with Mesos).</p>
 
 <p>One of the most non-trivial things the library takes care of is entering the
 appropriate task&rsquo;s namespaces (<code>mnt</code>, <code>net</code>) on Linux agents. To perform a
-command health check, the checker must be in the same mount namespace as the
-checked process; this is achieved by either calling <code>docker run</code> for the health
-check command in case of <a href="/documentation/latest/./docker-containerizer/">docker containerizer</a> or
-by explicitly calling <code>setns()</code> for <code>mnt</code> namespace in case of
-<a href="/documentation/latest/./mesos-containerizer/">mesos containerizer</a> (see
-<a href="/documentation/latest/./containerizers/">containerization in Mesos</a>). To perform an HTTP(S) or TCP
-health check, the most reliable solution is to share the same network namespace
+command check, the checker must be in the same mount namespace as the checked
+process; this is achieved by either calling <code>docker run</code> for the check command
+in case of <a href="/documentation/latest/./docker-containerizer/">docker containerizer</a> or by explicitly
+calling <code>setns()</code> for <code>mnt</code> namespace in case of <a href="/documentation/latest/./mesos-containerizer/">mesos containerizer</a>
+(see <a href="/documentation/latest/./containerizers/">containerization in Mesos</a>). To perform an HTTP(S) or
+TCP check, the most reliable solution is to share the same network namespace
 with the checked process; in case of docker containerizer <code>setns()</code> for <code>net</code>
 namespace is explicitly called, while mesos containerizer guarantees an executor
 and its tasks are in the same network namespace.</p>
 
-<p><strong>NOTE:</strong> Custom executors may or may not use this library. Please check the
+<p><strong>NOTE:</strong> Custom executors may or may not use this library. Please consult the
 respective framework&rsquo;s documentation.</p>
 
-<p>Regardless of executor, all resources used to health check a task are accounted
-towards task&rsquo;s resource allocation. Hence it is a good idea to add some extra
-resources, e.g., 0.05 cpu and 32MB mem, to the task definition if a Mesos-native
-health check is specified.</p>
+<p>Regardless of executor, all checks and health checks consume resources from the
+task&rsquo;s resource allocation. Hence it is a good idea to add some extra resources,
+e.g., 0.05 cpu and 32MB mem, to the task definition if a Mesos-native check
+and/or health check is specified.</p>
 
 <p><a name="current-limitations"></a></p>
 
-<h2>Current Limitations</h2>
+<h2>Current Limitations and Caveats</h2>
 
 <ul>
+<li>Docker executor does not support generalized checks (see
+<a href="https://issues.apache.org/jira/browse/MESOS-7250">MESOS-7250</a>).</li>
+<li>HTTPS checks are not supported, though HTTPS health checks are (see
+<a href="https://issues.apache.org/jira/browse/MESOS-7356">MESOS-7356</a>).</li>
+<li>Due to the short-polling nature of a check, some task state transitions may be
+missed. For example, if the task transitions are <code>Init [111]</code> &rarr;
+<code>Join [418]</code> &rarr; <code>Ready [200]</code>, the observed HTTP status codes in check
+statuses may be <code>111</code> &rarr; <code>200</code>.</li>
+<li>Due to its short-polling nature, a check whose state oscillates repeatedly
+may lead to scalability issues due to a high volume of task status updates.</li>
 <li>When a task becomes unhealthy, it is deemed to be killed after
 <code>HealthCheck.consecutive_failures</code> failures. This decision is taken locally by
 an executor, there is no way for a scheduler to intervene and react

