Subject: svn commit: r1690830 [4/4] - in /mesos/site: publish/ publish/documentation/ publish/documentation/app-framework-development-guide/ publish/documentation/attributes-resources/ publish/documentation/configuration/ publish/documentation/latest/ publish/d...
Date: Mon, 13 Jul 2015 22:09:26 -0000 To: commits@mesos.apache.org From: dlester@apache.org X-Mailer: svnmailer-1.0.9 Message-Id: <20150713220927.CB817AC0E7E@hades.apache.org> Modified: mesos/site/source/documentation/latest.html.md URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest.html.md?rev=1690830&r1=1690829&r2=1690830&view=diff ============================================================================== --- mesos/site/source/documentation/latest.html.md (original) +++ mesos/site/source/documentation/latest.html.md Mon Jul 13 22:09:25 2015 @@ -21,7 +21,7 @@ layout: documentation * [High Availability](/documentation/latest/high-availability/) for running multiple masters simultaneously. * [Operational Guide](/documentation/latest/operational-guide/) * [Monitoring](/documentation/latest/monitoring/) -* [Network Monitoring](/documentation/latest/network-monitoring/) +* [Network Monitoring and Isolation](/documentation/latest/network-monitoring/) * [Slave Recovery](/documentation/latest/slave-recovery/) for doing seamless upgrades. * [Tools](/documentation/latest/tools/) for setting up and running a Mesos cluster. Modified: mesos/site/source/documentation/latest/app-framework-development-guide.md URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest/app-framework-development-guide.md?rev=1690830&r1=1690829&r2=1690830&view=diff ============================================================================== --- mesos/site/source/documentation/latest/app-framework-development-guide.md (original) +++ mesos/site/source/documentation/latest/app-framework-development-guide.md Mon Jul 13 22:09:25 2015 @@ -15,6 +15,7 @@ You can write a framework scheduler in C ### Scheduler API Declared in `MESOS_HOME/include/mesos/scheduler.hpp` + ~~~{.cpp} /* * Empty virtual destructor (necessary to instantiate subclasses). 
@@ -68,7 +69,7 @@ virtual void resourceOffers(SchedulerDri * Invoked when an offer is no longer valid (e.g., the slave was * lost or another framework used resources in the offer). If for * whatever reason an offer is never rescinded (e.g., dropped - * message, failing over framework, etc.), a framwork that attempts + * message, failing over framework, etc.), a framework that attempts * to launch tasks using an invalid offer will receive TASK_LOST * status updats for those tasks (see Scheduler::resourceOffers). */ Modified: mesos/site/source/documentation/latest/attributes-resources.md URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest/attributes-resources.md?rev=1690830&r1=1690829&r2=1690830&view=diff ============================================================================== --- mesos/site/source/documentation/latest/attributes-resources.md (original) +++ mesos/site/source/documentation/latest/attributes-resources.md Mon Jul 13 22:09:25 2015 @@ -6,39 +6,45 @@ layout: documentation The Mesos system has two basic methods to describe the slaves that comprise a cluster. One of these is managed by the Mesos master, the other is simply passed onwards to the frameworks using the cluster. -## Attributes +## Types -The attributes are simply key value string pairs that Mesos passes along when it sends offers to frameworks. +The types of values that are supported by Attributes and Resources in Mesos are scalar, ranges, sets and text. - attributes : attribute ( ";" attribute )* +The following are the definitions of these types: - attribute : labelString ":" ( labelString | "," )+ + scalar : floatValue -## Resources + floatValue : ( intValue ( "." intValue )? ) | ... -The Mesos system can manage 3 different *types* of resources: scalars, ranges, and sets. These are used to represent the different resources that a Mesos slave has to offer. For example, a scalar resource type could be used to represent the amount of memory on a slave. 
Each resource is identified by a key string. + intValue : [0-9]+ - resources : resource ( ";" resource )* + range : "[" rangeValue ( "," rangeValue )* "]" - resource : key ":" ( scalar | range | set ) + rangeValue : scalar "-" scalar - key : labelString ( "(" resourceRole ")" )? + set : "{" text ( "," text )* "}" - scalar : floatValue + text : [a-zA-Z0-9_/.-] - range : "[" rangeValue ( "," rangeValue )* "]" +## Attributes - rangeValue : scalar "-" scalar +Attributes are key value pairs (where value is optional) that Mesos passes along when it sends offers to frameworks. An attribute value supports 3 different *types*: scalar, range or text. - set : "{" labelString ( "," labelString )* "}" + attributes : attribute ( ";" attribute )* - resourceRole : labelString | "*" + attribute : text ":" ( scalar | range | text ) - labelString : [a-zA-Z0-9_/.-] +## Resources - floatValue : ( intValue ( "." intValue )? ) | ... +The Mesos system can manage 3 different *types* of resources: scalars, ranges, and sets. These are used to represent the different resources that a Mesos slave has to offer. For example, a scalar resource type could be used to represent the amount of memory on a slave. Each resource is identified by a key string. - intValue : [0-9]+ + resources : resource ( ";" resource )* + + resource : key ":" ( scalar | range | set ) + + key : text ( "(" resourceRole ")" )? + + resourceRole : text | "*" ## Predefined Uses & Conventions @@ -56,7 +62,7 @@ In particular, a slave without `cpus` an Here are some examples for configuring the Mesos slaves. --resources='cpus:24;mem:24576;disk:409600;ports:[21000-24000];bugs:{a,b,c}' - --attributes='rack:abc;zone:west;os:centos5,full' + --attributes='rack:abc;zone:west;os:centos5;level:10;keys:[1000-1500]' In this case, we have three different types of resources, scalars, a range, and a set. They are called `cpus`, `mem`, `disk`, and the range type is `ports`. 
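As a rough illustration of the grammar in this hunk, the following Python sketch splits an `--attributes` string into typed values. The names `parse_value` and `parse_attributes` are hypothetical helpers written for this example and are not part of Mesos; only the `intValue ( "." intValue )?` form of `floatValue` is handled, because the grammar elides the alternatives.

```python
import re

# Matches only the scalar form shown in the grammar: intValue ( "." intValue )?
SCALAR_RE = re.compile(r"[0-9]+(\.[0-9]+)?")


def parse_value(value):
    """Classify one attribute value as range, scalar, or text (hypothetical helper)."""
    if value.startswith("[") and value.endswith("]"):
        pairs = []
        for part in value[1:-1].split(","):
            low, high = part.split("-")
            pairs.append((float(low), float(high)))
        return ("range", pairs)
    if SCALAR_RE.fullmatch(value):
        return ("scalar", float(value))
    return ("text", value)


def parse_attributes(spec):
    """Parse 'key:value;key:value' into a dict of (type, value) tuples."""
    result = {}
    for item in spec.split(";"):
        if not item:
            continue
        key, _, value = item.partition(":")
        result[key] = parse_value(value)
    return result


attrs = parse_attributes("rack:abc;zone:west;os:centos5;level:10;keys:[1000-1500]")
for name, (kind, value) in attrs.items():
    print(name, kind, value)
```

Run on the example flag value above, the sketch classifies `rack`, `zone` and `os` as text, `level` as a scalar and `keys` as a range.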
@@ -68,6 +74,8 @@ In this case, we have three different ty In the case of attributes, we end up with three attributes: - - `rack` with value `abc` - - `zone` with value `west` - - `os` with value `centos5,full` + - `rack` with text value `abc` + - `zone` with text value `west` + - `os` with text value `centos5` + - `level` with scalar value 10 + - `keys` with range value `1000` through `1500` (inclusive) \ No newline at end of file Modified: mesos/site/source/documentation/latest/configuration.md URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest/configuration.md?rev=1690830&r1=1690829&r2=1690830&view=diff ============================================================================== --- mesos/site/source/documentation/latest/configuration.md (original) +++ mesos/site/source/documentation/latest/configuration.md Mon Jul 13 22:09:25 2015 @@ -14,7 +14,7 @@ Configuration values are searched for fi **Important Options** -If you have special compilation requirements, please refer to `./configure --help` when configuring Mesos. Additionally, the documentation lists only a subset of the options. A definitive source for which flags your version of Mesos supports can be found by running the binary with the flag `--help`, for example `mesos-master --help`. +If you have special compilation requirements, please refer to `./configure --help` when configuring Mesos. Additionally, this documentation lists only a recent snapshot of the options in Mesos. A definitive source for which flags your version of Mesos supports can be found by running the binary with the flag `--help`, for example `mesos-master --help`. ## Master and Slave Options @@ -33,11 +33,12 @@ If you have special compilation requirem - --ip=VALUE + --external_log_file=VALUE - IP address to listen on - + Specified the externally managed log file. This file will be + exposed in the webui and HTTP api. This is useful when using + stderr logging as the log file is otherwise unknown to Mesos. 
@@ -74,6 +75,24 @@ If you have special compilation requirem + --[no-]initialize_driver_logging + + + Whether to automatically initialize Google logging of scheduler + and/or executor drivers. (default: true) + + + + + --ip=VALUE + + + IP address to listen on + + + + + --log_dir=VALUE @@ -316,16 +335,6 @@ file:///path/to/file (where file contain - --external_log_file=VALUE - - - Specified the externally managed log file. This file will be - exposed in the webui and HTTP api. This is useful when using - stderr logging as the log file is otherwise unknown to Mesos. - - - - --framework_sorter=VALUE @@ -744,6 +753,15 @@ file:///path/to/file (where file contain + --[no-]cgroups_cpu_enable_pids_and_tids_count + + + Cgroups feature flag to enable counting of processes and threads + inside a container. (default: false) + + + + --[no-]cgroups_enable_cfs @@ -980,31 +998,36 @@ file:///path/to/file (where file contain - --executor_registration_timeout=VALUE + --executor_environment_variables - Amount of time to wait for an executor - to register with the slave before considering it hung and - shutting it down (e.g., 60secs, 3mins, etc) (default: 1mins) + JSON object representing the environment variables that should be + passed to the executor, and thus subsequently task(s). + By default the executor will inherit the slave's environment variables. + Example: +
{
+  "PATH": "/bin:/usr/bin",
+  "LD_LIBRARY_PATH": "/usr/local/lib"
+}
- --executor_shutdown_grace_period=VALUE + --executor_registration_timeout=VALUE Amount of time to wait for an executor - to shut down (e.g., 60secs, 3mins, etc) (default: 5secs) + to register with the slave before considering it hung and + shutting it down (e.g., 60secs, 3mins, etc) (default: 1mins) - --external_log_file=VALUE + --executor_shutdown_grace_period=VALUE - Specified the externally managed log file. This file will be - exposed in the webui and HTTP api. This is useful when using - stderr logging as the log file is otherwise unknown to Mesos. + Amount of time to wait for an executor + to shut down (e.g., 60secs, 3mins, etc) (default: 5secs) @@ -1139,6 +1162,17 @@ file:///path/to/file (where file contain + --oversubscribed_resources_interval=VALUE + + + The slave periodically updates the master with the current estimation + about the total amount of oversubscribed resources that are allocated + and available. The interval between updates is controlled by this flag. + (default: 15secs) + + + + --perf_duration=VALUE @@ -1174,6 +1208,25 @@ file:///path/to/file (where file contain + --qos_controller=VALUE + + + The name of the QoS Controller to use for oversubscription. + + + + + --qos_correction_interval_min=VALUE + + + The slave polls and carries out QoS corrections from the QoS + Controller based on its observed performance of running tasks. + The smallest interval between these corrections is controlled by + this flag. (default: 0secs) + + + + --recover=VALUE @@ -1221,6 +1274,14 @@ file:///path/to/file (where file contain + --resource_estimator=VALUE + + + The name of the resource estimator to use for oversubscription. + + + + --resource_monitoring_interval=VALUE @@ -1240,6 +1301,16 @@ file:///path/to/file (where file contain + --[no-]revocable_cpu_low_priority + + + Run containers with revocable CPU at a lower priority than + normal containers (non-revocable cpu). Currently only + supported by the cgroups/cpu isolator. 
(default: true) + + + + --slave_subsystems=VALUE @@ -1357,11 +1428,22 @@ file:///path/to/file (where file contain - --[no-]network_enable_socket_statistics + --[no-]network_enable_socket_statistics_summary + + + Whether to collect socket statistics summary for each container. + This flag is used for the 'network/port_mapping' isolator. + (default: false) + + + + + --[no-]network_enable_socket_statistics_details - Whether to collect socket statistics (e.g., TCP RTT) for - each container. (default: false) + Whether to collect socket statistics details (e.g., TCP RTT) for + each container. This flag is used for the 'network/port_mapping' + isolator. (default: false) Modified: mesos/site/source/documentation/latest/logging-and-debugging.md URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest/logging-and-debugging.md?rev=1690830&r1=1690829&r2=1690830&view=diff ============================================================================== --- mesos/site/source/documentation/latest/logging-and-debugging.md (original) +++ mesos/site/source/documentation/latest/logging-and-debugging.md Mon Jul 13 22:09:25 2015 @@ -4,6 +4,6 @@ layout: documentation # Logging and Debugging -Mesos uses the [Google Logging library](/documentation/latest/http://code.google.com/p/google-glog) and writes logs to `MESOS_HOME/logs` by default, where `MESOS_HOME` is the location where Mesos is installed. The log directory can be [configured](configuration/) using the `log_dir` parameter. +Mesos uses the [Google Logging library](/documentation/latest/https://github.com/google/glog) and writes logs to `MESOS_HOME/logs` by default, where `MESOS_HOME` is the location where Mesos is installed. The log directory can be [configured](configuration/) using the `log_dir` parameter. Frameworks that run on Mesos have their output stored to a "work" directory on each machine. By default, this is `MESOS_HOME/work`. 
Within this directory, a framework's output is placed in files called `stdout` and `stderr` in a directory of the form `slave-X/fw-Y/Z`, where X is the slave ID, Y is the framework ID, and multiple subdirectories Z are created for each attempt to run an executor for the framework. These files can also be accessed via the web UI of the slave daemon. \ No newline at end of file Modified: mesos/site/source/documentation/latest/mesos-c++-style-guide.md URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest/mesos-c%2B%2B-style-guide.md?rev=1690830&r1=1690829&r2=1690830&view=diff ============================================================================== --- mesos/site/source/documentation/latest/mesos-c++-style-guide.md (original) +++ mesos/site/source/documentation/latest/mesos-c++-style-guide.md Mon Jul 13 22:09:25 2015 @@ -59,6 +59,9 @@ void Slave::statusUpdate(StatusUpdate up * Access modifiers are not indented (Google uses one space indentation). * Constructor initializers are indented by 2 spaces (Google indents by 4). +### Templates +* Leave one space after the `template` keyword, e.g. `template <typename T>` rather than `template<typename T>`. + ### Function Definition/Invocation * Newline when calling or defining a function: indent with 4 spaces. * We do not follow Google's style of wrapping on the open parenthesis, the general goal is to reduce visual "jaggedness" in the code. 
Prefer (1), (4), (5), sometimes (3), never (2): Modified: mesos/site/source/documentation/latest/mesos-doxygen-style-guide.md URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest/mesos-doxygen-style-guide.md?rev=1690830&r1=1690829&r2=1690830&view=diff ============================================================================== --- mesos/site/source/documentation/latest/mesos-doxygen-style-guide.md (original) +++ mesos/site/source/documentation/latest/mesos-doxygen-style-guide.md Mon Jul 13 22:09:25 2015 @@ -54,10 +54,18 @@ This is the allowed set of doxygen tags * [\@param](http://doxygen.org/manual/commands.html#cmdparam) Describes function parameters. * [\@return](http://doxygen.org/manual/commands.html#cmdreturn) Describes return values. * [\@see](http://doxygen.org/manual/commands.html#cmdsa) Describes a cross-reference to classes, functions, methods, variables, files or URL. + +Example: + + /** + * Available kinds of implementations. + * + * @see process::network::PollSocketImpl + */ + * [\@file](http://doxygen.org/manual/commands.html#cmdfile) Describes a refence to a file. It is required when documenting global functions, variables, typedefs, or enums in separate files. * [\@link](http://doxygen.org/manual/commands.html#cmdlink) and [\@endlink](http://doxygen.org/manual/commands.html#cmdendlink) Describes a link to a file, class, or member. * [\@example](http://doxygen.org/manual/commands.html#cmdexample) Describes source code examples. - * [\@todo](http://doxygen.org/manual/commands.html#cmdtodo) Describes a TODO item. * [\@image](http://doxygen.org/manual/commands.html#cmdimage) Describes an image. * When following these links be aware that the doxygen documentation is using another syntax in that \@param is explained as \\param. @@ -90,7 +98,7 @@ Example: /** * The parent side of the pipe for stdin. * If the mode is not PIPE, None will be stored. - * @note: stdin is a macro on some systems, hence this name instead. 
+ * **NOTE**: stdin is a macro on some systems, hence this name instead. */ Option in; @@ -162,7 +170,7 @@ stout, libprocess, master, slave, contai should have an overview page in markdown format that explains their purpose, overall structure, and general use. This can even be a complete developer guide. -This page must be located in the top directory of the library/component and named "REAMDE.md". +This page must be located in the top directory of the library/component and named "README.md". The first line in such a document must be a section heading bearing the title which will appear in the generated Doxygen index. Example: "# Libprocess Developer Guide" Modified: mesos/site/source/documentation/latest/network-monitoring.md URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest/network-monitoring.md?rev=1690830&r1=1690829&r2=1690830&view=diff ============================================================================== --- mesos/site/source/documentation/latest/network-monitoring.md (original) +++ mesos/site/source/documentation/latest/network-monitoring.md Mon Jul 13 22:09:25 2015 @@ -2,129 +2,359 @@ layout: documentation --- -# Network Monitoring +# Per-container Network Monitoring and Isolation -Mesos 0.20.0 adds the support for per container network monitoring. Network statistics for each active container can be retrieved through the `/monitor/statistics.json` endpoint on the slave. - -The current solution is completely transparent to the tasks running on the slave. In other words, tasks will not notice any difference as if they were running on a slave without network monitoring turned on and were sharing the network of the slave. - -## How to setup? - -To turn on network monitoring on your mesos cluster, you need to follow the following procedures. +Mesos on Linux provides support for per-container network monitoring and +isolation. 
The network isolation prevents a single container from exhausting the +available network ports, consuming an unfair share of the network bandwidth or +significantly delaying packet transmission for others. Network statistics for +each active container are published through the `/monitor/statistics.json` +endpoint on the slave. The network isolation is transparent for the majority of +tasks running on a slave (those that bind to port 0 and let the kernel allocate +their port). + +## Installation + +Per-container network monitoring and isolation is __not__ supported by default. +To enable it you need to install additional dependencies and configure it during +the build process. ### Prerequisites -Currently, network monitoring is only supported on Linux. Make sure your kernel is at least 3.6. Also, check your kernel to make sure that the following upstream patches are merged in (Mesos will automatically check for those kernel functionalities and will abort if they are not supported): +Per-container network monitoring and isolation is only supported on Linux kernel +versions 3.6 and above. Additionally, the kernel must include these patches +(merged in kernel version 3.15). 
* [6a662719c9868b3d6c7d26b3a085f0cd3cc15e64](https://github.com/torvalds/linux/commit/6a662719c9868b3d6c7d26b3a085f0cd3cc15e64) * [0d5edc68739f1c1e0519acbea1d3f0c1882a15d7](https://github.com/torvalds/linux/commit/0d5edc68739f1c1e0519acbea1d3f0c1882a15d7) * [e374c618b1465f0292047a9f4c244bd71ab5f1f0](https://github.com/torvalds/linux/commit/e374c618b1465f0292047a9f4c244bd71ab5f1f0) * [25f929fbff0d1bcebf2e92656d33025cd330cbf8](https://github.com/torvalds/linux/commit/25f929fbff0d1bcebf2e92656d33025cd330cbf8) -Make sure the following packages are installed on the slave: +The following packages are required on the slave: * [libnl3](http://www.infradead.org/~tgr/libnl/) >= 3.2.26 -* [iproute](http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2) (>= 2.6.39 is advised but not required for debugging purpose) +* [iproute](http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2) >= 2.6.39 is advised for debugging purposes but not required. -On the build machine, you need to install the following packages: +Additionally, if you are building from source, you will also need the +libnl3 development package to compile Mesos: -* [libnl3-devel](http://www.infradead.org/~tgr/libnl/) >= 3.2.26 +* [libnl3-devel / libnl3-dev](http://www.infradead.org/~tgr/libnl/) >= 3.2.26 -### Configure and build +### Build -Network monitoring will NOT be built in by default. To build Mesos with network monitoring support, you need to add a configure option: +To build Mesos with per-container network monitoring and isolation support, you +need to add a configure option: $ ./configure --with-network-isolator $ make +## Configuration + +Per-container network monitoring and isolation is enabled on the slave by adding +`network/port_mapping` to the slave command line `--isolation` flag. 
+ + --isolation="network/port_mapping" + +If the slave has not been compiled with per-container network monitoring and +isolation support, it will refuse to start and print an error: + + I0708 00:17:08.080271 44267 containerizer.cpp:111] Using isolation: network/port_mapping + Failed to create a containerizer: Could not create MesosContainerizer: Unknown or unsupported + isolator: network/port_mapping + +## Configuring network ports + +Without network isolation, all the containers on a host share the public IP +address of the slave and can bind to any port allowed by the OS. + +When network isolation is enabled, each container on the slave has a separate +network stack (via Linux [network namespaces](http://lwn.net/Articles/580893/)). +All containers still share the same public IP of the slave (so that the service +discovery mechanism does not need to be changed). The slave assigns each +container a non-overlapping range of the ports and only packets to/from these +assigned port ranges will be delivered. Applications requesting the kernel +assign a port (by binding to port 0) will be given ports from the container +assigned range. Applications can bind to ports outside the container assigned +ranges but packets to/from these ports will be silently dropped by the +host. + +Mesos provides two ranges of ports to containers: + ++ OS allocated "[ephemeral](https://en.wikipedia.org/wiki/Ephemeral_port)" ports +are assigned by the OS in a range specified for each container by Mesos. + ++ Mesos allocated "non-ephemeral" ports are acquired by a framework using the +same Mesos resource offer mechanism used for cpu, memory etc. for allocation to +executors/tasks as required. + +Additionally, the host itself will require ephemeral ports for network +communication. You need to configure these three __non-overlapping__ port ranges +on the host. 
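The non-overlap requirement can be sanity-checked before starting a slave. The Python sketch below is only an illustration: the function `overlapping_pairs` and the sample ranges are assumptions made for this example, not something Mesos provides or mandates.

```python
from itertools import combinations


def overlapping_pairs(ranges):
    """Return the names of every pair of inclusive port ranges that overlap."""
    clashes = []
    for (name_a, (lo_a, hi_a)), (name_b, (lo_b, hi_b)) in combinations(ranges.items(), 2):
        # Two inclusive ranges overlap when each starts before the other ends.
        if lo_a <= hi_b and lo_b <= hi_a:
            clashes.append((name_a, name_b))
    return clashes


# Illustrative values only (assumed for this sketch).
port_ranges = {
    "host_ephemeral": (57345, 61000),
    "container_ephemeral": (32768, 57344),
    "non_ephemeral": (31000, 32000),
}

# An empty list means the three ranges are disjoint.
print(overlapping_pairs(port_ranges))
```

Any pair reported by the check points to a range that needs to be narrowed before the slave is configured.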
+ +### Host ephemeral port range + +The currently configured host ephemeral port range can be discovered at any time +using the command `sysctl net.ipv4.ip_local_port_range`. If ports need to be set +aside for slave containers, the ephemeral port range can be updated in +`/etc/sysctl.conf`. Rebooting after the update will apply the change and +eliminate the possibility that ports are already in use by other processes. For +example, by adding the following: + + # net.ipv4.ip_local_port_range defines the host ephemeral port range, by + # default 32768-61000. We reduce this range to allow the Mesos slave to + # allocate ports 32768-57344 + # net.ipv4.ip_local_port_range = 32768 61000 + net.ipv4.ip_local_port_range = 57345 61000 + +### Container port ranges -### Host ephemeral ports squeeze +The container ephemeral and non-ephemeral port ranges are configured using the +slave `--resources` flag. The non-ephemeral port range is provided to the +master, which will then offer it to frameworks for allocation. -With network monitoring being turned on, each container on the slave will have a separate network stack (via Linux [network namespaces](http://lwn.net/Articles/580893/)). All containers share the same public IP of the slave (so that service discovery mechanism does not need to be changed). Each container will be assigned a subset of the ports from the host, and is only allowed to use those ports to make connections with other hosts. +The ephemeral port range is sub-divided by the slave, giving +`ephemeral_ports_per_container` (default 1024) to each container. The maximum +number of containers on the slave will therefore be limited to approximately: -For non-ephemeral ports (e.g, listening ports), Mesos already exposes that to the scheduler (resource: 'ports'). The scheduler is responsible for allocating those ports to executors/tasks. 
+ number of ephemeral_ports / ephemeral_ports_per_container -For ephemeral ports, without network monitoring, all executors/tasks running on the slave share the same ephemeral port range of the host. The default ephemeral port range on most Linux distributions is [32768, 61000]. With network monitoring, for each container, we need to reserve a range for ports on the host which will be used as the ephemeral port range for the container network stack (these ports are directly mapped into the container). We need to ensure none of the host processes are using those ports. Because of that, you may want to squeeze the host ephemeral port range in order to support more containers on each slave. To do that, you can use the following command (need root permission). A host reboot is required to ensure there are no connections using ports outside the new ephemeral range. +The master `--max_executors_per_slave` flag can be used to prevent allocation of +more executors on a slave when the ephemeral port range has been exhausted. - # This sets the host ephemeral port range to [57345, 61000]. - $ echo "57345 61000" > /proc/sys/net/ipv4/ip_local_port_range +It is recommended (but not required) that `ephemeral_ports_per_container` be set +to a power of 2 (e.g., 512, 1024) and the lower bound of the ephemeral port +range be a multiple of `ephemeral_ports_per_container` to minimize CPU overhead +in packet processing. For example: + --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] \ + --ephemeral_ports_per_container=512 -### Turn on network monitoring +### Rate limiting container traffic -After the host ephemeral ports squeeze and reboot, you can turn on network monitoring by appending `network/port_mapping` to the isolation flag. Notice that you need specify the `ephemeral_ports` resource (via --resources flag). It tells the slave which ports on the host are reserved for containers. It must NOT overlap with the host ephemeral port range. 
You can also specify how many ephemeral ports you want to allocate to each container. It is recommended but not required that this number is power of 2 aligned (e.g., 512, 1024). If not, there will be some performance impact for classifying packets. The maximum number of containers on the slave will be limited by approximately |ephemeral_ports|/ephemeral_ports_per_container, subject to alignment etc. +Outbound traffic from a container to the network can be rate limited to prevent +a single container from consuming all available network resources with +detrimental effects to the other containers on the host. The +`--egress_rate_limit_per_container` flag specifies that each container launched +on the host be limited to the specified bandwidth (in bytes per second). +Network traffic which would cause this limit to be exceeded is delayed for later +transmission. The TCP protocol will adjust to the increased latency and reduce +the transmission rate ensuring no packets need be dropped. + + --egress_rate_limit_per_container=100MB + +We do not rate limit inbound traffic since we can only modify the network flows +after they have been received by the host and any congestion has already +occurred. + +### Egress traffic isolation + +Delaying network data for later transmission can increase latency and jitter +(variability) for all traffic on the interface. Mesos can reduce the impact on +other containers on the same host by using flow classification and isolation +using the containers port ranges to maintain unique flows for each container and +sending traffic from these flows fairly (using the +[FQ_Codel](https://tools.ietf.org/html/draft-hoeiland-joergensen-aqm-fq-codel-00) +algorithm). Use the `--egress_unique_flow_per_container` flag to enable. 
+ + --egress_unique_flow_per_container + +### Putting it all together + +A complete slave command line enabling network isolation, reserving ports +57345-61000 for host ephemeral ports, 32768-57344 for container ephemeral ports, +31000-32000 for non-ephemeral ports allocated by the framework, limiting +container transmit bandwidth to 300 Mbits/second (37.5MBytes) with unique flows +enabled would thus be: mesos-slave \ - --checkpoint \ - --log_dir=/var/log/mesos \ - --work_dir=/var/lib/mesos \ - --isolation=cgroups/cpu,cgroups/mem,network/port_mapping \ - --resources=cpus:22;mem:62189;ports:[31000-32000];disk:400000;ephemeral_ports:[32768-57344] \ - --ephemeral_ports_per_container=1024 - - -## How to get statistics? - -Currently, we report the following network statistics: - -* _net_rx_bytes_ -* _net_rx_dropped_ -* _net_rx_errors_ -* _net_rx_packets_ -* _net_tx_bytes_ -* _net_tx_dropped_ -* _net_tx_errors_ -* _net_tx_packets_ + --isolation=network/port_mapping \ + --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] \ + --ephemeral_ports_per_container=1024 \ + --egress_rate_limit_per_container=37500KB \ + --egress_unique_flow_per_container + +## Monitoring container network statistics + +Mesos exposes statistics from the Linux network stack for each container network +on the `/monitor/statistics.json` slave endpoint. + +From the network interface inside the container, we report the following +counters (since container creation) under the `statistics` key: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Metric | Description | Type |
|--------|-------------|------|
| `net_rx_bytes` | Received bytes | Counter |
| `net_rx_dropped` | Packets dropped on receive | Counter |
| `net_rx_errors` | Errors reported on receive | Counter |
| `net_rx_packets` | Packets received | Counter |
| `net_tx_bytes` | Sent bytes | Counter |
| `net_tx_dropped` | Packets dropped on send | Counter |
| `net_tx_errors` | Errors reported on send | Counter |
| `net_tx_packets` | Packets sent | Counter |
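
Because the counters listed above are cumulative since container creation, a rate has to be derived from two samples. The following is a minimal sketch of that calculation, not part of Mesos: the function name `counter_rates` and the sample values are hypothetical, and it assumes each sample is the `statistics` object for one executor, including its `timestamp` field.

```python
# Counter names as documented for the `statistics` key.
NET_COUNTERS = (
    "net_rx_bytes", "net_rx_dropped", "net_rx_errors", "net_rx_packets",
    "net_tx_bytes", "net_tx_dropped", "net_tx_errors", "net_tx_packets",
)


def counter_rates(earlier, later):
    """Return the per-second rate of each net_* counter between two samples."""
    elapsed = later["timestamp"] - earlier["timestamp"]
    if elapsed <= 0:
        raise ValueError("samples must be in increasing timestamp order")
    # Only counters present in both samples are reported.
    return {
        name: (later[name] - earlier[name]) / elapsed
        for name in NET_COUNTERS
        if name in earlier and name in later
    }


# Hypothetical samples taken 10 seconds apart (values invented for illustration).
first = {"timestamp": 100.0, "net_tx_bytes": 1000, "net_tx_packets": 10}
second = {"timestamp": 110.0, "net_tx_bytes": 6000, "net_tx_packets": 60}
print(counter_rates(first, second))
```

In practice the two samples would come from successive requests to the `/monitor/statistics.json` endpoint for the same executor.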
+
+Additionally, [Linux Traffic Control](
+http://tldp.org/HOWTO/Traffic-Control-HOWTO/intro.html) can report the following
+statistics for the elements which implement bandwidth limiting and bloat
+reduction under the `statistics/net_traffic_control_statistics` key. The entry
+for each of these elements includes:
+
| Metric | Description | Type |
|--------|-------------|------|
| `backlog` | Bytes queued for transmission [1] | Gauge |
| `bytes` | Sent bytes | Counter |
| `drops` | Packets dropped on send | Counter |
| `overlimits` | Count of times the interface was over its transmit limit when it attempted to send a packet. Since the normal action when the network is overlimit is to delay the packet, the overlimit counter can be incremented many times for each packet sent on a heavily congested interface. [2] | Counter |
| `packets` | Packets sent | Counter |
| `qlen` | Packets queued for transmission | Gauge |
| `ratebps` | Transmit rate in bytes/second [3] | Gauge |
| `ratepps` | Transmit rate in packets/second [3] | Gauge |
| `requeues` | Packets failed to send due to resource contention (such as kernel locking) [3] | Counter |

+
+[1] Backlog is only reported on the bloat_reduction interface
+
+[2] Overlimits are only reported on the bw_limit interface
+
+[3] Currently always reported as 0 by the underlying Traffic Control element.

 For example, these are the statistics you will get by hitting the `/monitor/statistics.json` endpoint on a slave with network monitoring turned on:

-    $ curl -s http://localhost:5051/monitor/statistics.json | python2.6
-    -mjson.tool
+    $ curl -s http://localhost:5051/monitor/statistics.json | python2.6 -mjson.tool
     [
         {
-            "executor_id": "sample_executor_id-ebd8fa62-757d-489e-9e23-678a21d078d6",
-            "executor_name": "sample_executor",
-            "framework_id": "201103282247-0000000019-0000",
-            "source": "sample_executor",
+            "executor_id": "job.1436298853",
+            "executor_name": "Command Executor (Task: job.1436298853) (Command: sh -c 'iperf ....')",
+            "framework_id": "20150707-195256-1740121354-5150-29801-0000",
+            "source": "job.1436298853",
             "statistics": {
-                "cpus_limit": 0.35,
-                "cpus_nr_periods": 520883,
-                "cpus_nr_throttled": 2163,
-                "cpus_system_time_secs": 154.42,
-                "cpus_throttled_time_secs": 145.96,
-                "cpus_user_time_secs": 258.74,
-                "mem_anon_bytes": 109137920,
-                "mem_file_bytes": 30613504,
+                "cpus_limit": 1.1,
+                "cpus_nr_periods": 16314,
+                "cpus_nr_throttled": 16313,
+                "cpus_system_time_secs": 2667.06,
+                "cpus_throttled_time_secs": 8036.840845388,
+                "cpus_user_time_secs": 123.49,
+                "mem_anon_bytes": 8388608,
+                "mem_cache_bytes": 16384,
+                "mem_critical_pressure_counter": 0,
+                "mem_file_bytes": 16384,
                 "mem_limit_bytes": 167772160,
-                "mem_mapped_file_bytes": 8192,
-                "mem_rss_bytes": 140341248,
-                "net_rx_bytes": 2402099,
+                "mem_low_pressure_counter": 0,
+                "mem_mapped_file_bytes": 0,
+                "mem_medium_pressure_counter": 0,
+                "mem_rss_bytes": 8388608,
+                "mem_total_bytes": 9945088,
+                "net_rx_bytes": 10847,
                 "net_rx_dropped": 0,
                 "net_rx_errors": 0,
-                "net_rx_packets": 33273,
-                "net_tx_bytes": 1507798,
+                "net_rx_packets": 143,
+                "net_traffic_control_statistics": [
+                    {
+                        "backlog": 0,
+                        "bytes": 163206809152,
+                        "drops": 77147,
+                        "id": "bw_limit",
+                        "overlimits": 210693719,
+                        "packets": 107941027,
+                        "qlen": 10236,
+                        "ratebps": 0,
+                        "ratepps": 0,
+                        "requeues": 0
+                    },
+                    {
+                        "backlog": 15481368,
+                        "bytes": 163206874168,
+                        "drops": 27081494,
+                        "id": "bloat_reduction",
+                        "overlimits": 0,
+                        "packets": 107941070,
+                        "qlen": 10239,
+                        "ratebps": 0,
+                        "ratepps": 0,
+                        "requeues": 0
+                    }
+                ],
+                "net_tx_bytes": 163200529816,
                 "net_tx_dropped": 0,
                 "net_tx_errors": 0,
-                "net_tx_packets": 17726,
-                "timestamp": 1408043826.91626
+                "net_tx_packets": 107936874,
+                "perf": {
+                    "duration": 0,
+                    "timestamp": 1436298855.82807
+                },
+                "timestamp": 1436300487.41595
             }
         }
     ]
-
-
-# Network Egress Rate Limit
-
-Mesos 0.21.0 adds an optional feature to limit the egress network bandwidth for each container. With this feature enabled, each container's egress traffic is limited to the specified rate. This can prevent a single container from dominating the entire network.
-
-## How to enable it?
-
-Egress Rate Limit requires Network Monitoring. To enable it, please follow all the steps in the [previous section](#Network_Monitoring) to enable the Network Monitoring first, and then use the newly introduced `egress_rate_limit_per_container` flag to specify the rate limit for each container. Note that this flag expects a `Bytes` type like the following:
-
-    mesos-slave \
-        --checkpoint \
-        --log_dir=/var/log/mesos \
-        --work_dir=/var/lib/mesos \
-        --isolation=cgroups/cpu,cgroups/mem,network/port_mapping \
-        --resources=cpus:22;mem:62189;ports:[31000-32000];disk:400000;ephemeral_ports:[32768-57344] \
-        --ephemeral_ports_per_container=1024 \
-        --egress_rate_limit_per_container=37500KB # Convert to ~300Mbits/s.
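
The conversion from 300 Mbits/second to the `37500KB` flag value used in the command lines above can be checked with a short calculation. This is only a sketch: the helper name `egress_limit_flag` is hypothetical, and it assumes decimal units (1 Mbit = 1,000,000 bits, 1 KB = 1,000 bytes), which is what the documented arithmetic implies. This text does not say whether Mesos parses `KB` as 1000 or 1024 bytes; with 1024, `37500KB` would be about 307 Mbits/second, still consistent with the "~300Mbits/s" comment.

```python
def egress_limit_flag(mbits_per_second):
    """Format an --egress_rate_limit_per_container value for a rate in Mbit/s.

    Assumes decimal units; see the note on KB above.
    """
    bytes_per_second = mbits_per_second * 1000000 / 8  # bits -> bytes
    kilobytes = int(round(bytes_per_second / 1000))    # bytes -> KB (decimal)
    return "--egress_rate_limit_per_container={}KB".format(kilobytes)


print(egress_limit_flag(300))
```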
Modified: mesos/site/source/documentation/latest/slave-recovery.md
URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest/slave-recovery.md?rev=1690830&r1=1690829&r2=1690830&view=diff
==============================================================================
--- mesos/site/source/documentation/latest/slave-recovery.md (original)
+++ mesos/site/source/documentation/latest/slave-recovery.md Mon Jul 13 22:09:25 2015
@@ -63,6 +63,21 @@ As part of this feature, `FrameworkInfo`
 > NOTE: Frameworks that have enabled checkpointing will only get offers from checkpointing slaves. So, before setting `checkpoint=True` on FrameworkInfo, ensure that there are slaves in your cluster that have enabled checkpointing.
 > Because, if there are no checkpointing slaves, the framework would not get any offers and hence cannot launch any tasks/executors!

+## Known issues with `systemd` and POSIX isolation
+
+There is a known issue when using `systemd` to launch the `mesos-slave` while also using only `posix` isolation mechanisms that prevents tasks from recovering. The problem is that the default [KillMode](http://www.freedesktop.org/software/systemd/man/systemd.kill.html) for systemd processes is `cgroup` and hence all child processes are killed when the slave stops. Explicitly setting `KillMode` to `process` allows the executors to survive and reconnect.
+
+The following excerpt of a `systemd` unit configuration file shows how to set the flag:
+
+```
+[Service]
+ExecStart=/usr/bin/mesos-slave
+KillMode=process
+```
+
+> NOTE: There are also known issues with using `systemd` and raw `cgroups` based isolation; for now, the suggested non-POSIX isolation mechanism is Docker containerization.
+
+
+
 ## Upgrading to 0.14.0

 If you want to upgrade a running Mesos cluster to 0.14.0 to take advantage of slave recovery please follow the [upgrade instructions](/documentation/latest/upgrades/).
Modified: mesos/site/source/documentation/latest/tools.md
URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest/tools.md?rev=1690830&r1=1690829&r2=1690830&view=diff
==============================================================================
--- mesos/site/source/documentation/latest/tools.md (original)
+++ mesos/site/source/documentation/latest/tools.md Mon Jul 13 22:09:25 2015
@@ -10,7 +10,6 @@ These tools make it easy to set up and r

 * [collectd plugin](https://github.com/rayrod2030/collectd-mesos) to collect Mesos cluster metrics.
 * [Deploy scripts](/documentation/latest/deploy-scripts/) for launching a Mesos cluster on a set of machines.
-* [EC2 scripts](/documentation/latest/ec2-scripts/) for launching a Mesos cluster on Amazon EC2.
 * [Chef cookbook by Everpeace](https://github.com/everpeace/cookbook-mesos) Install Mesos and configure master and slave. This cookbook supports installation from source or the Mesosphere packages.
 * [Chef cookbook by Mdsol](https://github.com/mdsol/mesos_cookbook) Application cookbook for installing the Apache Mesos cluster manager. This cookbook installs Mesos via packages provided by Mesosphere.
 * [Puppet Module by Deric](https://github.com/deric/puppet-mesos) This is a Puppet module for managing Mesos nodes in a cluster.

Modified: mesos/site/source/documentation/latest/upgrades.md
URL: http://svn.apache.org/viewvc/mesos/site/source/documentation/latest/upgrades.md?rev=1690830&r1=1690829&r2=1690830&view=diff
==============================================================================
--- mesos/site/source/documentation/latest/upgrades.md (original)
+++ mesos/site/source/documentation/latest/upgrades.md Mon Jul 13 22:09:25 2015
@@ -16,6 +16,16 @@ This document serves as a guide for user

 **NOTE** The Resource protobuf has been extended to include more metadata for supporting persistence (DiskInfo), dynamic reservations (ReservationInfo) and oversubscription (RevocableInfo).
 You must not combine two Resource objects if they have different metadata.

+In order to upgrade a running cluster:
+
+* Rebuild and install any modules so that upgraded masters/slaves can use them.
+* Install the new master binaries and restart the masters.
+* Install the new slave binaries and restart the slaves.
+* Upgrade the schedulers by linking the latest native library / jar / egg (if necessary).
+* Restart the schedulers.
+* Upgrade the executors by linking the latest native library / jar / egg (if necessary).
+
+
 ## Upgrading from 0.21.x to 0.22.x

 **NOTE** Slave checkpoint flag has been removed as it will be enabled for all