From: zjshen@apache.org
To: common-commits@hadoop.apache.org
Date: Tue, 03 Mar 2015 19:31:49 -0000
Subject: [11/43] hadoop git commit: YARN-3168.
Convert site documentation from apt to markdown (Gururaj Shetty via aw) http://git-wip-us.apache.org/repos/asf/hadoop/blob/2e44b75f/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRest.apt.vm ---------------------------------------------------------------------- diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRest.apt.vm b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRest.apt.vm deleted file mode 100644 index 36b8621..0000000 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRest.apt.vm +++ /dev/null @@ -1,645 +0,0 @@ -~~ Licensed under the Apache License, Version 2.0 (the "License"); -~~ you may not use this file except in compliance with the License. -~~ You may obtain a copy of the License at -~~ -~~ http://www.apache.org/licenses/LICENSE-2.0 -~~ -~~ Unless required by applicable law or agreed to in writing, software -~~ distributed under the License is distributed on an "AS IS" BASIS, -~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -~~ See the License for the specific language governing permissions and -~~ limitations under the License. See accompanying LICENSE file. - - --- - NodeManager REST APIs - --- - --- - ${maven.build.timestamp} - -NodeManager REST APIs - -%{toc|section=1|fromDepth=0|toDepth=2} - -* Overview - - The NodeManager REST APIs allow the user to get the status of the node and information about applications and containers running on that node. - -* NodeManager Information API - - The node information resource provides overall information about that particular node. - -** URI - - Both of the following URIs return the node information.
- ------- - * http:///ws/v1/node - * http:///ws/v1/node/info ------- - -** HTTP Operations Supported - ------- - * GET ------- - -** Query Parameters Supported - ------- - None ------- - -** Elements of the object - -*---------------+--------------+-------------------------------+ -|| Item || Data Type || Description | -*---------------+--------------+-------------------------------+ -| id | string | The NodeManager id | -*---------------+--------------+-------------------------------+ -| nodeHostName | string | The host name of the NodeManager | -*---------------+--------------+-------------------------------+ -| totalPmemAllocatedContainersMB | long | The amount of physical memory allocated for use by containers in MB | -*---------------+--------------+-------------------------------+ -| totalVmemAllocatedContainersMB | long | The amount of virtual memory allocated for use by containers in MB | -*---------------+--------------+-------------------------------+ -| totalVCoresAllocatedContainers | long | The number of virtual cores allocated for use by containers | -*---------------+--------------+-------------------------------+ -| lastNodeUpdateTime | long | The last timestamp at which the health report was received (in ms since epoch) | -*---------------+--------------+-------------------------------+ -| healthReport | string | The diagnostic health report of the node | -*---------------+--------------+-------------------------------+ -| nodeHealthy | boolean | true/false indicator of whether the node is healthy | -*---------------+--------------+-------------------------------+ -| nodeManagerVersion | string | Version of the NodeManager | -*---------------+--------------+-------------------------------+ -| nodeManagerBuildVersion | string | NodeManager build string with build version, user, and checksum | -*---------------+--------------+-------------------------------+ -| nodeManagerVersionBuiltOn | string | Timestamp when the NodeManager was built |
-*---------------+--------------+-------------------------------+ -| hadoopVersion | string | Version of hadoop common | -*---------------+--------------+-------------------------------+ -| hadoopBuildVersion | string | Hadoop common build string with build version, user, and checksum | -*---------------+--------------+-------------------------------+ -| hadoopVersionBuiltOn | string | Timestamp when hadoop common was built | -*---------------+--------------+-------------------------------+ - -** Response Examples - - <<JSON response>> - - HTTP Request: - ------- - GET http:///ws/v1/node/info ------- - - Response Header: - -+---+ - HTTP/1.1 200 OK - Content-Type: application/json - Transfer-Encoding: chunked - Server: Jetty(6.1.26) -+---+ - - Response Body: -
-+---+
-{
-  "nodeInfo" : {
-    "hadoopVersionBuiltOn" : "Mon Jan 9 14:58:42 UTC 2012",
-    "nodeManagerBuildVersion" : "0.23.1-SNAPSHOT from 1228355 by user1 source checksum 20647f76c36430e888cc7204826a445c",
-    "lastNodeUpdateTime" : 1326222266126,
-    "totalVmemAllocatedContainersMB" : 17203,
-    "totalVCoresAllocatedContainers" : 8,
-    "nodeHealthy" : true,
-    "healthReport" : "",
-    "totalPmemAllocatedContainersMB" : 8192,
-    "nodeManagerVersionBuiltOn" : "Mon Jan 9 15:01:59 UTC 2012",
-    "nodeManagerVersion" : "0.23.1-SNAPSHOT",
-    "id" : "host.domain.com:8041",
-    "hadoopBuildVersion" : "0.23.1-SNAPSHOT from 1228292 by user1 source checksum 3eba233f2248a089e9b28841a784dd00",
-    "nodeHostName" : "host.domain.com",
-    "hadoopVersion" : "0.23.1-SNAPSHOT"
-  }
-}
-+---+
- - <<XML response>> - - HTTP Request: - ------ - Accept: application/xml - GET http:///ws/v1/node/info ------ - - Response Header: - -+---+ - HTTP/1.1 200 OK - Content-Type: application/xml - Content-Length: 983 - Server: Jetty(6.1.26) -+---+ - - Response Body: -
-+---+
-<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
-<nodeInfo>
-  <healthReport/>
-  <totalVmemAllocatedContainersMB>17203</totalVmemAllocatedContainersMB>
-  <totalPmemAllocatedContainersMB>8192</totalPmemAllocatedContainersMB>
-  <totalVCoresAllocatedContainers>8</totalVCoresAllocatedContainers>
-  <lastNodeUpdateTime>1326222386134</lastNodeUpdateTime>
-  <nodeHealthy>true</nodeHealthy>
-  <nodeManagerVersion>0.23.1-SNAPSHOT</nodeManagerVersion>
-  <nodeManagerBuildVersion>0.23.1-SNAPSHOT from 1228355 by user1 source checksum 20647f76c36430e888cc7204826a445c</nodeManagerBuildVersion>
-  <nodeManagerVersionBuiltOn>Mon Jan 9 15:01:59 UTC 2012</nodeManagerVersionBuiltOn>
-  <hadoopVersion>0.23.1-SNAPSHOT</hadoopVersion>
-  <hadoopBuildVersion>0.23.1-SNAPSHOT from 1228292 by user1 source checksum 3eba233f2248a089e9b28841a784dd00</hadoopBuildVersion>
-  <hadoopVersionBuiltOn>Mon Jan 9 14:58:42 UTC 2012</hadoopVersionBuiltOn>
-  <id>host.domain.com:8041</id>
-  <nodeHostName>host.domain.com</nodeHostName>
-</nodeInfo>
-+---+
- -* Applications API - - With the Applications API, you can obtain a collection of resources, each of which represents an application. When you run a GET operation on this resource, you obtain a collection of Application Objects. See also {{Application API}} for syntax of the application object. - -** URI - ------- - * http:///ws/v1/node/apps ------- - -** HTTP Operations Supported - ------- - * GET ------- - -** Query Parameters Supported - - Multiple parameters can be specified. - ------- - * state - application state - * user - user name ------- - -** Elements of the (Applications) object - - When you make a request for the list of applications, the information will be returned as a collection of app objects. - See also {{Application API}} for syntax of the app object. - -*---------------+--------------+-------------------------------+ -|| Item || Data Type || Description | -*---------------+--------------+-------------------------------+ -| app | array of app objects(JSON)/zero or more app objects(XML) | A collection of application objects | -*---------------+--------------+--------------------------------+ - -** Response Examples - - <<JSON response>> - - HTTP Request: - ------- - GET http:///ws/v1/node/apps ------- - - Response Header: - -+---+ - HTTP/1.1 200 OK - Content-Type: application/json - Transfer-Encoding: chunked - Server: Jetty(6.1.26) -+---+ - - Response Body: -
-+---+
-{
-  "apps" : {
-    "app" : [
-      {
-        "containerids" : [
-          "container_1326121700862_0003_01_000001",
-          "container_1326121700862_0003_01_000002"
-        ],
-        "user" : "user1",
-        "id" : "application_1326121700862_0003",
-        "state" : "RUNNING"
-      },
-      {
-        "user" : "user1",
-        "id" : "application_1326121700862_0002",
-        "state" : "FINISHED"
-      }
-    ]
-  }
-}
-+---+
- - <<XML response>> - - HTTP Request: - ------- - GET http:///ws/v1/node/apps - Accept:
application/xml ------- - - Response Header: - -+---+ - HTTP/1.1 200 OK - Content-Type: application/xml - Content-Length: 400 - Server: Jetty(6.1.26) -+---+ - - Response Body: -
-+---+
-<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
-<apps>
-  <app>
-    <id>application_1326121700862_0002</id>
-    <state>FINISHED</state>
-    <user>user1</user>
-  </app>
-  <app>
-    <id>application_1326121700862_0003</id>
-    <state>RUNNING</state>
-    <user>user1</user>
-    <containerids>container_1326121700862_0003_01_000002</containerids>
-    <containerids>container_1326121700862_0003_01_000001</containerids>
-  </app>
-</apps>
-+---+
- -* {Application API} - - An application resource contains information about a particular application that was run or is running on this NodeManager. - -** URI - - Use the following URI to obtain an app Object, for an application identified by the {appid} value. - ------- - * http:///ws/v1/node/apps/{appid} ------- - -** HTTP Operations Supported - ------- - * GET ------- - -** Query Parameters Supported - ------- - None ------- - -** Elements of the (Application) object - -*---------------+--------------+-------------------------------+ -|| Item || Data Type || Description | -*---------------+--------------+-------------------------------+ -| id | string | The application id | -*---------------+--------------+--------------------------------+ -| user | string | The user who started the application | -*---------------+--------------+--------------------------------+ -| state | string | The state of the application - valid states are: NEW, INITING, RUNNING, FINISHING_CONTAINERS_WAIT, APPLICATION_RESOURCES_CLEANINGUP, FINISHED | -*---------------+--------------+--------------------------------+ -| containerids | array of containerids(JSON)/zero or more containerids(XML) | The list of containerids currently being used by the application on this node.
If not present, then no containers are currently running for this application. | -*---------------+--------------+--------------------------------+ - -** Response Examples - - <<JSON response>> - - HTTP Request: - ------- - GET http:///ws/v1/node/apps/application_1326121700862_0005 ------- - - Response Header: - -+---+ - HTTP/1.1 200 OK - Content-Type: application/json - Transfer-Encoding: chunked - Server: Jetty(6.1.26) -+---+ - - Response Body: -
-+---+
-{
-  "app" : {
-    "containerids" : [
-      "container_1326121700862_0005_01_000003",
-      "container_1326121700862_0005_01_000001"
-    ],
-    "user" : "user1",
-    "id" : "application_1326121700862_0005",
-    "state" : "RUNNING"
-  }
-}
-+---+
- - <<XML response>> - - HTTP Request: - ------- - GET http:///ws/v1/node/apps/application_1326121700862_0005 - Accept: application/xml ------- - - Response Header: - -+---+ - HTTP/1.1 200 OK - Content-Type: application/xml - Content-Length: 281 - Server: Jetty(6.1.26) -+---+ - - Response Body: -
-+---+
-<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
-<app>
-  <id>application_1326121700862_0005</id>
-  <state>RUNNING</state>
-  <user>user1</user>
-  <containerids>container_1326121700862_0005_01_000003</containerids>
-  <containerids>container_1326121700862_0005_01_000001</containerids>
-</app>
-+---+
- - -* Containers API - - With the containers API, you can obtain a collection of resources, each of which represents a container. When you run a GET operation on this resource, you obtain a collection of Container Objects. See also {{Container API}} for syntax of the container object. - -** URI - ------- - * http:///ws/v1/node/containers ------- - -** HTTP Operations Supported - ------- - * GET ------- - -** Query Parameters Supported - ------- - None ------- - -** Elements of the object - - When you make a request for the list of containers, the information will be returned as a collection of container objects. - See also {{Container API}} for syntax of the container object.
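As a sketch of how a client might consume this collection: the snippet below parses a response shaped like the JSON example documented for this endpoint (a `containers` object wrapping a `container` array). The embedded `sample` string and the `running_containers` helper are illustrative, not part of the NodeManager API; a real client would issue a GET against `/ws/v1/node/containers` instead.

```python
import json

# Hypothetical sample mirroring the documented Containers API JSON shape;
# a real client would fetch this body from http://<nm host>:<port>/ws/v1/node/containers.
sample = '''
{
  "containers" : {
    "container" : [
      {"id" : "container_1326121700862_0006_01_000001", "state" : "RUNNING",
       "user" : "user1", "totalMemoryNeededMB" : 2048, "totalVCoresNeeded" : 1,
       "exitCode" : -1000},
      {"id" : "container_1326121700862_0006_01_000003", "state" : "DONE",
       "user" : "user1", "totalMemoryNeededMB" : 2048, "totalVCoresNeeded" : 2,
       "exitCode" : 0}
    ]
  }
}
'''

def running_containers(body):
    """Return the ids of RUNNING containers from a /ws/v1/node/containers response."""
    payload = json.loads(body)
    # In JSON "container" is an array; the XML form has zero or more <container> elements.
    return [c["id"] for c in payload["containers"]["container"]
            if c["state"] == "RUNNING"]

print(running_containers(sample))
```

Note that when only one container is present, some JAXB-based JSON renderings collapse the array to a single object, so a defensive client may want to normalize that case before iterating.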
- -*---------------+--------------+-------------------------------+ -|| Item || Data Type || Description | -*---------------+--------------+-------------------------------+ -| containers | array of container objects(JSON)/zero or more container objects(XML) | A collection of container objects | -*---------------+--------------+-------------------------------+ - -** Response Examples - - <<JSON response>> - - HTTP Request: - ------- - GET http:///ws/v1/node/containers ------- - - Response Header: - -+---+ - HTTP/1.1 200 OK - Content-Type: application/json - Transfer-Encoding: chunked - Server: Jetty(6.1.26) -+---+ - - Response Body: -
-+---+
-{
-  "containers" : {
-    "container" : [
-      {
-        "nodeId" : "host.domain.com:8041",
-        "totalMemoryNeededMB" : 2048,
-        "totalVCoresNeeded" : 1,
-        "state" : "RUNNING",
-        "diagnostics" : "",
-        "containerLogsLink" : "http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000001/user1",
-        "user" : "user1",
-        "id" : "container_1326121700862_0006_01_000001",
-        "exitCode" : -1000
-      },
-      {
-        "nodeId" : "host.domain.com:8041",
-        "totalMemoryNeededMB" : 2048,
-        "totalVCoresNeeded" : 2,
-        "state" : "RUNNING",
-        "diagnostics" : "",
-        "containerLogsLink" : "http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000003/user1",
-        "user" : "user1",
-        "id" : "container_1326121700862_0006_01_000003",
-        "exitCode" : -1000
-      }
-    ]
-  }
-}
-+---+
- - <<XML response>> - - HTTP Request: - ------- - GET http:///ws/v1/node/containers - Accept: application/xml ------- - - Response Header: - -+---+ - HTTP/1.1 200 OK - Content-Type: application/xml - Content-Length: 988 - Server: Jetty(6.1.26) -+---+ - - Response Body: -
-+---+
-<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
-<containers>
-  <container>
-    <id>container_1326121700862_0006_01_000001</id>
-    <state>RUNNING</state>
-    <exitCode>-1000</exitCode>
-    <diagnostics/>
-    <user>user1</user>
-    <totalMemoryNeededMB>2048</totalMemoryNeededMB>
-    <totalVCoresNeeded>1</totalVCoresNeeded>
-    <containerLogsLink>http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000001/user1</containerLogsLink>
-    <nodeId>host.domain.com:8041</nodeId>
-  </container>
-  <container>
-    <id>container_1326121700862_0006_01_000003</id>
-    <state>DONE</state>
-    <exitCode>0</exitCode>
-    <diagnostics>Container killed by the ApplicationMaster.</diagnostics>
-    <user>user1</user>
-    <totalMemoryNeededMB>2048</totalMemoryNeededMB>
-    <totalVCoresNeeded>2</totalVCoresNeeded>
-    <containerLogsLink>http://host.domain.com:8042/node/containerlogs/container_1326121700862_0006_01_000003/user1</containerLogsLink>
-    <nodeId>host.domain.com:8041</nodeId>
-  </container>
-</containers>
-+---+
- - -* {Container API} - - A container resource contains information about a particular container that is running on this NodeManager. - -** URI - - Use the following URI to obtain a Container Object, for a container identified by the {containerid} value. - ------- - * http:///ws/v1/node/containers/{containerid} ------- - -** HTTP Operations Supported - ------- - * GET ------- - -** Query Parameters Supported - ------- - None ------- - -** Elements of the object - -*---------------+--------------+-------------------------------+ -|| Item || Data Type || Description | -*---------------+--------------+-------------------------------+ -| id | string | The container id | -*---------------+--------------+-------------------------------+ -| state | string | State of the container - valid states are: NEW, LOCALIZING, LOCALIZATION_FAILED, LOCALIZED, RUNNING, EXITED_WITH_SUCCESS, EXITED_WITH_FAILURE, KILLING, CONTAINER_CLEANEDUP_AFTER_KILL, CONTAINER_RESOURCES_CLEANINGUP, DONE | -*---------------+--------------+-------------------------------+ -| nodeId | string | The id of the node the container is on | -*---------------+--------------+-------------------------------+ -| containerLogsLink | string | The http link to the container logs | -*---------------+--------------+-------------------------------+ -| user | string | The user name of the user who started the container | -*---------------+--------------+-------------------------------+ -| exitCode | int | Exit code of the container | -*---------------+--------------+-------------------------------+ -| diagnostics | string | A diagnostic message for failed containers | -*---------------+--------------+-------------------------------+ -| totalMemoryNeededMB | long | Total amount of memory needed by the container (in MB) |
-*---------------+--------------+-------------------------------+ -| totalVCoresNeeded | long | Total number of virtual cores needed by the container | -*---------------+--------------+-------------------------------+ - -** Response Examples - - <<JSON response>> - - HTTP Request: - ------- - GET http:///ws/v1/node/containers/container_1326121700862_0007_01_000001 ------- - - Response Header: - -+---+ - HTTP/1.1 200 OK - Content-Type: application/json - Transfer-Encoding: chunked - Server: Jetty(6.1.26) -+---+ - - Response Body: -
-+---+
-{
-  "container" : {
-    "nodeId" : "host.domain.com:8041",
-    "totalMemoryNeededMB" : 2048,
-    "totalVCoresNeeded" : 1,
-    "state" : "RUNNING",
-    "diagnostics" : "",
-    "containerLogsLink" : "http://host.domain.com:8042/node/containerlogs/container_1326121700862_0007_01_000001/user1",
-    "user" : "user1",
-    "id" : "container_1326121700862_0007_01_000001",
-    "exitCode" : -1000
-  }
-}
-+---+
- - <<XML response>> - - HTTP Request: - ------- - GET http:///ws/v1/node/containers/container_1326121700862_0007_01_000001 - Accept: application/xml ------- - - Response Header: - -+---+ - HTTP/1.1 200 OK - Content-Type: application/xml - Content-Length: 491 - Server: Jetty(6.1.26) -+---+ - - Response Body: -
-+---+
-<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
-<container>
-  <id>container_1326121700862_0007_01_000001</id>
-  <state>RUNNING</state>
-  <exitCode>-1000</exitCode>
-  <diagnostics/>
-  <user>user1</user>
-  <totalMemoryNeededMB>2048</totalMemoryNeededMB>
-  <totalVCoresNeeded>1</totalVCoresNeeded>
-  <containerLogsLink>http://host.domain.com:8042/node/containerlogs/container_1326121700862_0007_01_000001/user1</containerLogsLink>
-  <nodeId>host.domain.com:8041</nodeId>
-</container>
-+---+
- http://git-wip-us.apache.org/repos/asf/hadoop/blob/2e44b75f/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRestart.apt.vm ---------------------------------------------------------------------- diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRestart.apt.vm b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRestart.apt.vm deleted file mode 100644 index ba03f4e..0000000 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/NodeManagerRestart.apt.vm +++ /dev/null
@@ -1,86 +0,0 @@ -~~ Licensed under the Apache License, Version 2.0 (the "License"); -~~ you may not use this file except in compliance with the License. -~~ You may obtain a copy of the License at -~~ -~~ http://www.apache.org/licenses/LICENSE-2.0 -~~ -~~ Unless required by applicable law or agreed to in writing, software -~~ distributed under the License is distributed on an "AS IS" BASIS, -~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -~~ See the License for the specific language governing permissions and -~~ limitations under the License. See accompanying LICENSE file. - - --- - NodeManager Restart - --- - --- - ${maven.build.timestamp} - -NodeManager Restart - -* Introduction - - This document gives an overview of NodeManager (NM) restart, a feature that - enables the NodeManager to be restarted without losing - the active containers running on the node. At a high level, the NM stores any - necessary state to a local state-store as it processes container-management - requests. When the NM restarts, it recovers by first loading state for - various subsystems and then letting those subsystems perform recovery using - the loaded state. 
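Pulled together, the recovery setup described in the next section amounts to a small yarn-site.xml fragment along these lines (the state-store directory and the port 45454 here are illustrative values taken from the examples below, not mandated defaults):

```xml
<!-- Illustrative yarn-site.xml fragment for enabling NodeManager restart.
     The directory and port values are examples, not defaults. -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/hadoop/yarn-nm-recovery</value>
</property>
<property>
  <!-- A fixed, non-ephemeral port is a precondition for NM restart. -->
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
```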
- -* Enabling NM Restart - - [[1]] To enable NM Restart functionality, set the following property in <<yarn-site.xml>> to true: - -*--------------------------------------+--------------------------------------+ -|| Property || Value | -*--------------------------------------+--------------------------------------+ -| <<yarn.nodemanager.recovery.enabled>> | | -| | <<true>>, (default value is set to false) | -*--------------------------------------+--------------------------------------+ - - [[2]] Configure a path to the local file-system directory where the - NodeManager can save its run state - -*--------------------------------------+--------------------------------------+ -|| Property || Description | -*--------------------------------------+--------------------------------------+ -| <<yarn.nodemanager.recovery.dir>> | | -| | The local filesystem directory in which the node manager will store state | -| | when recovery is enabled. | -| | The default value is set to | -| | <<<${hadoop.tmp.dir}/yarn-nm-recovery>>>. | -*--------------------------------------+--------------------------------------+ - - [[3]] Configure a valid RPC address for the NodeManager - -*--------------------------------------+--------------------------------------+ -|| Property || Description | -*--------------------------------------+--------------------------------------+ -| <<yarn.nodemanager.address>> | | -| | Ephemeral ports (port 0, which is the default) cannot be used for the | -| | NodeManager's RPC server specified via yarn.nodemanager.address, as that can | -| | make the NM use different ports before and after a restart. This will break any | -| | previously running clients that were communicating with the NM before | -| | restart. Explicitly setting yarn.nodemanager.address to an address with a | -| | specific port number (e.g., 0.0.0.0:45454) is a precondition for enabling | -| | NM restart. | -*--------------------------------------+--------------------------------------+ - - [[4]] Auxiliary services - - NodeManagers in a YARN cluster can be configured to run auxiliary services.
- For a completely functional NM restart, YARN relies on any auxiliary service - configured to also support recovery. This usually includes (1) avoiding usage - of ephemeral ports so that previously running clients (in this case, usually - containers) are not disrupted after restart and (2) having the auxiliary - service itself support recoverability by reloading any previous state when the - NodeManager restarts and reinitializes the auxiliary service. - - A simple example of the above is the auxiliary service 'ShuffleHandler' for - MapReduce (MR). ShuffleHandler already respects the above two requirements, - so users/admins don't have to do anything for it to support NM restart: (1) The - configuration property <<mapreduce.shuffle.port>> controls which port the - ShuffleHandler on a NodeManager host binds to, and it defaults to a - non-ephemeral port. (2) The ShuffleHandler service also already supports - recovery of previous state after NM restarts. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/hadoop/blob/2e44b75f/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerHA.apt.vm ---------------------------------------------------------------------- diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerHA.apt.vm b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerHA.apt.vm deleted file mode 100644 index 0346cda..0000000 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerHA.apt.vm +++ /dev/null @@ -1,233 +0,0 @@ -~~ Licensed under the Apache License, Version 2.0 (the "License"); -~~ you may not use this file except in compliance with the License. -~~ You may obtain a copy of the License at -~~ -~~ http://www.apache.org/licenses/LICENSE-2.0 -~~ -~~ Unless required by applicable law or agreed to in writing, software -~~ distributed under the License is distributed on an "AS IS" BASIS, -~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and -~~ limitations under the License. See accompanying LICENSE file. - - --- - ResourceManager High Availability - --- - --- - ${maven.build.timestamp} - -ResourceManager High Availability - -%{toc|section=1|fromDepth=0} - -* Introduction - - This guide provides an overview of High Availability of YARN's ResourceManager, - and details how to configure and use this feature. The ResourceManager (RM) - is responsible for tracking the resources in a cluster, and scheduling - applications (e.g., MapReduce jobs). Prior to Hadoop 2.4, the ResourceManager - was the single point of failure in a YARN cluster. The High Availability - feature adds redundancy in the form of an Active/Standby ResourceManager pair - to remove this otherwise single point of failure. - -* Architecture - -[images/rm-ha-overview.png] Overview of ResourceManager High Availability - -** RM Failover - - ResourceManager HA is realized through an Active/Standby architecture - at - any point of time, one of the RMs is Active, and one or more RMs are in - Standby mode waiting to take over should anything happen to the Active. - The trigger to transition-to-active comes from either the admin (through CLI) - or through the integrated failover-controller when automatic failover is - enabled. - -*** Manual transitions and failover - - When automatic failover is not enabled, admins have to manually transition - one of the RMs to Active. To fail over from one RM to the other, they are - expected to first transition the Active-RM to Standby and then transition a - Standby-RM to Active. All this can be done using the "<<yarn rmadmin>>" - CLI. - -*** Automatic failover - - The RMs have an option to embed the Zookeeper-based ActiveStandbyElector to - decide which RM should be the Active. When the Active goes down or becomes - unresponsive, another RM is automatically elected to be the Active, which - then takes over.
Note that there is no need to run a separate ZKFC daemon - as is the case for HDFS, because the ActiveStandbyElector embedded in the RMs acts - as a failure detector and a leader elector instead of a separate ZKFC - daemon. - -*** Client, ApplicationMaster and NodeManager on RM failover - - When there are multiple RMs, the configuration (yarn-site.xml) used by - clients and nodes is expected to list all the RMs. Clients, - ApplicationMasters (AMs) and NodeManagers (NMs) try connecting to the RMs in - a round-robin fashion until they hit the Active RM. If the Active goes down, - they resume the round-robin polling until they hit the "new" Active. - This default retry logic is implemented as - <<ConfiguredRMFailoverProxyProvider>>. - You can override the logic by - implementing <<org.apache.hadoop.yarn.client.RMFailoverProxyProvider>> and - setting the value of <<yarn.client.failover-proxy-provider>> to - the class name. - -** Recovering previous active-RM's state - - With {{{./ResourceManagerRestart.html}ResourceManager Restart}} enabled, - the RM being promoted to an active state loads the RM internal state and - continues to operate from where the previous active left off as much as - possible, depending on the RM restart feature. A new attempt is spawned for - each managed application previously submitted to the RM. Applications can - checkpoint periodically to avoid losing any work. The state-store must be - visible to both the Active and Standby RMs. Currently, there are two - RMStateStore implementations for persistence - FileSystemRMStateStore - and ZKRMStateStore. The <<ZKRMStateStore>> implicitly allows write access - to a single RM at any point in time, and hence is the recommended store to - use in an HA cluster. When using the ZKRMStateStore, there is no need for a - separate fencing mechanism to address a potential split-brain situation - where multiple RMs can potentially assume the Active role. - - -* Deployment - -** Configurations - - Most of the failover functionality is tunable using various configuration - properties. Following is a list of required/important ones.
yarn-default.xml - carries a full list of knobs. See - {{{../hadoop-yarn-common/yarn-default.xml}yarn-default.xml}} - for more information, including default values. - See also {{{./ResourceManagerRestart.html}the document for ResourceManager - Restart}} for instructions on setting up the state-store. - -*-------------------------+----------------------------------------------+ -|| Configuration Property || Description | -*-------------------------+----------------------------------------------+ -| yarn.resourcemanager.zk-address | | -| | Address of the ZK-quorum. -| | Used both for the state-store and embedded leader-election. -*-------------------------+----------------------------------------------+ -| yarn.resourcemanager.ha.enabled | | -| | Enable RM HA -*-------------------------+----------------------------------------------+ -| yarn.resourcemanager.ha.rm-ids | | -| | List of logical IDs for the RMs. | -| | e.g., "rm1,rm2" | -*-------------------------+----------------------------------------------+ -| yarn.resourcemanager.hostname.<rm-id> | | -| | For each <rm-id>, specify the hostname the | -| | RM corresponds to. Alternately, one could set each of the RM's service | -| | addresses. | -*-------------------------+----------------------------------------------+ -| yarn.resourcemanager.ha.id | | -| | Identifies the RM in the ensemble. This is optional; | -| | however, if set, admins have to ensure that all the RMs have their own | -| | IDs in the config | -*-------------------------+----------------------------------------------+ -| yarn.resourcemanager.ha.automatic-failover.enabled | | -| | Enable automatic failover; | -| | By default, it is enabled only when HA is enabled. | -*-------------------------+----------------------------------------------+ -| yarn.resourcemanager.ha.automatic-failover.embedded | | -| | Use the embedded leader-elector | -| | to pick the Active RM, when automatic failover is enabled. By default, | -| | it is enabled only when HA is enabled.
| -*-------------------------+----------------------------------------------+ -| yarn.resourcemanager.cluster-id | | -| | Identifies the cluster. Used by the elector to | -| | ensure an RM doesn't take over as Active for another cluster. | -*-------------------------+----------------------------------------------+ -| yarn.client.failover-proxy-provider | | -| | The class to be used by Clients, AMs and NMs to fail over to the Active RM. | -*-------------------------+----------------------------------------------+ -| yarn.client.failover-max-attempts | | -| | The max number of times FailoverProxyProvider should attempt failover. | -*-------------------------+----------------------------------------------+ -| yarn.client.failover-sleep-base-ms | | -| | The sleep base (in milliseconds) to be used for calculating | -| | the exponential delay between failovers. | -*-------------------------+----------------------------------------------+ -| yarn.client.failover-sleep-max-ms | | -| | The maximum sleep time (in milliseconds) between failovers | -*-------------------------+----------------------------------------------+ -| yarn.client.failover-retries | | -| | The number of retries per attempt to connect to a ResourceManager. | -*-------------------------+----------------------------------------------+ -| yarn.client.failover-retries-on-socket-timeouts | | -| | The number of retries per attempt to connect to a ResourceManager on socket timeouts. | -*-------------------------+----------------------------------------------+ - -*** Sample configurations - - Here is a sample of the minimal setup for RM failover.
-
-+---+
-<property>
-  <name>yarn.resourcemanager.ha.enabled</name>
-  <value>true</value>
-</property>
-<property>
-  <name>yarn.resourcemanager.cluster-id</name>
-  <value>cluster1</value>
-</property>
-<property>
-  <name>yarn.resourcemanager.ha.rm-ids</name>
-  <value>rm1,rm2</value>
-</property>
-<property>
-  <name>yarn.resourcemanager.hostname.rm1</name>
-  <value>master1</value>
-</property>
-<property>
-  <name>yarn.resourcemanager.hostname.rm2</name>
-  <value>master2</value>
-</property>
-<property>
-  <name>yarn.resourcemanager.zk-address</name>
-  <value>zk1:2181,zk2:2181,zk3:2181</value>
-</property>
-+---+
- -** Admin commands - - <<yarn rmadmin>> has a few HA-specific command options to check the health/state of an - RM, and transition to Active/Standby. - Commands for HA take the service id of an RM, set by <<yarn.resourcemanager.ha.rm-ids>>, - as an argument. -
-+---+
- $ yarn rmadmin -getServiceState rm1
- active
-
- $ yarn rmadmin -getServiceState rm2
- standby
-+---+
- - If automatic failover is enabled, you cannot use the manual transition commands. - Although you can override this with the --forcemanual flag, use it with caution. -
-+---+
- $ yarn rmadmin -transitionToStandby rm1
- Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@1d8299fd
- Refusing to manually manage HA state, since it may cause
- a split-brain scenario or other incorrect state.
- If you are very sure you know what you are doing, please
- specify the forcemanual flag.
-+---+
- - See {{{./YarnCommands.html}YarnCommands}} for more details. - -** ResourceManager Web UI services - - Assuming a standby RM is up and running, the Standby automatically redirects - all web requests to the Active, except for the "About" page. - -** Web Services - - Assuming a standby RM is up and running, the RM web services described at - {{{./ResourceManagerRest.html}ResourceManager REST APIs}}, when invoked on - a standby RM, are automatically redirected to the Active RM.
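The round-robin retry behavior that clients, AMs and NMs use to locate the Active RM (described under "Client, ApplicationMaster and NodeManager on RM failover" above) can be sketched as follows. This is an illustrative model, not the actual `ConfiguredRMFailoverProxyProvider` implementation: `probe` is a hypothetical stand-in for the real RPC to an RM, and `max_attempts` plays the role of the yarn.client.failover-max-attempts cap (the real provider additionally applies exponential backoff between attempts, per yarn.client.failover-sleep-base-ms/-max-ms).

```python
import itertools

def find_active_rm(rm_ids, probe, max_attempts=10):
    """Round-robin over the configured RM ids until one reports itself Active.

    probe(rm_id) stands in for a real RPC/REST call; it should return True
    when that RM is Active, and False (or raise OSError) otherwise.
    """
    for attempt, rm_id in enumerate(itertools.cycle(rm_ids)):
        if attempt >= max_attempts:
            raise RuntimeError("no Active RM found after %d attempts" % max_attempts)
        try:
            if probe(rm_id):
                return rm_id
        except OSError:
            pass  # treat a connection failure like a standby and move on

print(find_active_rm(["rm1", "rm2"], lambda rm: rm == "rm2"))
```

If the Active goes down mid-run, a real client simply re-enters this loop and keeps polling until the "new" Active answers.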