From commits-return-45963-archive-asf-public=cust-asf.ponee.io@qpid.apache.org Mon Jul 2 16:27:39 2018 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id F14D81807A3 for ; Mon, 2 Jul 2018 16:27:35 +0200 (CEST) Received: (qmail 7662 invoked by uid 500); 2 Jul 2018 14:27:34 -0000 Mailing-List: contact commits-help@qpid.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@qpid.apache.org Delivered-To: mailing list commits@qpid.apache.org Received: (qmail 4809 invoked by uid 99); 2 Jul 2018 14:27:32 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 02 Jul 2018 14:27:32 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 923B3E11B7; Mon, 2 Jul 2018 14:27:31 +0000 (UTC) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: robbie@apache.org To: commits@qpid.apache.org Date: Mon, 02 Jul 2018 14:28:13 -0000 Message-Id: <218ae7e51dd148ad997e3912d25ac593@git.apache.org> In-Reply-To: <41b92f51deb445debdd8ef240b8595bd@git.apache.org> References: <41b92f51deb445debdd8ef240b8595bd@git.apache.org> X-Mailer: ASF-Git Admin Mailer Subject: [44/51] [partial] qpid-site git commit: tidy out some site content for the oldest releases http://git-wip-us.apache.org/repos/asf/qpid-site/blob/fb1899b6/content/releases/qpid-cpp-0.34/cpp-broker/book/chapter-ha.html ---------------------------------------------------------------------- diff --git a/content/releases/qpid-cpp-0.34/cpp-broker/book/chapter-ha.html b/content/releases/qpid-cpp-0.34/cpp-broker/book/chapter-ha.html deleted file mode 100644 index 0b44121..0000000 --- 
a/content/releases/qpid-cpp-0.34/cpp-broker/book/chapter-ha.html +++ /dev/null @@ -1,930 +0,0 @@ - - - - - 1.12. Active-Passive Messaging Clusters - Apache Qpid™ - - - - - - - - - - - - - -

1.12. Active-Passive Messaging Clusters

1.12.1. Overview

- - The High Availability (HA) module provides - active-passive, hot-standby - messaging clusters for fault-tolerant message delivery. -

- In an active-passive cluster only one broker, known as the - primary, is active and serving clients at a time. The other - brokers are standing by as backups. Changes on the primary - are replicated to all the backups so they are always up-to-date or "hot". Backup - brokers reject client connection attempts, to enforce the requirement that clients - only connect to the primary. -

- If the primary fails, one of the backups is promoted to take over as the new - primary. Clients fail-over to the new primary automatically. If there are multiple - backups, the other backups also fail-over to become backups of the new primary. -

- This approach relies on an external cluster resource manager - to detect failures, choose the new primary and handle network partitions. rgmanager is supported - initially, but others may be supported in the future. -

1.12.1.1. Avoiding message loss

- In order to avoid message loss, the primary broker delays - acknowledgement of messages received from clients until the - message has been replicated and acknowledged by all of the back-up - brokers, or has been consumed from the primary queue. -

- This ensures that all acknowledged messages are safe: they have either - been consumed or backed up to all backup brokers. Messages that are - consumed before they are replicated do not need to - be replicated. This reduces the work load when replicating a queue with - active consumers. -
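The acknowledgement rule above can be illustrated with a toy model. This is only a sketch of the rule as described in the text, not the broker's implementation; the class `PrimaryModel` and its method names are hypothetical.

```python
class PrimaryModel:
    """Toy model of a primary that delays acknowledgement to the sender."""

    def __init__(self, backups):
        self.backups = set(backups)  # names of the connected backup brokers
        self.pending = {}            # message id -> backups that have not acked yet
        self.acked = []              # ids acknowledged to the sender, in order

    def receive(self, msg_id):
        # A new message waits for every backup before the sender is acknowledged.
        self.pending[msg_id] = set(self.backups)
        self._maybe_ack(msg_id)

    def backup_ack(self, msg_id, backup):
        # A backup confirms that it holds a replica of the message.
        if msg_id in self.pending:
            self.pending[msg_id].discard(backup)
            self._maybe_ack(msg_id)

    def consume(self, msg_id):
        # Consumed from the primary queue: replication is no longer needed.
        if msg_id in self.pending:
            self.pending[msg_id].clear()
            self._maybe_ack(msg_id)

    def _maybe_ack(self, msg_id):
        # Acknowledge to the sender once nothing is outstanding for this message.
        if not self.pending[msg_id]:
            del self.pending[msg_id]
            self.acked.append(msg_id)
```

In this model a message is acknowledged to the sender either when the last backup acknowledges it or when it is consumed, whichever happens first.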

- Clients keep unacknowledged messages in a buffer - [1] - until they are acknowledged by the primary. If the primary fails, clients will - fail-over to the new primary and re-send all their - unacknowledged messages. - [2] -

- If the primary crashes, all the acknowledged - messages will be available on the backup that takes over as the new - primary. The unacknowledged messages will be - re-sent by the clients. Thus no messages are lost. -

- Note that this means it is possible for messages to be - duplicated. In the event of a failure it is possible for a - message to be received by the backup that becomes the new primary - and re-sent by the client. The application must take steps - to identify and eliminate duplicates. -
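One common way for an application to eliminate duplicates is to have senders stamp every message with a unique identifier and have receivers remember recently seen identifiers. The sketch below assumes such an identifier exists; the class `DuplicateFilter` and the `capacity` parameter are hypothetical and not part of any Qpid API.

```python
from collections import OrderedDict


class DuplicateFilter:
    """Remembers the most recent message ids and reports repeats."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self._seen = OrderedDict()  # insertion-ordered set of recent ids

    def is_new(self, message_id):
        # A repeated id means the message was re-sent after a fail-over.
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # forget the oldest id
        return True
```

The bounded memory is a trade-off: a duplicate that arrives after its identifier has been evicted will not be detected, so the capacity should comfortably exceed the number of messages that can be in flight during a fail-over.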

- When a new primary is promoted after a fail-over it is initially in - "recovering" mode. In this mode, it delays acknowledgement of messages - on behalf of all the backups that were connected to the previous - primary. This protects those messages against a failure of the new - primary until the backups have a chance to connect and catch up. -

- Not all messages need to be replicated to the back-up brokers. If a - message is consumed and acknowledged by a regular client before it has - been replicated to a backup, then it doesn't need to be replicated. -

HA Broker States

Stand-alone

- Broker is not part of a HA cluster. -

Joining

- Newly started broker, not yet connected to any existing primary. -

Catch-up

- A backup broker that is connected to the primary and downloading - existing state (queues, messages etc.) -

Ready

- A backup broker that is fully caught-up and ready to take over as - primary. -

Recovering

- Newly-promoted primary, waiting for backups to connect and catch up. - Clients can connect but they are stalled until the primary is active. -

Active

- The active primary broker with all backups connected and caught-up. -

1.12.1.2. Limitations

- There are some known limitations in the current implementation. These - will be fixed in future versions. -

  • - Transactional changes to queue state are not replicated atomically. If - the primary crashes during a transaction, it is possible that the - backup could contain only part of the changes introduced by a - transaction. -

  • - Configuration changes (creating or deleting queues, exchanges and - bindings) are replicated asynchronously. Management tools used to - make changes will consider the change complete when it is complete - on the primary; it may not yet be replicated to all the backups. -

  • - Federation links to the primary will fail over - correctly. Federated links from the primary - will be lost in fail-over; they will not be re-connected to the new - primary. It is possible to work around this by replacing the - qpidd-primary start up script with a script that - re-creates federation links when the primary is promoted. -

1.12.2. Virtual IP Addresses

- Some resource managers (including rgmanager) support - virtual IP addresses. A virtual IP address is an IP - address that can be relocated to any of the nodes in a cluster. The - resource manager associates this address with the primary node in the - cluster, and relocates it to the new primary when there is a failure. This - simplifies configuration as you can publish a single IP address rather - than a list. -

- A virtual IP address can be used by clients to connect to the primary. The - following sections will explain how to configure virtual IP addresses for - clients or brokers. -

1.12.3. Configuring the Brokers

- The broker must load the ha module; it is loaded by - default. The following broker options are available for the HA module. -

Note

- Broker management is required for HA to operate; it is enabled by - default. The option mgmt-enable must not be set to - "no". -

Note

- Incorrect security settings are a common cause of problems when - getting started; see Section 1.12.9, “Security and Access Control.” -

Table 1.28. Broker Options for High Availability Messaging Cluster

- Options for High Availability Messaging Cluster -
- ha-cluster yes|no - - Set to "yes" to have the broker join a cluster. -
- ha-queue-replication yes|no - - Enable replication of specific queues without joining a cluster, see Section 1.13, “Replicating Queues with the HA module”. -
- ha-brokers-url URL - -

- The URL - [a] - used by cluster brokers to connect to each other. The URL should - contain a comma separated list of the broker addresses, rather than a - virtual IP address. -

-
ha-public-url URL -

- This option is only needed for backwards compatibility if you - have been using the amq.failover exchange. - This exchange is now obsolete; it is recommended to use a - virtual IP address instead. -

-

- If set, this URL is advertised by the - amq.failover exchange and overrides the - broker option known-hosts-url -

-
ha-replicate VALUE -

- Specifies whether queues and exchanges are replicated by default. - VALUE is one of: none, - configuration, all. - For details see Section 1.12.7, “Controlling replication of queues and exchanges”. -

-
-

ha-username USER

-

ha-password PASS

-

ha-mechanism MECHANISM

-
- Authentication settings used by HA brokers to connect to each other, - see Section 1.12.9, “Security and Access Control.” -
ha-backup-timeout SECONDS - [b] - -

- Maximum time that a recovering primary will wait for an expected - backup to connect and become ready. -

-
- link-maintenance-interval SECONDS - [b] - -

- HA uses federation links to connect from backup to primary. - Backup brokers check the link to the primary on this interval - and re-connect if need be. Default 2 seconds. Set lower for - faster failover, e.g. 0.1 seconds. Setting too low will result - in excessive link-checking on the backups. -

-
- link-heartbeat-interval SECONDS - [b] - -

- HA uses federation links to connect from backup to primary. - If no heart-beat is received for twice this interval the primary will consider that - backup dead (e.g. if backup is hung or partitioned.) - This interval is also used to time-out for broker status checks, - it may take up to this interval for rgmanager to detect a hung or partitioned broker. - Clients sending messages may be held up during this time. - Default 120 seconds: you will probably want to set this to a lower value e.g. 10. - If set too low rgmanager may consider a slow broker to have failed and kill it. -

-

[a] - The full format of the URL is given by this grammar: -

-url = ["amqp:"][ user ["/" password] "@" ] addr ("," addr)*
-addr = tcp_addr / rdma_addr / ssl_addr / ...
-tcp_addr = ["tcp:"] host [":" port]
-rdma_addr = "rdma:" host [":" port]
-ssl_addr = "ssl:" host [":" port]
-		  

-
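To make the grammar concrete, here is a minimal sketch of a parser for it. It is not the broker's own parser: the function name `parse_broker_url` is hypothetical, only the tcp, rdma and ssl address forms are handled (the grammar's "..." alternatives are not), IPv6 literals are not supported, and a missing port is returned as `None` rather than filled in with a default.

```python
import re

# [protocol ":"] host [":" port], limited to the three protocols named in the grammar.
_ADDR = re.compile(r"^(?:(tcp|rdma|ssl):)?([^:,@/]+)(?::(\d+))?$")


def parse_broker_url(url):
    """Split a broker URL into (user, password, [(protocol, host, port), ...])."""
    # The "amqp:" prefix is optional.
    rest = url[len("amqp:"):] if url.startswith("amqp:") else url

    # Optional credentials: user ["/" password] "@"
    user = password = None
    if "@" in rest:
        credentials, rest = rest.split("@", 1)
        user, _, pw = credentials.partition("/")
        password = pw or None

    # One or more comma-separated addresses; "tcp:" is the implied protocol.
    addresses = []
    for part in rest.split(","):
        match = _ADDR.match(part)
        if match is None:
            raise ValueError("unsupported address: %r" % part)
        protocol, host, port = match.groups()
        addresses.append((protocol or "tcp", host, int(port) if port else None))
    return user, password, addresses
```

For example, `parse_broker_url("20.0.20.1,20.0.20.2,20.0.20.3")` yields no credentials and three tcp addresses without explicit ports.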

[b] - Values specified as SECONDS can be a - fraction of a second, e.g. "0.1" for a tenth of a second. - They can also have an explicit unit, - e.g. 10s (seconds), 10ms (milliseconds), 10us (microseconds), 10ns (nanoseconds) -
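The SECONDS formats listed in footnote [b] can be sketched as a small converter. This is an illustration of the formats named above, not the broker's parser; the function name `parse_seconds` is hypothetical, and whether the broker accepts other spellings is not covered here.

```python
import re

# Multipliers that convert each unit named in the footnote to seconds.
_UNITS = {"": 1.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}
_DURATION = re.compile(r"^(\d+(?:\.\d*)?|\.\d+)(ns|us|ms|s)?$")


def parse_seconds(text):
    """Convert a value such as "0.1", "10s" or "10ms" to seconds as a float."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError("not a recognised duration: %r" % text)
    number, unit = match.groups()
    # A bare number is interpreted as seconds.
    return float(number) * _UNITS[unit or ""]
```

With this reading, `link-maintenance-interval=0.1` and `link-maintenance-interval=100ms` describe the same interval.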


- To configure a HA cluster you must set at least ha-cluster and - ha-brokers-url. -

1.12.4. The Cluster Resource Manager

- Broker fail-over is managed by a cluster resource - manager. An integration with rgmanager is - provided, but it is possible to integrate with other resource managers. -

- The resource manager is responsible for starting the qpidd broker - on each node in the cluster. The resource manager then promotes - one of the brokers to be the primary. The other brokers connect to the primary as - backups, using the URL provided in the ha-brokers-url configuration - option. -

- Once connected, the backup brokers synchronize their state with the - primary. When a backup is synchronized, or "hot", it is ready to take - over if the primary fails. Backup brokers continually receive updates - from the primary in order to stay synchronized. -

- If the primary fails, backup brokers go into fail-over mode. The resource - manager must detect the failure and promote one of the backups to be the - new primary. The other backups connect to the new primary and synchronize - their state with it. -

- The resource manager is also responsible for protecting the cluster from - split-brain conditions resulting from a network partition. A - network partition divides a cluster into two sub-groups which cannot see each other. - Usually a quorum voting algorithm is used that disables nodes - in the inquorate sub-group. -
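As a rough illustration of quorum voting, the sketch below applies a simple strict-majority rule. This is an assumption made for the example: the actual calculation is performed by the cluster manager and may be configured differently. The function name `is_quorate` is hypothetical.

```python
def is_quorate(votes_present, total_votes):
    """Return True if a sub-group holds a strict majority of all votes."""
    return 2 * votes_present > total_votes


# A 3-node cluster, one vote per node, split into sub-groups of 2 and 1 nodes.
for group_size in (2, 1):
    state = "quorate" if is_quorate(group_size, 3) else "inquorate"
    print("sub-group of %d node(s): %s" % (group_size, state))
```

Under a strict-majority rule at most one sub-group can be quorate after a partition, which is what prevents two primaries from serving clients at the same time.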

1.12.5. Configuring with rgmanager as resource manager

- This section assumes that you are already familiar with setting up and configuring - clustered services using cman and - rgmanager. It will show you how to configure an active-passive, - hot-standby qpidd HA cluster with rgmanager. -

Note

- Once all components are installed it is important to take the following steps: -

-chkconfig rgmanager on
-chkconfig cman on
-chkconfig qpidd off
-	

-

- The qpidd service must be off in - chkconfig because rgmanager will - start and stop qpidd. If the normal system init - process also attempts to start and stop qpidd it can cause rgmanager to - lose track of qpidd processes. The symptom when this happens is that - clustat shows a qpidd service to - be stopped when in fact there is a qpidd process - running. The qpidd log will show errors like this: -

-critical Unexpected error: Daemon startup failed: Cannot lock /var/lib/qpidd/lock: Resource temporarily unavailable
-	

-

- You must provide a cluster.conf file to configure - cman and rgmanager. Here is - an example cluster.conf file for a cluster of 3 nodes named - node1, node2 and node3. We will go through the configuration step-by-step. -

-      
-<?xml version="1.0"?>
-<!--
-This is an example of a cluster.conf file to run qpidd HA under rgmanager.
-This example assumes a 3 node cluster, with nodes named node1, node2 and node3.
-
-NOTE: fencing is not shown, you must configure fencing appropriately for your cluster.
--->
-
-<cluster name="qpid-test" config_version="18">
-  <!-- The cluster has 3 nodes. Each has a unique nodeid and one vote
-       for quorum. -->
-  <clusternodes>
-    <clusternode name="node1.example.com" nodeid="1"/>
-    <clusternode name="node2.example.com" nodeid="2"/>
-    <clusternode name="node3.example.com" nodeid="3"/>
-  </clusternodes>
-
-  <!-- Resource Manager configuration.
-
-   status_poll_interval is the interval in seconds at which the resource manager checks the status
-   of managed services. This affects how quickly the manager will detect failed services.
-   -->
-  <rm status_poll_interval="1">
-    <!--
-	There is a failoverdomain for each node containing just that node.
-	This lets us stipulate that the qpidd service should always run on each node.
-    -->
-    <failoverdomains>
-      <failoverdomain name="node1-domain" restricted="1">
-	<failoverdomainnode name="node1.example.com"/>
-      </failoverdomain>
-      <failoverdomain name="node2-domain" restricted="1">
-	<failoverdomainnode name="node2.example.com"/>
-      </failoverdomain>
-      <failoverdomain name="node3-domain" restricted="1">
-	<failoverdomainnode name="node3.example.com"/>
-      </failoverdomain>
-    </failoverdomains>
-
-    <resources>
-      <!-- This script starts a qpidd broker acting as a backup. -->
-      <script file="/etc/init.d/qpidd" name="qpidd"/>
-
-      <!-- This script promotes the qpidd broker on this node to primary. -->
-      <script file="/etc/init.d/qpidd-primary" name="qpidd-primary"/>
-
-      <!--
-          This is a virtual IP address for client traffic.
-	  monitor_link="yes" means monitor the health of the NIC used for the VIP.
-	  sleeptime="0" means don't delay when failing over the VIP to a new address.
-      -->
-      <ip address="20.0.20.200" monitor_link="yes" sleeptime="0"/>
-    </resources>
-
-    <!-- There is a qpidd service on each node, it should be restarted if it fails. -->
-    <service name="node1-qpidd-service" domain="node1-domain" recovery="restart">
-      <script ref="qpidd"/>
-    </service>
-    <service name="node2-qpidd-service" domain="node2-domain" recovery="restart">
-      <script ref="qpidd"/>
-    </service>
-    <service name="node3-qpidd-service" domain="node3-domain"  recovery="restart">
-      <script ref="qpidd"/>
-    </service>
-
-    <!-- There should always be a single qpidd-primary service, it can run on any node. -->
-    <service name="qpidd-primary-service" autostart="1" exclusive="0" recovery="relocate">
-      <script ref="qpidd-primary"/>
-      <!-- The primary has the IP addresses for brokers and clients to connect. -->
-      <ip ref="20.0.20.200"/>
-    </service>
-  </rm>
-</cluster>
-      
-    

- There is a failoverdomain for each node containing just that - one node. This lets us stipulate that the qpidd service should always run on all - nodes. -

- The resources section defines the qpidd - script used to start the qpidd service. It also defines the - qpidd-primary script, which does not - actually start a new service; rather, it promotes the existing - qpidd broker to primary status. -

- The resources section also defines a virtual IP - address for clients: 20.0.20.200. -

- qpidd.conf should contain these lines: -

-ha-cluster=yes
-ha-brokers-url=20.0.20.1,20.0.20.2,20.0.20.3
-    

- The brokers connect to each other directly via the addresses - listed in ha-brokers-url. Note the client and broker - addresses are on separate sub-nets; this is recommended but not required. -

- The service section defines 3 qpidd - services, one for each node. Each service is in a restricted fail-over - domain containing just that node, and has the restart - recovery policy. The effect of this is that rgmanager will run - qpidd on each node, restarting if it fails. -

- There is a single qpidd-primary-service using the - qpidd-primary script which is not restricted to a - domain and has the relocate recovery policy. This means - rgmanager will start qpidd-primary on one of the nodes - when the cluster starts and will relocate it to another node if the - original node fails. Running the qpidd-primary script - does not start a new broker process, it promotes the existing broker to - become the primary. -

1.12.5.1. Shutting down qpidd on a HA node

- As explained above both the per-node qpidd service - and the re-locatable qpidd-primary service are - implemented by the same qpidd daemon. -

- As a result, stopping the qpidd service will not stop - a qpidd daemon that is acting as primary, and - stopping the qpidd-primary service will not stop a - qpidd process that is acting as backup. -

- To shut down a node that is acting as primary you need to shut down the - qpidd service and relocate the - primary: -

-

-clusvcadm -d somenode-qpidd-service
-clusvcadm -r qpidd-primary-service
-        

-

- This will shut down the qpidd daemon on that node and - prevent the primary service from relocating back to the node - because the qpidd service is no longer running there. -

1.12.6. Broker Administration Tools

- Normally, clients are not allowed to connect to a backup broker. However, - management tools are allowed to connect to a backup broker. If you use - these tools you must not add or remove messages from - replicated queues, nor create or delete replicated queues or exchanges as - this will disrupt the replication process and may cause message loss. -

- qpid-ha allows you to view and change HA configuration settings. -

- The tools qpid-config, qpid-route and - qpid-stat will connect to a backup if you pass the flag --ha-admin on the - command line. -

1.12.7. Controlling replication of queues and exchanges

- By default, queues and exchanges are not replicated automatically. You can change - the default behaviour by setting the ha-replicate configuration - option. It has one of the following values: -

  • - all: Replicate everything automatically: queues, - exchanges, bindings and messages. -

  • - configuration: Replicate the existence of queues, - exchanges and bindings but don't replicate messages. -

  • - none: Don't replicate anything; this is the default. -

-

- You can override the default for a particular queue or exchange by passing the - argument qpid.replicate when creating the queue or exchange. It - takes the same values as ha-replicate. -

- Bindings are automatically replicated if the queue and exchange being bound both - have replication all or configuration, they - are not replicated otherwise. -

- You can create replicated queues and exchanges with the - qpid-config management tool like this: -

-qpid-config add queue myqueue --replicate all
-    

- To create replicated queues and exchanges via the client API, add a - node entry to the address like this: -

-"myqueue;{create:always,node:{x-declare:{arguments:{'qpid.replicate':all}}}}"
-    

- There are some built-in exchanges created automatically by the broker, these - exchanges are never replicated. The built-in exchanges are the default (nameless) - exchange, the AMQP standard exchanges (amq.direct, amq.topic, amq.fanout and - amq.match) and the management exchanges (qpid.management, qmf.default.direct and - qmf.default.topic) -

- Note that if you bind a replicated queue to one of these exchanges, the - binding will not be replicated, so the queue will not - have the binding after a fail-over. -
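The replication rules in this section can be summarised in a short sketch. The helper names `replicated_address` and `binding_is_replicated` are hypothetical; the address string format is the one shown above, and the binding rule restates the text rather than the broker's code.

```python
REPLICATE_LEVELS = ("none", "configuration", "all")


def replicated_address(name, level="all"):
    """Build an address string that declares a queue with a qpid.replicate level."""
    if level not in REPLICATE_LEVELS:
        raise ValueError("unknown replication level: %r" % level)
    return (
        "%s;{create:always,node:{x-declare:{arguments:{'qpid.replicate':%s}}}}"
        % (name, level)
    )


def binding_is_replicated(queue_level, exchange_level):
    """A binding is replicated only if both ends are 'configuration' or 'all'."""
    replicated = ("configuration", "all")
    return queue_level in replicated and exchange_level in replicated
```

Because the built-in exchanges are never replicated, a binding between a replicated queue and one of them falls into the same case as `binding_is_replicated("all", "none")`, which is false.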

1.12.8. Client Connection and Fail-over

- Clients can only connect to the primary broker. Backup brokers reject any - connection attempt by a client. Clients rejected by a backup broker will - automatically fail-over until they connect to the primary. -

- Clients are configured with the URL for the cluster (details below for - each type of client). There are two possibilities: -

  • - The URL contains multiple addresses, one for each broker in the cluster. -

  • - The URL contains a single virtual IP address - that is assigned to the primary broker by the resource manager. - This is the recommended configuration. -

- In the first case, clients will repeatedly re-try each address in the URL - until they successfully connect to the primary. In the second case the - resource manager will assign the virtual IP address to the primary broker, - so clients only need to re-try on a single address. -
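The retry behaviour in the first case can be sketched as a loop over the configured addresses. The real clients do this internally when reconnection is enabled, so this is only an illustration; `connect_to_primary` and the `try_connect` callable are hypothetical stand-ins for a client library's connect call, which is assumed to raise `ConnectionError` when a backup rejects the connection.

```python
import time


def connect_to_primary(addresses, try_connect, max_rounds=10, delay=1.0,
                       sleep=time.sleep):
    """Try each address in turn until one accepts the connection."""
    for _ in range(max_rounds):
        for address in addresses:
            try:
                # Only the primary accepts clients; backups reject them.
                return try_connect(address)
            except ConnectionError:
                continue
        # No primary yet, for example during a fail-over: wait and try again.
        sleep(delay)
    raise ConnectionError("no primary found at any of: %s" % ", ".join(addresses))
```

With a virtual IP address the `addresses` list has a single entry, and the loop reduces to retrying that one address until the new primary takes it over.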

- When the primary broker fails, clients re-try all known cluster addresses - until they connect to the new primary. The client re-sends any messages - that were previously sent but not acknowledged by the broker at the time - of the failure. Similarly messages that have been sent by the broker, but - not acknowledged by the client, are re-queued. -

- TCP can be slow to detect connection failures. A client can configure a - connection to use a heartbeat to detect connection - failure, and can specify a time interval for the heartbeat. If heartbeats - are in use, failures will be detected no later than twice the heartbeat - interval. The following sections explain how to enable heartbeat in each - client. -

- Note: the following sections explain how to configure clients with - multiple addresses, but if you are using a virtual IP address you only need - to configure that one address for clients; you don't need to list all the - addresses. -

- Suppose your cluster has 3 nodes: node1, - node2 and node3 all using the - default AMQP port, and you are not using a virtual IP address. To connect - a client you need to specify the address(es) and set the - reconnect property to true. The - following sub-sections show how to connect each type of client. -

1.12.8.1. C++ clients

- With the C++ client, you specify multiple cluster addresses in a single URL - [3]. - You also need to specify the connection option - reconnect to be true. For example: -

-qpid::messaging::Connection c("node1,node2,node3","{reconnect:true}");
-      

- Heartbeats are disabled by default. You can enable them by specifying a - heartbeat interval (in seconds) for the connection via the - heartbeat option. For example: -

-qpid::messaging::Connection c("node1,node2,node3","{reconnect:true,heartbeat:10}");
-      

1.12.8.2. Python clients

- With the Python client, you specify reconnect=True - and a list of host:port addresses as - reconnect_urls when calling - Connection.establish or - Connection.open. -

-connection = qpid.messaging.Connection.establish("node1", reconnect=True, reconnect_urls=["node1", "node2", "node3"])
-      

- Heartbeats are disabled by default. You can - enable them by specifying a heartbeat interval (in seconds) for the - connection via the 'heartbeat' option. For example: -

-connection = qpid.messaging.Connection.establish("node1", reconnect=True, reconnect_urls=["node1", "node2", "node3"], heartbeat=10)
-      

1.12.8.3. Java JMS Clients

- In Java JMS clients, client fail-over is handled automatically if it is - enabled in the connection. You can configure a connection to use - fail-over using the failover property: -

-	connectionfactory.qpidConnectionfactory = amqp://guest:guest@clientid/test?brokerlist='tcp://localhost:5672'&failover='failover_exchange'
-      

- This property can take three values: -

Fail-over Modes

failover_exchange

- If the connection fails, fail over to any other broker in the cluster. -

roundrobin

- If the connection fails, fail over to one of the brokers specified in the brokerlist. -

singlebroker

- Fail-over is not supported; the connection is to a single broker only. -

- In a Connection URL, heartbeat is set using the heartbeat property, which is an integer corresponding to the heartbeat period in seconds. For instance, the following line from a JNDI properties file sets the heartbeat timeout to 3 seconds: -

-	connectionfactory.qpidConnectionfactory = amqp://guest:guest@clientid/test?brokerlist='tcp://localhost:5672'&heartbeat='3'
-      

1.12.9. Security and Access Control.

- This section outlines the HA specific aspects of security configuration. - Please see Section 1.5, “Security” for - more details on enabling authentication and setting up Access Control Lists. -

Note

- Unless you disable authentication with auth=no in - your configuration, you must set the options below - and you must have an ACL file with at least the - entry described below. -

- Backups will be unable to connect to the primary if - the security configuration is incorrect. See also Section 1.12.12.2, “Authentication and ACL failures” -

- When authentication is enabled you must set the credentials used by HA - brokers with the following options: -

Table 1.29. HA Security Options

- HA Security Options -

ha-username USER

User name for HA brokers. Note this must not include the @QPID suffix.

ha-password PASS

Password for HA brokers.

ha-mechanism MECHANISM

-

- Mechanism for HA brokers. Any mechanism you enable for - broker-to-broker communication can also be used by a client, so - do not use ha-mechanism=ANONYMOUS in a secure environment. -

-

- This identity is used to authorize federation links from backup to - primary. It is also used to authorize actions on the backup to replicate - primary state, for example creating queues and exchanges. -

- When authorization is enabled you must have an Access Control List with the - following rule to allow HA replication to function. Suppose - ha-username=USER: -

-acl allow USER@QPID all all
-    

1.12.10. Integrating with other Cluster Resource Managers

- To integrate with a different resource manager you must configure it to: -

  • Start a qpidd process on each node of the cluster.

  • Restart qpidd if it crashes.

  • Promote exactly one of the brokers to primary.

  • Detect a failure and promote a new primary.

-

- The qpid-ha command allows you to check if a broker is - primary, and to promote a backup to primary. -

- To test if a broker is the primary: -

qpid-ha -b broker-address status --expect=primary

- This will return 0 if the broker at broker-address is the primary, - non-0 otherwise. -
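A resource manager integration could wrap this check in a small script. The sketch below uses only the command line shown above and its documented exit status; the function names `primary_check_command` and `is_primary` are hypothetical, and the `run` parameter exists only so the logic can be exercised without a broker.

```python
import subprocess


def primary_check_command(broker_address):
    """Build the qpid-ha command that tests whether a broker is the primary."""
    return ["qpid-ha", "-b", broker_address, "status", "--expect=primary"]


def is_primary(broker_address, run=subprocess.call):
    """Return True if qpid-ha exits with status 0, meaning the broker is primary."""
    return run(primary_check_command(broker_address)) == 0
```

Any non-zero exit status is treated as "not primary", which also covers the case where the broker cannot be reached at all; a real integration may want to distinguish those cases.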

- To promote a broker to primary: -

qpid-ha --cluster-manager -b broker-address promote

-

- Note that promote is considered a "cluster manager - only" command. Incorrect use of promote outside of the - cluster manager could create a cluster with multiple primaries. Such a - cluster will malfunction and lose data. "Cluster manager only" commands - are not accessible in qpid-ha without the - --cluster-manager option. -

- To list the full set of commands use: -

-qpid-ha --cluster-manager --help
-    

1.12.11. Using a message store in a cluster

- If you use a persistent store for your messages then each broker in a - cluster will have its own store. If the entire cluster fails and is - restarted, the *first* broker that becomes primary will recover from its - store. All the other brokers will clear their stores and get an update - from the primary to ensure consistency. -

1.12.12. Troubleshooting a cluster

- This section applies to clusters that are using rgmanager as the - cluster manager. -

1.12.12.1. No primary broker

- When you initially start a HA cluster, all brokers are in - joining mode. The brokers do not automatically select - a primary; they rely on the cluster manager rgmanager - to do so. If rgmanager is not running or is not - configured correctly, brokers will remain in the - joining state. See Section 1.12.5, “Configuring with rgmanager as resource manager”. -

1.12.12.2. Authentication and ACL failures

- If a broker is unable to establish a connection to another broker in the - cluster due to authentication or ACL problems the logs may contain - errors like the following: -

-info SASL: Authentication failed: SASL(-13): user not found: Password verification failed
-	

-

-warning Client closed connection with 320: User anonymous@QPID federation connection denied. Systems with authentication enabled must specify ACL create link rules.
-	

-

-warning Client closed connection with 320: ACL denied anonymous@QPID creating a federation link.
-	

-

- Set the HA security configuration and ACL file as described in Section 1.12.9, “Security and Access Control.” Once the cluster is running and the primary is - promoted, run: -

qpid-ha status --all

- to make sure that the brokers are running as one cluster. -

1.12.12.3. Slow recovery times

- The following configuration settings affect recovery time. The - values shown are examples that give fast recovery on a lightly - loaded system. You should run tests to determine if the values are - appropriate for your system and load conditions. -

cluster.conf:
-<rm status_poll_interval="1">
-	

- status_poll_interval is the interval in seconds that the - resource manager checks the status of managed services. This - affects how quickly the manager will detect failed services. -

-<ip address="20.0.20.200" monitor_link="yes" sleeptime="0"/>
-	

- This is a virtual IP address for client traffic. - monitor_link="yes" means monitor the health of the network interface - used for the VIP. sleeptime="0" means don't delay when - failing over the VIP to a new address. -

qpidd.conf
-link-maintenance-interval=0.1
-	

- Interval for backup brokers to check the link to the primary - and re-connect if need be. Default 2 seconds. Can be set lower for - faster fail-over. Setting too low will result in excessive - link-checking activity on the broker. -

-link-heartbeat-interval=5
-	

- Heartbeat interval for federation links. The HA cluster uses - federation links between the primary and each backup. The - primary can take up to twice the heartbeat interval to detect a - failed backup. When a sender sends a message the primary waits - for all backups to acknowledge before acknowledging to the - sender. A disconnected backup may cause the primary to block - senders until it is detected via heartbeat. -

- This interval is also used as the timeout for broker status - checks by rgmanager. It may take up to this interval for - rgmanager to detect a hung broker. -

- The default of 120 seconds is very high; you will probably want - to set this to a lower value. If set too low, under network - congestion or heavy load, a slow-to-respond broker may be - re-started by rgmanager. -

1.12.12.4. Total cluster failure

- Note: for definition of broker states joining, - catch-up, ready, - recovering and active see - HA Broker States -

- The cluster can only guarantee availability as long as there is at - least one active primary broker or ready backup broker left alive. - If all the brokers fail simultaneously, the cluster will fail and - non-persistent data will be lost. -

- While there is an active primary broker, clients can get service. - If the active primary fails, one of the "ready" backup - brokers will take over, recover and become active. Note that a backup - can only be promoted to primary if it is in the "ready" - state (with the exception of the first primary in a new cluster - where all brokers are in the "joining" state). -

- Given a stable cluster of N brokers with one active primary and - N-1 ready backups, the system can sustain up to N-1 failures in - rapid succession. The surviving broker will be promoted to active - and continue to give service. -
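The promotion rule above can be sketched as a small Python model. This is a simplification under stated assumptions: the state names come from the text, but the function names are hypothetical, and the special case of a brand-new cluster (where every broker is "joining") is deliberately not modelled.

```python
from typing import List, Optional

# Broker states named in the text (see HA Broker States).
JOINING, CATCH_UP, READY, RECOVERING, ACTIVE = (
    "joining", "catch-up", "ready", "recovering", "active")


def pick_new_primary(states: List[str]) -> Optional[int]:
    """Index of a broker that may be promoted, or None if there is none.

    Simplified rule from the text: only a "ready" backup can be promoted.
    """
    for index, state in enumerate(states):
        if state == READY:
            return index
    return None


def survives_primary_failure(states: List[str]) -> bool:
    """True if service can continue after the active primary fails."""
    survivors = [s for s in states if s != ACTIVE]
    return pick_new_primary(survivors) is not None


# Stable 3-broker cluster: one active primary, two ready backups.
print(survives_primary_failure([ACTIVE, READY, READY]))
# Last survivor is active while the others are still catching up.
print(survives_primary_failure([ACTIVE, CATCH_UP, JOINING]))
```

The second case is the vulnerable window: the cluster keeps serving clients, but one more failure leaves no broker eligible for promotion.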

- However, at this point the system cannot - sustain a failure of the surviving broker until at least one of - the other brokers recovers, catches up and becomes a ready backup. - If the surviving broker fails before that, the cluster will fail in - one of two modes (depending on the exact timing of failures): -

1. The cluster hangs

- All brokers are in joining or catch-up mode. rgmanager tries to - promote a new primary but cannot find any candidates and so - gives up. clustat will show that the qpidd services are running - but the qpidd-primary service has stopped, something like - this: -

-Service Name                   Owner (Last)                   State
-------- ----                   ----- ------                   -----
-service:mrg33-qpidd-service    20.0.10.33                     started
-service:mrg34-qpidd-service    20.0.10.34                     started
-service:mrg35-qpidd-service    20.0.10.35                     started
-service:qpidd-primary-service  (20.0.10.33)                   stopped
-	

- Eventually all brokers become stuck in "joining" mode, - as shown by: qpid-ha status --all -

- At this point you need to restart the cluster in one of the - following ways: -

  1. - Restart the entire cluster: - In luci:your-cluster:Nodes - click reboot to restart the entire cluster -

  2. - Stop and restart the cluster with - ccs --stopall; ccs --startall -

  3. - Restart just the Qpid services: In luci:your-cluster:Service Groups -

    1. Select all the qpidd (not qpidd-primary) services, click restart

    2. Select the qpidd-primary service, click restart

    -

  4. - Stop the qpidd-primary and - qpidd services with clusvcadm, - then restart (qpidd-primary last) -

-

2. The cluster reboots

- A new primary is promoted and the cluster is functional but all - non-persistent data from before the failure is lost. -

1.12.12.5. Fencing and network partitions

- A network partition is a network failure that divides the - cluster into two or more sub-clusters, where each broker can - communicate with brokers in its own sub-cluster but not with - brokers in other sub-clusters. This condition is also referred to - as a "split brain". -

- Nodes in one sub-cluster can't tell whether nodes in other - sub-clusters are dead or are still running but disconnected. We - cannot allow each sub-cluster to independently declare its own - qpidd primary and start serving clients, as the cluster will - become inconsistent. We must ensure only one sub-cluster continues - to provide service. -

- A quorum determines which sub-cluster - continues to operate, and power fencing - ensures that nodes in non-quorate sub-clusters cannot attempt to - provide service inconsistently. For more information see: -

- https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html-single/High_Availability_Add-On_Overview/index.html, - chapters 2. Quorum and 4. Fencing. -
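To make the role of quorum concrete, here is a minimal Python sketch of a majority rule. It assumes that every node has exactly one vote and that quorum is a strict majority of all votes. That is an illustrative assumption, not a description of how a particular cluster is configured, since real deployments may weight votes differently. The function names are hypothetical.

```python
from typing import List


def is_quorate(sub_cluster_size: int, total_nodes: int) -> bool:
    """True if a sub-cluster holds a strict majority of the votes.

    Assumption for illustration: one vote per node, simple majority.
    """
    return 2 * sub_cluster_size > total_nodes


def quorate_partitions(partition_sizes: List[int]) -> List[bool]:
    """For each sub-cluster of a partitioned cluster, say whether it is quorate."""
    total = sum(partition_sizes)
    return [is_quorate(size, total) for size in partition_sizes]


# A 3-node cluster split 2/1, and a 4-node cluster split 2/2.
print(quorate_partitions([2, 1]))
print(quorate_partitions([2, 2]))
```

Under a strict-majority rule at most one sub-cluster can be quorate, which is what prevents two sub-clusters from each promoting a primary. An even split leaves no sub-cluster quorate.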



[1] - You can control the maximum number of messages in the buffer by setting the - client's capacity. For details of how to set the capacity - in client code see "Using the Qpid Messaging API" in - Programming in Apache Qpid. -

[2] - Clients must use "at-least-once" reliability to enable re-send of unacknowledged - messages. This is the default behaviour; no options need be set to enable it. For - details of client addressing options see "Using the Qpid Messaging API" - in Programming in Apache Qpid. -

[3] - The full grammar for the URL is: -

-url = ["amqp:"][ user ["/" password] "@" ] addr ("," addr)*
-addr = tcp_addr / rdma_addr / ssl_addr / ...
-tcp_addr = ["tcp:"] host [":" port]
-rdma_addr = "rdma:" host [":" port]
-ssl_addr = "ssl:" host [":" port]
-	  
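As an illustration of the grammar above, the following Python sketch parses a URL into its parts. It is not the parser used by Qpid. The names `Address`, `BrokerUrl` and `parse_broker_url` are hypothetical, only the tcp, rdma and ssl address forms are handled (the address types elided by "..." in the grammar are not), and hosts are assumed to contain no ":" characters.

```python
import re
from typing import List, NamedTuple, Optional


class Address(NamedTuple):
    protocol: str
    host: str
    port: Optional[int]


class BrokerUrl(NamedTuple):
    user: Optional[str]
    password: Optional[str]
    addresses: List[Address]


# addr = [protocol ":"] host [":" port]; an unprefixed address is a tcp_addr.
_ADDR = re.compile(r"^(?:(tcp|rdma|ssl):)?([^:,@/]+)(?::(\d+))?$")


def parse_broker_url(url: str) -> BrokerUrl:
    """Split a broker URL into optional credentials and a list of addresses."""
    rest = url[len("amqp:"):] if url.startswith("amqp:") else url
    user = password = None
    if "@" in rest:
        credentials, rest = rest.split("@", 1)
        user, _, secret = credentials.partition("/")
        password = secret or None
    addresses = []
    for item in rest.split(","):
        match = _ADDR.match(item)
        if match is None:
            raise ValueError(f"unsupported address: {item!r}")
        protocol, host, port = match.groups()
        addresses.append(
            Address(protocol or "tcp", host, int(port) if port else None))
    return BrokerUrl(user, password, addresses)


print(parse_broker_url("amqp:guest/secret@node1:5672,ssl:node2"))
```

A missing port is reported as `None` rather than filled in, because the grammar itself does not state a default port.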
- - http://git-wip-us.apache.org/repos/asf/qpid-site/blob/fb1899b6/content/releases/qpid-cpp-0.34/cpp-broker/book/css/style.css ---------------------------------------------------------------------- diff --git a/content/releases/qpid-cpp-0.34/cpp-broker/book/css/style.css b/content/releases/qpid-cpp-0.34/cpp-broker/book/css/style.css deleted file mode 100644 index c681596..0000000 --- a/content/releases/qpid-cpp-0.34/cpp-broker/book/css/style.css +++ /dev/null @@ -1,279 +0,0 @@ -/* - * - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- * - */ -ul { - list-style-type:square; -} - -th { - font-weight: bold; -} - -.navfooter td { - font-size:10pt; -} - -.navheader td { - font-size:10pt; -} - -body { - margin:0; - background:#FFFFFF; - font-family:"Verdana", sans-serif; - font-size:10pt; -} - -.container { - width:950px; - margin:0 auto; -} - -body a { - color:#000000; -} - - -div.book { - margin-left:10pt; - margin-right:10pt; -} - -div.preface { - margin-left:10pt; - margin-right:10pt; -} - -div.chapter { - margin-left:10pt; - margin-right:10pt; -} - -div.section { - margin-left:10pt; - margin-right:10pt; -} - -div.titlepage { - margin-left:-10pt; - margin-right:-10pt; -} - -.calloutlist td { - font-size:10pt; -} - -.table-contents table { - border-spacing: 0px; -} - -.table-contents td { - font-size:10pt; - padding-left:6px; - padding-right:6px; -} - -div.breadcrumbs { - font-size:9pt; - margin-right:10pt; - padding-bottom:16px; -} - -.chapter h2.title { - font-size:20pt; - color:#0c3b82; -} - -.chapter .section h2.title { - font-size:18pt; - color:#0c3b82; -} - -.section h2.title { - font-size:16pt; - color:#0c3b82; -} - -.section h3.title { - font-size:14pt; - color:#0c3b82; -} - -.section h4.title { - font-size:12pt; - color:#0c3b82; -} - -.section h5.title { - font-size:12pt; - color:#0c3b82; -} - -.section h6.title { - font-size:12pt; - color:#0c3b82; -} - -.toc a { - font-size:9pt; -} - -.header { - height:100px; - width:950px; - background:url(http://qpid.apache.org/images/header.png) -} - -.logo { - text-align:center; - font-weight:600; - padding:0 0 0 0; - font-size:14px; - font-family:"Verdana", cursive; -} - -.logo a { - color:#000000; - text-decoration:none; -} - -.main_text_area { - margin-left:200px; -} - -.main_text_area_top { - height:14px; - font-size:1px; -} - -.main_text_area_bottom { - display:none; -/* height:14px; - margin-bottom:4px;*/ -} - -.main_text_area_body { - padding:5px 24px; -} - -.main_text_area_body p { - text-align:justify; -} - -.main_text_area br { - 
line-height:10px; -} - -.main_text_area h1 { - font-size:28px; - font-weight:600; - margin:0 0 24px 0; - color:#0c3b82; - font-family:"Verdana", Times, serif; -} - -.main_text_area h2 { - font-size:24px; - font-weight:600; - margin:24px 0 8px 0; - color:#0c3b82; - font-family:"Verdana",Times, serif; -} - -.main_text_area ol, .main_text_area ul { - padding:0; - margin:10px 0; - margin-left:20px; -} - -.main_text_area li { -/* margin-left:40px; */ -} - -.main_text_area, .menu_box { - font-size:13px; - line-height:17px; - color:#000000; -} - -.main_text_area { - font-size:14px; -} - -.main_text_area a { - color:#000000; -} - -.main_text_area a:hover { - color:#000000; -} - -.menu_box { - width:196px; - float:left; - margin-left:4px; -} - -.menu_box_top { - background:url(http://qpid.apache.org/images/menu_top.png) no-repeat; - height:14px; - font-size:1px; -} - -.menu_box_body { - background:url(http://qpid.apache.org/images/menu_body.png) repeat-y; - padding:5px 24px 5px 24px; -} - -.menu_box_bottom { - background:url(http://qpid.apache.org/images/menu_bottom.png) no-repeat; - height:14px; - font-size:1px; - margin-bottom:1px; -} - -.menu_box h3 { - font-size:20px; - font-weight:500; - margin:0 0 8px 0; - color:#0c3b82; - font-family:"Verdana",Times, serif; -} - -.menu_box ul { - margin:12px; - padding:0px; -} - -.menu_box li { - list-style:square; -} - -.menu_box a { - color:#000000; - text-decoration:none; -} - -.menu_box a:hover { - color:#000000; - text-decoration:underline; -} - - http://git-wip-us.apache.org/repos/asf/qpid-site/blob/fb1899b6/content/releases/qpid-cpp-0.34/cpp-broker/book/ha-queue-replication.html ---------------------------------------------------------------------- diff --git a/content/releases/qpid-cpp-0.34/cpp-broker/book/ha-queue-replication.html b/content/releases/qpid-cpp-0.34/cpp-broker/book/ha-queue-replication.html deleted file mode 100644 index 783fb2e..0000000 --- 
a/content/releases/qpid-cpp-0.34/cpp-broker/book/ha-queue-replication.html +++ /dev/null @@ -1,221 +0,0 @@ - - - - - 1.13. Replicating Queues with the HA module - Apache Qpid™ - - - - - - - - - - - - - -
1.13. Replicating Queues with the HA module

- As well as support for an active-passive cluster, the - HA module allows you to replicate individual queues, - even if the brokers are not in a cluster. The original - queue is used as normal. The replica queue is - updated automatically as messages are added to or removed from the original - queue. -

Warning

- It is not safe to modify the replica queue - other than via the automatic updates from the original. Adding or removing - messages on the replica queue will make replication inconsistent and may - cause message loss. - The HA module does not enforce - restricted access to the replica queue (as it does in the case of a cluster) - so it is up to the application to ensure the replica is not used until it has - been disconnected from the original. -

1.13.1. Replicating queues

- To create a replica queue, the HA module must be - loaded on both the original and replica brokers (it is loaded by default). - You also need to set the configuration option: -

-	ha-queue-replication=yes
-      

- to enable this feature on a stand-alone broker. It is automatically - enabled for brokers that are part of a cluster. -

- Suppose that myqueue is a queue on - node1 and we want to create a replica of - myqueue on node2 (where both brokers - are using the default AMQP port). This is accomplished by the command: -

-	qpid-config --broker=node2 add queue --start-replica node1 myqueue
-      

- If myqueue already exists on the replica - broker you can start replication from the original queue like this: -

-	qpid-ha replicate -b node2 node1 myqueue
-      

-

1.13.2. Replicating queues between clusters

- You can replicate queues between two standalone brokers, between a - standalone broker and a cluster, or between two clusters (see Section 1.12, “Active-Passive Messaging Clusters”). For failover in a cluster there are two cases to - consider. -

  1. - When the original queue is on the active node - of a cluster, failover is automatic. If the active node - fails, the replication link will automatically reconnect and the - replica will continue to be updated from the new primary. -

  2. - When the replica queue is on the active node of a - cluster, there is no automatic failover. However, you can use the - following workaround. -

1.13.2.1. Workaround for fail-over of replica queue in a cluster

- When a primary broker fails, the cluster resource manager calls a script - to promote a backup broker to be the new primary. By default this script - is /etc/init.d/qpidd-primary but you can modify - that in your cluster.conf file (see Section 1.12.5, “Configuring with rgmanager as resource manager”). -

- You can modify this script (on each host in your cluster) by adding - commands to create your replica queues just before the broker is - promoted, as indicated in the following excerpt from the script: -

-start() {
-    service qpidd start
-    echo -n $"Promoting qpid daemon to cluster primary: "
-    ################################
-    #### Add your commands here ####
-    ################################
-    $QPID_HA -b localhost:$QPID_PORT promote
-    [ "$?" -eq 0 ] && success || failure
-}
-	

- Your commands will be run, and your replicas created, whenever - the system fails over to a new primary. -
