activemq-users mailing list archives

From Clebert Suconic <clebert.suco...@gmail.com>
Subject Re: Problems setting up replicated ha-policy.
Date Mon, 30 Jan 2017 19:16:43 GMT
As Justin pointed out, look at the Network Health Check, or use better
network infrastructure to avoid split brain.
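
For reference, that check is configured in broker.xml (inside the <core> element) by
listing one or more addresses or URLs the broker must be able to ping, the idea being
that a broker which cannot reach any of them shuts itself down rather than staying
live on an isolated network. A rough sketch only; the address, URL and intervals are
placeholders you would replace for your own environment:

    <network-check-period>10000</network-check-period>
    <network-check-timeout>1000</network-check-timeout>
    <network-check-list>10.0.0.1</network-check-list>
    <network-check-URL-list>http://www.apache.org</network-check-URL-list>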

On Mon, Jan 30, 2017 at 11:48 AM, Justin Bertram <jbertram@apache.com> wrote:
>> It does what I think it does: now both my slave and my master are active. This, however, is acceptable, no problems yet.
>
> Actually, this is a problem.  This is the classic split-brain scenario.  Since both your
> master and slave are active with the same messages you will lose data integrity.  Once the
> network connection between the live and (now active) backup is restored there is nothing which
> can be done to re-integrate the data since there is no way of knowing which broker has the
> right data.  This is the risk you run with a single live and backup.  To mitigate the risk
> of split-brain you have a couple of options:
>
>   1) Invest in redundant network infrastructure (e.g. multiple NICs on each machine,
>      redundant network switches, etc.).  Obviously you'll need to perform a cost/risk
>      analysis here to determine how much your data is actually worth.
>   2) Configure a larger cluster of live/backup pairs so that if a connection between
>      nodes is lost a quorum vote can (hopefully) prevent the illegitimate activation
>      of a backup.
>   3) Similar to #2 you can use the recently added "network check" functionality [1].
>
>
> Justin
>
>
> [1] http://activemq.apache.org/artemis/docs/1.5.2/network-isolation.html
>
> ----- Original Message -----
> From: "Gerrit Tamboer" <Gerrit.Tamboer@crv4all.com>
> To: users@activemq.apache.org
> Sent: Monday, January 30, 2017 10:03:42 AM
> Subject: Re: Problems setting up replicated ha-policy.
>
> Hi Clebert,
>
> Thanks for pointing me in the right direction, I was able to set up replication with
> active/passive failover.
>
> I am able to stop or kill the master and the slave responds to it. If I start up the
> master again, the slave replicates back to the master and the master becomes active
> again. So far so good.
>
> So what I simulated next is a network outage. I did this by simply making sure that the
> master cannot connect to the slave and vice versa (VirtualBox, setting the network
> adapter to disabled).
> It does what I think it does: now both my slave and my master are active. This, however,
> is acceptable, no problems yet. But when I enable the network adapter again, making sure
> the master and slave can connect, it does not fail back. The slave stays active, as well
> as the master, and they don’t seem to communicate. Is this some sort of split-brain
> situation?
>
> Regards,
> Gerrit
>
>
> On 27/01/17 21:25, "Clebert Suconic" <clebert.suconic@gmail.com> wrote:
>
> The only issue I found is how you are defining this:
>
> <connector name="localhost">tcp://localhost:61616</connector>
>
> On the cluster connection you are passing localhost as the node. That value is
> sent to the backup, and the backup will try to connect to localhost, which is
> itself, so it won't actually connect to the other node.
>
>
> You should pass in a valid IP that will be valid on the second node.
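>
> For example, something roughly like this, where the addresses are only placeholders
> for the real IPs of each node; the connector-ref on the cluster connection is what
> gets sent to the other nodes, so it must be an address they can actually reach:
>
>    <connectors>
>       <connector name="node0">tcp://192.168.1.10:61616</connector>
>       <connector name="node1">tcp://192.168.1.11:61616</connector>
>    </connectors>
>
>    <cluster-connections>
>       <cluster-connection name="my-cluster">
>          <connector-ref>node0</connector-ref>
>          <static-connectors>
>             <connector-ref>node1</connector-ref>
>          </static-connectors>
>       </cluster-connection>
>    </cluster-connections>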
>
>
> Hope this helps...
>
>
> Look at the examples/features/ha/replicated-failback-static example
>
> On Fri, Jan 27, 2017 at 9:28 AM, Clebert Suconic
> <clebert.suconic@gmail.com> wrote:
>> I won't be able to get to a computer today. Only on Monday.
>>
>>
>> Meanwhile, can you compare your config with the replicated examples from the
>> release? That's what I would do anyway.
>>
>>
>> Try with a single live/backup pair.  Make sure the ID matches on the backup so it
>> can pull the data.
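>>
>> As a sketch of one way that pairing is commonly expressed, via a matching group-name
>> in the ha-policy (the name "pair-a" here is just a placeholder; it only has to be
>> identical on both sides), the live broker would have something like:
>>
>>    <ha-policy>
>>       <replication>
>>          <master>
>>             <group-name>pair-a</group-name>
>>             <check-for-live-server>true</check-for-live-server>
>>          </master>
>>       </replication>
>>    </ha-policy>
>>
>> and the backup the matching:
>>
>>    <ha-policy>
>>       <replication>
>>          <slave>
>>             <group-name>pair-a</group-name>
>>             <allow-failback>true</allow-failback>
>>          </slave>
>>       </replication>
>>    </ha-policy>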
>>
>> Let me know how it goes. I may find a time to open a computer this
>> afternoon.
>>
>> On Fri, Jan 27, 2017 at 5:32 AM Gerrit Tamboer <Gerrit.Tamboer@crv4all.com>
>> wrote:
>>>
>>> Hi Clebert,
>>>
>>> Thanks for pointing this out.
>>>
>>> I just tested 1.5.2 but unfortunately the results are exactly the same. No
>>> failover happens, although the slave sees the master going down. The slave
>>> does not even notice the master being gone after a kill -9.
>>>
>>> This leads me to believe I have a misconfiguration, because if this is
>>> designed to work like this, it’s not really HA.
>>>
>>> I have added the broker.xml’s of all nodes to this mail again, hopefully
>>> somebody has a similar setup and can verify the configuration.
>>>
>>> Thanks a bunch!
>>>
>>> Regards,
>>> Gerrit Tamboer
>>>
>>>
>>> On 27/01/17 04:33, "Clebert Suconic" <clebert.suconic@gmail.com> wrote:
>>>
>>> Until recently (1.5.0) you would only have the TTL to decide when to
>>> activate the backup.
>>>
>>>
>>> More recently, connection failures also play into the decision to activate
>>> it.
>>>
>>>
>>> So on 1.3.0 you will be bound to the TTL of the cluster connection.
>>>
>>>
>>> On 1.5.2 it should work with a kill, but you would still be bound to the TTL
>>> in case of a cable cut or a switch outage; that's just the nature of TCP/IP.
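>>>
>>> For reference, that TTL is the connection-ttl on the cluster connection in
>>> broker.xml; the value below (in milliseconds) is only an illustration and
>>> would go inside your existing <cluster-connection> element:
>>>
>>>    <connection-ttl>30000</connection-ttl>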
>>>
>>> On Thu, Jan 26, 2017 at 7:24 AM Gerrit Tamboer
>>> <Gerrit.Tamboer@crv4all.com>
>>> wrote:
>>>
>>> > Forgot to send the attachments!
>>> >
>>> >
>>> >
>>> > From: Gerrit Tamboer <Gerrit.Tamboer@crv4all.com>
>>> > Date: Thursday 26 January 2017 at 13:23
>>> > To: "users@activemq.apache.org" <users@activemq.apache.org>
>>> > Subject: Problems setting up replicated ha-policy.
>>> >
>>> >
>>> >
>>> > Hi community,
>>> >
>>> >
>>> >
>>> > We are attempting to set up a 3-node Artemis (1.3.0) cluster with an
>>> > active-passive failover configuration. We see that the master node is
>>> > actively accepting connections:
>>> >
>>> >
>>> >
>>> > 09:52:30,167 INFO  [org.apache.activemq.artemis.core.server] AMQ221000:
>>> > live Message Broker is starting with configuration Broker Configuration
>>> > (clustered=true
>>> >
>>> > ,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=/opt/jamq_paging_data/data)
>>> >
>>> > 09:52:33,176 INFO  [org.apache.activemq.artemis.core.server] AMQ221020:
>>> > Started Acceptor at 0.0.0.0:61616 for protocols
>>> > [CORE,MQTT,AMQP,HORNETQ,STOMP,OPENWIRE]
>>> >
>>> >
>>> >
>>> > The slaves are able to connect to the master and are reporting that they
>>> > are in standby mode:
>>> >
>>> >
>>> >
>>> > 08:16:57,426 INFO  [org.apache.activemq.artemis.core.server] AMQ221000:
>>> > backup Message Broker is starting with configuration Broker Configuration
>>> > (clustered=true,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/large-messages,pagingDirectory=/opt/jamq_paging_data/data)
>>> >
>>> > 08:18:38,529 INFO  [org.apache.activemq.artemis.core.server] AMQ221109:
>>> > Apache ActiveMQ Artemis Backup Server version 1.3.0 [null] started, waiting
>>> > live to fail before it gets active
>>> >
>>> >
>>> >
>>> > However, when I kill the master node now, it reports that the master is
>>> > gone, but it does not become active itself:
>>> >
>>> >
>>> >
>>> > 08:20:14,987 WARN  [org.apache.activemq.artemis.core.client] AMQ212037:
>>> > Connection failure has been detected: AMQ119015: The connection was
>>> > disconnected because of server shutdown [code=DISCONNECTED]
>>> >
>>> >
>>> >
>>> > When I do a kill -9 on the PID of the master java process, it does not
>>> > even report that the master has gone away.
>>> >
>>> > I also tested this in Artemis 1.5.1, with the same results. Removing
>>> > one of the slaves (to have a simple master-slave setup) also does not
>>> > work.
>>> >
>>> > My expectation is that if the master dies, one of the slaves becomes
>>> > active.
>>> >
>>> > Attached you will find the broker.xml of all 3 nodes.
>>> >
>>> >
>>> >
>>> > Thanks in advance for the help!
>>> >
>>> >
>>> >
>>> > Kind regards,
>>> >
>>> > Gerrit Tamboer
>>> >
>>> >
>>> >
>>> >
>>> >
>>> --
>>> Clebert Suconic
>>>
>>>
>>
>> --
>> Clebert Suconic
>
>
>
> --
> Clebert Suconic
>
>



-- 
Clebert Suconic
