activemq-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Clebert Suconic <clebert.suco...@gmail.com>
Subject Re: [Artemis] Master fails to start up after failback
Date Mon, 19 Oct 2015 19:15:08 GMT
Looks related to me:

https://issues.apache.org/jira/browse/ARTEMIS-256



On Mon, Oct 19, 2015 at 4:04 AM, Mihkel Nõges
<mihkel.noges@transferwise.com> wrote:
> Basic flow of getting unresponsive failback cluster:
> Have machine with Ubuntu 14.04.3
>
>    1. Install libaio1, Java 1.8.0_60, maven 3.3.3, download and extract
>    apache-artemis-1.1.0-bin
>    <http://www.eu.apache.org/dist/activemq/activemq-artemis/1.1.0/apache-artemis-1.1.0-bin.tar.gz>
> in
>    /opt
>    2. run $ mvn -Prelease install and $ mnv verify in
>       /opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback
>    SUCCESS
>    3. Clean data folders and starts both servers manually:
>    $
>    cd  /opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target
>    $ rm -R server0/data/
>    $ rm -R server1/data/
>    $ server0/bin/artemis-service start
>    Starting artemis-service
>    artemis-service is now running (7154)
>    $ server1/bin/artemis-service start
>    Starting artemis-service
>    artemis-service is now running (7180)
>    4. Kill master server and wait for slave to take over
>    $ kill -9 7154
>
>    $ tail -f server1/log/artemis.log
>    08:52:54,798 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
>    Protocol module found: [artemis-stomp-protocol]. Adding protocol support
>    for: STOMP
>    08:53:02,145 INFO  [org.apache.activemq.artemis.core.server] AMQ221109:
>    Apache ActiveMQ Artemis Backup Server version 1.1.0 [null] started, waiting
>    live to fail before it gets active
>    08:53:03,582 INFO  [org.apache.activemq.artemis.core.server] AMQ221024:
>    Backup server
>    ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 is
>    synchronized with live-server.
>    08:53:03,777 INFO  [org.apache.activemq.artemis.core.server] AMQ221031:
>    backup announced
>    08:55:59,292 INFO  [org.apache.activemq.artemis.core.server] AMQ221037:
>    ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 to
>    become 'live'
>    08:55:59,302 WARN  [org.apache.activemq.artemis.core.client] AMQ212004:
>    Failed to connect to server.
>    08:55:59,778 INFO  [org.apache.activemq.artemis.core.server] AMQ221003:
>    trying to deploy queue jms.queue.exampleQueue
>    08:55:59,829 WARN  [org.apache.activemq.artemis.core.client] AMQ212034:
>    There are more than one servers on the network broadcasting the same node
>    id. You will see this message exactly once (per node) if a node is
>    restarted, in which case it can be safely ignored. But if it is logged
>    continuously it means you really do have more than one node on the same
>    network active concurrently with the same node id. This could occur if you
>    have a backup node active at the same time as its live node.
>    nodeID=64ddff0f-7636-11e5-bfa8-f5004e6195f0
>    08:55:59,836 INFO  [org.apache.activemq.artemis.core.server] AMQ221007:
>    Server is now live
>    08:55:59,869 INFO  [org.apache.activemq.artemis.core.server] AMQ221020:
>    Started Acceptor at broker3:61617 for protocols
>    [CORE,MQTT,AMQP,HORNETQ,STOMP,OPENWIRE]
>    5.
>
>    Start master again and observer the logs:
>    $ server0/bin/artemis-service start
>    Starting artemis-service
>    artemis-service is now running (7388)
>
> $ tail -f server0/log/artemis.log
> 08:57:24,625 INFO  [org.apache.activemq.artemis.core.server] AMQ221012:
> Using AIO Journal
> 08:57:24,694 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
> Protocol module found: [artemis-server]. Adding protocol support for: CORE
> 08:57:24,702 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
> Protocol module found: [artemis-amqp-protocol]. Adding protocol support
> for: AMQP
> 08:57:24,731 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
> Protocol module found: [artemis-hornetq-protocol]. Adding protocol support
> for: HORNETQ
> 08:57:24,733 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
> Protocol module found: [artemis-mqtt-protocol]. Adding protocol support
> for: MQTT
> 08:57:24,743 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
> Protocol module found: [artemis-openwire-protocol]. Adding protocol support
> for: OPENWIRE
> 08:57:24,878 INFO  [org.apache.activemq.artemis.core.server] AMQ221043:
> Protocol module found: [artemis-stomp-protocol]. Adding protocol support
> for: STOMP
> 08:57:25,082 INFO  [org.apache.activemq.artemis.core.server] AMQ221109:
> Apache ActiveMQ Artemis Backup Server version 1.1.0 [null] started, waiting
> live to fail before it gets active
> 08:57:27,043 INFO  [org.apache.activemq.artemis.core.server] AMQ221024:
> Backup server
> ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 is
> synchronized with live-server.
> 08:57:27,948 INFO  [org.apache.activemq.artemis.core.server] AMQ221031:
> backup announced
> 08:57:31,227 WARN  [org.apache.activemq.artemis.core.client] AMQ212037:
> Connection failure has been detected: AMQ119015: The connection was
> disconnected because of server shutdown [code=DISCONNECTED]
> 08:57:31,252 WARN  [org.apache.activemq.artemis.core.client] AMQ212037:
> Connection failure has been detected: AMQ119015: The connection was
> disconnected because of server shutdown [code=DISCONNECTED]
> 08:57:31,307 WARN  [org.apache.activemq.artemis.core.client] AMQ212037:
> Connection failure has been detected: AMQ119015: The connection was
> disconnected because of server shutdown [code=DISCONNECTED]
> 08:57:31,339 INFO  [org.apache.activemq.artemis.core.server] AMQ221037:
> ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 to
> become 'live'
> 08:57:31,360 WARN  [org.apache.activemq.artemis.core.client] AMQ212004:
> Failed to connect to server.
> 08:57:31,413 ERROR [org.apache.activemq.artemis.core.server] AMQ224008:
> Failed to store id: java.lang.IllegalStateException: Cannot find add info 1
> at
> org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799)
> [artemis-journal-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183)
> [artemis-journal-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79)
> [artemis-journal-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194)
> [artemis-server-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152)
> [artemis-server-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75)
> [artemis-server-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java:1784)
> [artemis-server-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:1625)
> [artemis-server-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:1535)
> [artemis-server-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)
> [artemis-server-1.1.0.jar:1.1.0]
> at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
> 08:57:31,540 WARN  [org.apache.activemq.artemis.core.server] AMQ222173:
> Queue jms.queue.exampleQueue is duplicated during reload. This queue will
> be renamed as jms.queue.exampleQueue-0
> 08:57:31,550 ERROR [org.apache.activemq.artemis.core.server] AMQ224000:
> Failure in initialisation: java.lang.IllegalStateException: Cursor 2 had
> already been created
> at
> org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97)
> [artemis-server-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140)
> [artemis-server-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:1631)
> [artemis-server-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:1535)
> [artemis-server-1.1.0.jar:1.1.0]
> at
> org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)
> [artemis-server-1.1.0.jar:1.1.0]
> at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
>
>
> On 19 October 2015 at 10:31, Mihkel Nõges <mihkel.noges@transferwise.com>
> wrote:
>
>> Hi Clebert,
>>
>> I do not have other code to share with you but the example code in Artemis
>> 1.1.0 binary deployment package. I'm running
>> org.apache.activemq.artemis.jms.example.ReplicatedFailbackExample
>>
>> And only commented out the serverStart and killServer calls which I am
>> doing manually.
>>
>> I do not think I do any of the steps too fast as I tail the server log
>> files in parallel and see everything is finished when I start the fail
>> back. I have waited many minutes in between.
>>
>> Only changes in configuration to the test is changing localhost addresses
>> with broker3 to make the cluster accessible remotely.
>>
>> BR!
>> MIhkel
>>
>> On 18 October 2015 at 17:49, Clebert <clebert.suconic@gmail.com> wrote:
>>
>>> Im not on my computer now but it sounds like you are doing a fail back
>>> immediately after failed over. It takes some time (seconds) to the server
>>> to activate on the backup.
>>>
>>> Later the server will need to copy the data back before it can be
>>> activated in fail back mode.
>>>
>>> It sounds the live is not reaching backup for fail back.
>>>
>>> I will try looking it at it on Monday. Maybe you could post your example
>>> at your GitHub fork.
>>>
>>> -- Clebert Suconic typing on the iPhone.
>>>
>>> > On Oct 18, 2015, at 08:15, Mihkel Nõges <mihkel.noges@transferwise.com>
>>> wrote:
>>> >
>>> > Hello again!
>>> >
>>> > I would be very grateful If someone could answer my questions. We need
>>> the high availability to work to use the broker in production.
>>> >
>>> > When I run the replicated-failback example in one machine (broker3) it
>>> succeeds.
>>> >
>>> > It fails when I run the same test - exactly the same servers with
>>> slightly modified client remotely.
>>> >
>>> > I run client in debug mode from my IDE with commented out serverStart
>>> and killServer calls.
>>> > Deleted data folders and started the servers:
>>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
>>> rm -R server0/data/
>>> >
>>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
>>> rm -R server1/data/
>>> >
>>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
>>> server0/bin/artemis-service start
>>> >
>>> > Starting artemis-service
>>> >
>>> > artemis-service is now running (23357)
>>> >
>>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
>>> server1/bin/artemis-service start
>>> >
>>> > Starting artemis-service
>>> >
>>> > artemis-service is now running (23383)
>>> >
>>> > Starting client and stopping on breakpoint at line 103:
>>> > //ServerUtil.killServer(server0);
>>> > // Step 11. Acknowledging the 2nd half of the sent messages will fail
>>> as failover to the
>>> > // backup server has occurred
>>> > try {
>>> >    message0.acknowledge();  //line 103
>>> > killing server0
>>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
>>> kill -9 23357
>>> >
>>> > Proceeding to breakpoint at line 121:
>>> > //server0 = ServerUtil.startServer(args[0],
>>> ReplicatedFailbackExample.class.getSimpleName() + "0", 0, 10000);
>>> >
>>> > // Step 11. Acknowledging the 2nd half of the sent messages will fail
>>> as failover to the
>>> > // backup server has occurred
>>> > try {
>>> >    message0.acknowledge(); // line 121
>>> > Starting server0:
>>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
>>> server0/bin/artemis-service start
>>> >
>>> > Starting artemis-service
>>> >
>>> > artemis-service is now running (24240)
>>> >
>>> > Server0 writes ERROR to it's log (see attached server0_artemis.log).
>>> > Now when trying to proceed with the client it writes the following in
>>> the log and does not exit, but remains hanging forever:
>>> >
>>> > Oct 18, 2015 2:55:34 PM
>>> org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl
>>> fail
>>> >
>>> > WARN: AMQ212037: Connection failure has been detected: AMQ119015: The
>>> connection was disconnected because of server shutdown [code=DISCONNECTED]
>>> >
>>> > Got message: This is text message 20 (redelivered?: false)
>>> >
>>> > Got exception while acknowledging message: AMQ119014: Timed out after
>>> waiting 30,000 ms for response when sending packet 43
>>> >
>>> > Got message: This is text message 21 (redelivered?: false)
>>> >
>>> > Got message: This is text message 22 (redelivered?: false)
>>> >
>>> > Got message: This is text message 23 (redelivered?: false)
>>> >
>>> > Got message: This is text message 24 (redelivered?: false)
>>> >
>>> > Got message: This is text message 25 (redelivered?: false)
>>> >
>>> > Got message: This is text message 26 (redelivered?: false)
>>> >
>>> > Got message: This is text message 27 (redelivered?: false)
>>> >
>>> > Got message: This is text message 28 (redelivered?: false)
>>> >
>>> > Got message: This is text message 29 (redelivered?: false)
>>> >
>>> > As a result the slave (server1) remains stopped, not restarted as
>>> expected and the master (server0) process appears to be running but does
>>> not accept any connections.
>>> >
>>> > Exactly the same behavior is observable every time I try this.
>>> >
>>> > BR!
>>> > Mihkel
>>> >
>>> >> On 13 October 2015 at 20:17, Mihkel Nõges <
>>> mihkel.noges@transferwise.com> wrote:
>>> >> Hi Clebert,
>>> >>
>>> >> No test, just doing it on command line with standalone servers. I'm
>>> using 1.1.0 installed with wget, not the snapshot.
>>> >>
>>> >> I'm wondering what should be the suggested procedure for admins to do
>>> changes to HA cluster of 2 or 3 nodes of Artemis. If one of the nodes is
>>> master by configuration, do they need to change it's config before
>>> restarting it to become slave to have seamless change process and make some
>>> instance master by configuration only if all the instances need to be
>>> restarted?
>>> >>
>>> >> I tried also a cluster with 2 masters and 2 slaves with 2 separate
>>> group-name values, but for some reason the second master I started became
>>> slave for the first immediately. I expected it to become a clustered
>>> load-balancing parallel master. Our loads are not yet that high to require
>>> more than one master, so it's just a theoretical question.
>>> >>
>>> >> BR!
>>> >> Mihkel
>>> >>
>>> >>> On 13 October 2015 at 20:03, Clebert Suconic <
>>> clebert.suconic@gmail.com> wrote:
>>> >>> The master needs to copy its data from the backup back to live before
>>> >>> it's activated.
>>> >>>
>>> >>> Do you have a test replicating this?
>>> >>>
>>> >>> Did you try the snapshot build?
>>> >>>
>>> >>> On Tue, Oct 13, 2015 at 11:58 AM, Mihkel Nõges
>>> >>> <mihkel.noges@transferwise.com> wrote:
>>> >>> > Hi,
>>> >>> >
>>> >>> > I configured replicating HA master-slave of Artemis 1.1.0 instances
>>> on
>>> >>> > Ubuntu 14.04.3.
>>> >>> >
>>> >>> > When I kill master the slave takes over as expected and starts
>>> serving as
>>> >>> > new master. When I then start the old master, it fails with
the
>>> following
>>> >>> > errors in the log:
>>> >>> >
>>> >>> > 16:35:46,476 ERROR [org.apache.activemq.artemis.core.server]
>>> AMQ224008:
>>> >>> > Failed to store id: java.lang.IllegalStateException: Cannot
find
>>> add info 1
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799)
>>> >>> > [artemis-journal-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183)
>>> >>> > [artemis-journal-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79)
>>> >>> > [artemis-journal-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java:
>>> 1784)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:
>>> 1625)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:
>>> 1535)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
>>> >>> >
>>> >>> > 16:35:46,572 WARN  [org.apache.activemq.artemis.core.server]
>>> AMQ222173:
>>> >>> > Queue jms.queue.DLQ is duplicated during reload. This queue
will be
>>> renamed
>>> >>> > as jms.queue.DLQ-0
>>> >>> > 16:35:46,572 ERROR [org.apache.activemq.artemis.core.server]
>>> AMQ224000:
>>> >>> > Failure in initialisation: java.lang.IllegalStateException:
Cursor
>>> 2 had
>>> >>> > already been created
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:
>>> 1631)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:
>>> 1535)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at
>>> >>> >
>>> org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249)
>>> >>> > [artemis-server-1.1.0.jar:1.1.0]
>>> >>> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
>>> >>> >
>>> >>> > As a result both master and the slave remain unaccessible and
no
>>> further
>>> >>> > restarts solve the situation.
>>> >>> >
>>> >>> > Attached also master and slave broker.xml files.
>>> >>> >
>>> >>> > BR!
>>> >>> >
>>> >>> > Mihkel Nõges
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Clebert Suconic
>>> >
>>>
>>
>>



-- 
Clebert Suconic

Mime
View raw message