activemq-users mailing list archives

From Mihkel Nõges <mihkel.no...@transferwise.com>
Subject Re: [Artemis] Master fails to start up after failback
Date Sun, 18 Oct 2015 12:15:55 GMT
Hello again!

I would be very grateful if someone could answer my questions. We need
high availability to work before we can use the broker in production.

When I run the replicated-failback example on a single machine (broker3), it
succeeds.

It fails when I run the same test remotely: exactly the same servers, with a
slightly modified client.

I run the client in debug mode from my IDE, with the startServer and
killServer calls commented out.

   1. Deleted data folders and started the servers:

   artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
   rm -R server0/data/

   artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
   rm -R server1/data/

   artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
   server0/bin/artemis-service start

   Starting artemis-service

   artemis-service is now running (23357)

   artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
   server1/bin/artemis-service start

   Starting artemis-service

   artemis-service is now running (23383)
   2. Started the client and stopped at the breakpoint on line 103:

   //ServerUtil.killServer(server0);

   // Step 11. Acknowledging the 2nd half of the sent messages will fail as
   // failover to the backup server has occurred
   try {

      message0.acknowledge();  //line 103
   3. Killed server0:

   artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
   kill -9 23357
   4. Proceeded to the breakpoint on line 121:

   //server0 = ServerUtil.startServer(args[0], ReplicatedFailbackExample.class.getSimpleName() + "0", 0, 10000);

   // Step 11. Acknowledging the 2nd half of the sent messages will fail as
   // failover to the backup server has occurred
   try {
      message0.acknowledge(); // line 121

   5. Started server0:

   artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$
   server0/bin/artemis-service start

   Starting artemis-service

   artemis-service is now running (24240)
   6. Server0 writes an ERROR to its log (see attached server0_artemis.log).
   7. When I then try to proceed with the client, it writes the following to
   its log and does not exit, but hangs forever:

   Oct 18, 2015 2:55:34 PM
   org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl
   fail

   WARN: AMQ212037: Connection failure has been detected: AMQ119015: The
   connection was disconnected because of server shutdown [code=DISCONNECTED]

   Got message: This is text message 20 (redelivered?: false)

   Got exception while acknowledging message: AMQ119014: Timed out after
   waiting 30,000 ms for response when sending packet 43

   Got message: This is text message 21 (redelivered?: false)

   Got message: This is text message 22 (redelivered?: false)

   Got message: This is text message 23 (redelivered?: false)

   Got message: This is text message 24 (redelivered?: false)

   Got message: This is text message 25 (redelivered?: false)

   Got message: This is text message 26 (redelivered?: false)

   Got message: This is text message 27 (redelivered?: false)

   Got message: This is text message 28 (redelivered?: false)

   Got message: This is text message 29 (redelivered?: false)
   8. As a result, the slave (server1) remains stopped instead of restarting
   as expected, and the master (server0) process appears to be running but
   does not accept any connections.

Exactly the same behavior is observable every time I try this.
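
The symptom in step 8 (a process that is running but does not accept connections) can be checked from the client machine with a plain TCP probe. The following is only a minimal sketch, not part of the Artemis example: the class name AcceptorProbe is hypothetical, the host name broker3 is taken from the steps above, and the ports 61616 and 61617 are assumptions that should be replaced with the acceptor ports configured in each server's broker.xml. A successful connect only shows that something is listening on the port; it does not prove that the broker completes a protocol handshake.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class AcceptorProbe {

    // Returns true if a TCP connection to host:port can be opened within
    // timeoutMillis, false if it is refused, unreachable, or times out.
    static boolean acceptsConnections(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, unknown host, and connect timeout.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed values: adjust the host and ports to your own setup.
        String host = args.length > 0 ? args[0] : "broker3";
        int[] ports = {61616, 61617};
        for (int port : ports) {
            boolean open = acceptsConnections(host, port, 2000);
            System.out.println(host + ":" + port + " accepts connections: " + open);
        }
    }
}
```

Running it once after step 3 and again after step 5 shows which of the two servers, if any, is listening at each stage of the failover and failback.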

BR!
Mihkel

On 13 October 2015 at 20:17, Mihkel Nõges <mihkel.noges@transferwise.com>
wrote:

> Hi Clebert,
>
> No test, just doing it on command line with standalone servers. I'm using
> 1.1.0 installed with wget, not the snapshot.
>
> I'm wondering what the suggested procedure is for admins making changes to
> an Artemis HA cluster of 2 or 3 nodes. If one of the nodes is master by
> configuration, do they need to change its config so that it restarts as a
> slave, to keep the change seamless, and make an instance master by
> configuration only when all the instances need to be restarted?
>
> I tried also a cluster with 2 masters and 2 slaves with 2 separate
> group-name values, but for some reason the second master I started became
> slave for the first immediately. I expected it to become a clustered
> load-balancing parallel master. Our load is not yet high enough to require
> more than one master, so it's just a theoretical question.
>
> BR!
> Mihkel
>
> On 13 October 2015 at 20:03, Clebert Suconic <clebert.suconic@gmail.com>
> wrote:
>
>> The master needs to copy its data from the backup back to live before
>> it's activated.
>>
>> Do you have a test replicating this?
>>
>> Did you try the snapshot build?
>>
>> On Tue, Oct 13, 2015 at 11:58 AM, Mihkel Nõges
>> <mihkel.noges@transferwise.com> wrote:
>> > Hi,
>> >
>> > I configured replicating HA master-slave of Artemis 1.1.0 instances on
>> > Ubuntu 14.04.3.
>> >
>> > When I kill the master, the slave takes over as expected and starts
>> > serving as the new master. When I then start the old master, it fails
>> > with the following errors in the log:
>> >
>> > 16:35:46,476 ERROR [org.apache.activemq.artemis.core.server] AMQ224008: Failed to store id: java.lang.IllegalStateException: Cannot find add info 1
>> > at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799) [artemis-journal-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183) [artemis-journal-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79) [artemis-journal-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194) [artemis-server-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152) [artemis-server-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75) [artemis-server-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java:1784) [artemis-server-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:1625) [artemis-server-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:1535) [artemis-server-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249) [artemis-server-1.1.0.jar:1.1.0]
>> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
>> >
>> > 16:35:46,572 WARN  [org.apache.activemq.artemis.core.server] AMQ222173: Queue jms.queue.DLQ is duplicated during reload. This queue will be renamed as jms.queue.DLQ-0
>> > 16:35:46,572 ERROR [org.apache.activemq.artemis.core.server] AMQ224000: Failure in initialisation: java.lang.IllegalStateException: Cursor 2 had already been created
>> > at org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97) [artemis-server-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140) [artemis-server-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:1631) [artemis-server-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:1535) [artemis-server-1.1.0.jar:1.1.0]
>> > at org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249) [artemis-server-1.1.0.jar:1.1.0]
>> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
>> >
>> > As a result, both the master and the slave remain inaccessible, and no
>> > further restarts resolve the situation.
>> >
>> > Attached also master and slave broker.xml files.
>> >
>> > BR!
>> >
>> > Mihkel Nõges
>>
>>
>>
>> --
>> Clebert Suconic
>>
>
>
