Date: Tue, 20 Oct 2015 11:36:14 -0400 (EDT)
From: Justin Bertram
To: users@activemq.apache.org
Message-ID: <1925883774.33997882.1445355374142.JavaMail.zimbra@redhat.com>
Subject: Re: [Artemis] Master fails to start up after failback

I'm not sure I understand the point of having an HA policy without any HA
(i.e. without any backups). If you want 2 master servers then don't
configure HA, just configure 2 clustered servers.

Also, make sure you don't copy the journal from one server to another when
configuring replication as the journal contains the unique ID of each node.


Justin

----- Original Message -----
From: "Mihkel Nõges"
To: users@activemq.apache.org
Sent: Tuesday, October 20, 2015 9:46:24 AM
Subject: Re: [Artemis] Master fails to start up after failback

Yes, sorry, had typo in email. Had replicated conf and second master became
slave for the first.

Mihkel

On 20 October 2015 at 17:44, Justin Bertram wrote:

> You can't have 2 masters using the same shared-store. However, you can
> have 2 masters each with their own store.
>
>
> Justin
>
> ----- Original Message -----
> From: "Mihkel Nõges"
> To: users@activemq.apache.org
> Sent: Tuesday, October 20, 2015 9:24:21 AM
> Subject: Re: [Artemis] Master fails to start up after failback
>
> Also I had a question earlier about having more than one Artemis master in
> single cluster.
> When I tried this it resulted in only one master becoming a
> master, the other one became a slave for the first one started even though
> I set different group-name values for them in broker.xml. Is this expected
> behavior?
>
> ha-cluster1
>
> ha-cluster2
>
> Mihkel
>
> On 20 October 2015 at 16:53, Mihkel Nõges wrote:
>
> > Hi Tim, Clebert!
> >
> > Yes we considered also the alternatives
> > (http://activemq.apache.org/masterslave.html):
> > *Shared Storage:*
> >
> > We do not have high performance shared storage solution. We have some
> > solution for our current file storage needs, but its I/O is said to be
> > very slow and would need to be extended to support extra load.
> >
> > *Replicated LevelDB:*
> >
> > It sounds cool, but I'm a little bit afraid of moving from one
> > experimental solution to the next. I noticed LevelDB does not support some
> > of the features we need like Scheduled message delivery:
> > http://activemq.apache.org/replicated-leveldb-store.html
> > The LevelDB store does not yet support storing data associated with Delay
> > and Schedule Message Delivery. Those are stored in separate
> > non-replicated KahaDB data files. Unexpected results will occur if you use
> > Delay and Schedule Message Delivery with the replicated leveldb store since
> > that data will not be there when the master fails over to a slave.
> >
> > Notes like this make me feel very uneasy about the solution.
> >
> > *JDBC:*
> >
> > So it seems to me like the most reliable highly available messaging
> > solution in ActiveMQ 5 is JDBC. We have MySQL running as our main DB and
> > setting up a second DB for messaging would be fairly simple for standard
> > procedures of maintenance, backups and disaster recovery etc.
> >
> >
> > I consider this only as a temporary solution until we can use more
> > performant alternative configuration and I'm not expecting Artemis to
> > implement support for JDBC storage ever.
> >
> > We are using messaging in the process of splitting our monolithic application
> > into micro-services. As this is a gradual process, the amount of messages
> > would be very small in the beginning, so having a low performing but reliable
> > JDBC backed broker configuration seems good for a start.
> >
> > I was trying to find the more orthodox approach, but could not find or get
> > good suggestions. I tried disabling fail-back, and starting the master like that
> > resulted in both servers spamming the logs with warnings that another server
> > with the same ID is running. Do I understand correctly I should have backed up and
> > removed the /data folder of the master, reconfigured it as a slave and
> > started it then?
> >
> > Can you give me some overview of already existing deployments of highly
> > available and failing over (not necessarily failing back) Artemis
> > installations in production? I may change my mind about going with it from
> > the start.
> >
> > Mihkel
> >
> >
> > On 20 October 2015 at 16:19, Clebert Suconic wrote:
> >
> >> As far as I know ActiveMQ5 doesn't do failback on the master-slave
> >> journal... and it doesn't have any protocol to sync the data between
> >> master and slave.
> >>
> >>
> >> There is a small regression on the failback that we are dealing with now...
> >> if you set the master as a backup it would work fine...
> >>
> >>
> >> I think your testcase is a bit non orthodox...
> >>
> >> TBH production guys usually don't use failback.. they keep the backup
> >> until they can get to a quiet period and then do the failback (or
> >> restart the system) under low load.
> >>
> >>
> >> I also second Tim Bain on your choice for JDBC.
> >>
> >> I actually always say this.. if you can use JDBC as a storage for
> >> messaging.. don't use messaging at all.. just store and retrieve from
> >> the Database.
> >>
> >>
> >> There's a JIRA open for Artemis on JDBC.. but usually those things are
> >> written because users want, not need it.
> >>
> >> On Tue, Oct 20, 2015 at 3:12 AM, Mihkel Nõges wrote:
> >> > Yes I saw that issue too and set myself as watcher of this when it was
> >> > created. I did not think it could be exactly the same as it is described to
> >> > present itself only in narrow timing related conditions. My case seems to
> >> > be much more broad and basic. Seems like nobody actually tried to set this
> >> > up in a realistic situation.
> >> >
> >> > Do you know of any existing production deployments of Artemis (or hornetq)
> >> > with failover? I thought Artemis, as based on hornetq, should have its
> >> > features as stable as the last hornetq version. We have already used embedded
> >> > hornetq for some time happily. I think it would make a lot of sense to
> >> > grade the Artemis features publicly by their maturity and the usage
> >> > statistics of each feature if known, so it would be easier to compare the
> >> > brokers even among the 3 variants of the ActiveMQ family.
> >> >
> >> > I think it's safer for us to start building our first messaging
> >> > features on ActiveMQ 5.12.1 with JDBC backed Master-Slave instead of
> >> > Artemis and switch to Artemis once it has become more stable and also our
> >> > needs for scalability have grown to make it reasonable. Right now it seems
> >> > there are still too big blockers in Artemis which may threaten the stability
> >> > of our system.
> >> >
> >> > I did not mean this letter to be by any means negative. On the contrary, I
> >> > really hope Artemis would do well as it comes with such a great technical
> >> > foundation and elegant ideas. I think the best for Artemis would be to find
> >> > users that can trust its features and improve it as they grow. This means
> >> > the nucleus of Artemis must be really solid and stable.
> >> >
> >> > BR!
> >> > Mihkel Nõges
> >> >
> >> >
> >> > On 19 October 2015 at 22:15, Clebert Suconic <clebert.suconic@gmail.com> wrote:
> >> >
> >> >> Looks related to me:
> >> >>
> >> >> https://issues.apache.org/jira/browse/ARTEMIS-256
> >> >>
> >> >>
> >> >> On Mon, Oct 19, 2015 at 4:04 AM, Mihkel Nõges wrote:
> >> >> > Basic flow of getting unresponsive failback cluster:
> >> >> > Have machine with Ubuntu 14.04.3
> >> >> >
> >> >> > 1. Install libaio1, Java 1.8.0_60, maven 3.3.3, download and extract
> >> >> > apache-artemis-1.1.0-bin
> >> >> > <http://www.eu.apache.org/dist/activemq/activemq-artemis/1.1.0/apache-artemis-1.1.0-bin.tar.gz>
> >> >> > in /opt
> >> >> > 2. run $ mvn -Prelease install and $ mvn verify in
> >> >> > /opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback
> >> >> > SUCCESS
> >> >> > 3. Clean data folders and start both servers manually:
> >> >> > $ cd /opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target
> >> >> > $ rm -R server0/data/
> >> >> > $ rm -R server1/data/
> >> >> > $ server0/bin/artemis-service start
> >> >> > Starting artemis-service
> >> >> > artemis-service is now running (7154)
> >> >> > $ server1/bin/artemis-service start
> >> >> > Starting artemis-service
> >> >> > artemis-service is now running (7180)
> >> >> > 4. Kill master server and wait for slave to take over
> >> >> > $ kill -9 7154
> >> >> >
> >> >> > $ tail -f server1/log/artemis.log
> >> >> > 08:52:54,798 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
> >> >> > 08:53:02,145 INFO [org.apache.activemq.artemis.core.server] AMQ221109: Apache ActiveMQ Artemis Backup Server version 1.1.0 [null] started, waiting live to fail before it gets active
> >> >> > 08:53:03,582 INFO [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 is synchronized with live-server.
> >> >> > 08:53:03,777 INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
> >> >> > 08:55:59,292 INFO [org.apache.activemq.artemis.core.server] AMQ221037: ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 to become 'live'
> >> >> > 08:55:59,302 WARN [org.apache.activemq.artemis.core.client] AMQ212004: Failed to connect to server.
> >> >> > 08:55:59,778 INFO [org.apache.activemq.artemis.core.server] AMQ221003: trying to deploy queue jms.queue.exampleQueue
> >> >> > 08:55:59,829 WARN [org.apache.activemq.artemis.core.client] AMQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=64ddff0f-7636-11e5-bfa8-f5004e6195f0
> >> >> > 08:55:59,836 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
> >> >> > 08:55:59,869 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started Acceptor at broker3:61617 for protocols [CORE,MQTT,AMQP,HORNETQ,STOMP,OPENWIRE]
> >> >> > 5. Start master again and observe the logs:
> >> >> > $ server0/bin/artemis-service start
> >> >> > Starting artemis-service
> >> >> > artemis-service is now running (7388)
> >> >> >
> >> >> > $ tail -f server0/log/artemis.log
> >> >> > 08:57:24,625 INFO [org.apache.activemq.artemis.core.server] AMQ221012: Using AIO Journal
> >> >> > 08:57:24,694 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-server]. Adding protocol support for: CORE
> >> >> > 08:57:24,702 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: AMQP
> >> >> > 08:57:24,731 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-hornetq-protocol]. Adding protocol support for: HORNETQ
> >> >> > 08:57:24,733 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: MQTT
> >> >> > 08:57:24,743 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-openwire-protocol]. Adding protocol support for: OPENWIRE
> >> >> > 08:57:24,878 INFO [org.apache.activemq.artemis.core.server] AMQ221043: Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: STOMP
> >> >> > 08:57:25,082 INFO [org.apache.activemq.artemis.core.server] AMQ221109: Apache ActiveMQ Artemis Backup Server version 1.1.0 [null] started, waiting live to fail before it gets active
> >> >> > 08:57:27,043 INFO [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 is synchronized with live-server.
> >> >> > 08:57:27,948 INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
> >> >> > 08:57:31,227 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
> >> >> > 08:57:31,252 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
> >> >> > 08:57:31,307 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
> >> >> > 08:57:31,339 INFO [org.apache.activemq.artemis.core.server] AMQ221037: ActiveMQServerImpl::serverUUID=64ddff0f-7636-11e5-bfa8-f5004e6195f0 to become 'live'
> >> >> > 08:57:31,360 WARN [org.apache.activemq.artemis.core.client] AMQ212004: Failed to connect to server.
> >> >> > 08:57:31,413 ERROR [org.apache.activemq.artemis.core.server] AMQ224008: Failed to store id: java.lang.IllegalStateException: Cannot find add info 1
> >> >> > at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799) [artemis-journal-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183) [artemis-journal-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79) [artemis-journal-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java:1784) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:1625) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:1535) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
> >> >> > 08:57:31,540 WARN [org.apache.activemq.artemis.core.server] AMQ222173: Queue jms.queue.exampleQueue is duplicated during reload. This queue will be renamed as jms.queue.exampleQueue-0
> >> >> > 08:57:31,550 ERROR [org.apache.activemq.artemis.core.server] AMQ224000: Failure in initialisation: java.lang.IllegalStateException: Cursor 2 had already been created
> >> >> > at org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:1631) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:1535) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249) [artemis-server-1.1.0.jar:1.1.0]
> >> >> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
> >> >> >
> >> >> >
> >> >> > On 19 October 2015 at 10:31, Mihkel Nõges <mihkel.noges@transferwise.com> wrote:
> >> >> >
> >> >> >> Hi Clebert,
> >> >> >>
> >> >> >> I do not have other code to share with you but the example code in the Artemis
> >> >> >> 1.1.0 binary deployment package. I'm running
> >> >> >> org.apache.activemq.artemis.jms.example.ReplicatedFailbackExample
> >> >> >>
> >> >> >> And only commented out the serverStart and killServer calls which I am
> >> >> >> doing manually.
> >> >> >>
> >> >> >> I do not think I do any of the steps too fast as I tail the server log
> >> >> >> files in parallel and see everything is finished when I start the fail
> >> >> >> back. I have waited many minutes in between.
> >> >> >>
> >> >> >> The only change in configuration to the test is replacing localhost addresses
> >> >> >> with broker3 to make the cluster accessible remotely.
> >> >> >>
> >> >> >> BR!
> >> >> >> Mihkel
> >> >> >>
> >> >> >> On 18 October 2015 at 17:49, Clebert wrote:
> >> >> >>
> >> >> >>> I'm not on my computer now but it sounds like you are doing a fail back
> >> >> >>> immediately after failing over. It takes some time (seconds) for the server
> >> >> >>> to activate on the backup.
> >> >> >>>
> >> >> >>> Later the server will need to copy the data back before it can be
> >> >> >>> activated in fail back mode.
> >> >> >>>
> >> >> >>> It sounds like the live is not reaching the backup for fail back.
> >> >> >>>
> >> >> >>> I will try looking at it on Monday. Maybe you could post your example
> >> >> >>> at your GitHub fork.
> >> >> >>>
> >> >> >>> -- Clebert Suconic typing on the iPhone.
> >> >> >>>
> >> >> >>> > On Oct 18, 2015, at 08:15, Mihkel Nõges <mihkel.noges@transferwise.com> wrote:
> >> >> >>> >
> >> >> >>> > Hello again!
> >> >> >>> >
> >> >> >>> > I would be very grateful if someone could answer my questions. We need
> >> >> >>> > the high availability to work to use the broker in production.
> >> >> >>> >
> >> >> >>> > When I run the replicated-failback example on one machine (broker3) it
> >> >> >>> > succeeds.
> >> >> >>> >
> >> >> >>> > It fails when I run the same test remotely: exactly the same servers with a
> >> >> >>> > slightly modified client.
> >> >> >>> >
> >> >> >>> > I run the client in debug mode from my IDE with commented out serverStart
> >> >> >>> > and killServer calls.
> >> >> >>> > Deleted data folders and started the servers:
> >> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ rm -R server0/data/
> >> >> >>> >
> >> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ rm -R server1/data/
> >> >> >>> >
> >> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ server0/bin/artemis-service start
> >> >> >>> >
> >> >> >>> > Starting artemis-service
> >> >> >>> >
> >> >> >>> > artemis-service is now running (23357)
> >> >> >>> >
> >> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ server1/bin/artemis-service start
> >> >> >>> >
> >> >> >>> > Starting artemis-service
> >> >> >>> >
> >> >> >>> > artemis-service is now running (23383)
> >> >> >>> >
> >> >> >>> > Starting client and stopping on breakpoint at line 103:
> >> >> >>> > //ServerUtil.killServer(server0);
> >> >> >>> > // Step 11. Acknowledging the 2nd half of the sent messages will fail as failover to the
> >> >> >>> > // backup server has occurred
> >> >> >>> > try {
> >> >> >>> > message0.acknowledge(); //line 103
> >> >> >>> > killing server0
> >> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ kill -9 23357
> >> >> >>> >
> >> >> >>> > Proceeding to breakpoint at line 121:
> >> >> >>> > //server0 = ServerUtil.startServer(args[0], ReplicatedFailbackExample.class.getSimpleName() + "0", 0, 10000);
> >> >> >>> >
> >> >> >>> > // Step 11. Acknowledging the 2nd half of the sent messages will fail as failover to the
> >> >> >>> > // backup server has occurred
> >> >> >>> > try {
> >> >> >>> > message0.acknowledge(); // line 121
> >> >> >>> > Starting server0:
> >> >> >>> > artemis@broker3:/opt/apache-artemis-1.1.0/examples/features/ha/replicated-failback/target$ server0/bin/artemis-service start
> >> >> >>> >
> >> >> >>> > Starting artemis-service
> >> >> >>> >
> >> >> >>> > artemis-service is now running (24240)
> >> >> >>> >
> >> >> >>> > Server0 writes ERROR to its log (see attached server0_artemis.log).
> >> >> >>> > Now when trying to proceed with the client it writes the following in
> >> >> >>> > the log and does not exit, but remains hanging forever:
> >> >> >>> >
> >> >> >>> > Oct 18, 2015 2:55:34 PM org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl fail
> >> >> >>> >
> >> >> >>> > WARN: AMQ212037: Connection failure has been detected: AMQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
> >> >> >>> >
> >> >> >>> > Got message: This is text message 20 (redelivered?: false)
> >> >> >>> >
> >> >> >>> > Got exception while acknowledging message: AMQ119014: Timed out after waiting 30,000 ms for response when sending packet 43
> >> >> >>> >
> >> >> >>> > Got message: This is text message 21 (redelivered?: false)
> >> >> >>> >
> >> >> >>> > Got message: This is text message 22 (redelivered?: false)
> >> >> >>> >
> >> >> >>> > Got message: This is text message 23 (redelivered?: false)
> >> >> >>> >
> >> >> >>> > Got message: This is text message 24 (redelivered?: false)
> >> >> >>> >
> >> >> >>> > Got message: This is text message 25 (redelivered?: false)
> >> >> >>> >
> >> >> >>> > Got message: This is text message 26 (redelivered?: false)
> >> >> >>> >
> >> >> >>> > Got message: This is text message 27 (redelivered?: false)
> >> >> >>> >
> >> >> >>> > Got message: This is text message 28 (redelivered?: false)
> >> >> >>> >
> >> >> >>> > Got message: This is text message 29 (redelivered?: false)
> >> >> >>> >
> >> >> >>> > As a result the slave (server1) remains stopped, not restarted as
> >> >> >>> > expected, and the master (server0) process appears to be running but does
> >> >> >>> > not accept any connections.
> >> >> >>> >
> >> >> >>> > Exactly the same behavior is observable every time I try this.
> >> >> >>> >
> >> >> >>> > BR!
> >> >> >>> > Mihkel
> >> >> >>> >
> >> >> >>> >> On 13 October 2015 at 20:17, Mihkel Nõges <mihkel.noges@transferwise.com> wrote:
> >> >> >>> >> Hi Clebert,
> >> >> >>> >>
> >> >> >>> >> No test, just doing it on command line with standalone servers. I'm
> >> >> >>> >> using 1.1.0 installed with wget, not the snapshot.
> >> >> >>> >>
> >> >> >>> >> I'm wondering what should be the suggested procedure for admins to make
> >> >> >>> >> changes to an HA cluster of 2 or 3 nodes of Artemis. If one of the nodes is
> >> >> >>> >> master by configuration, do they need to change its config before
> >> >> >>> >> restarting it to become slave to have a seamless change process, and make some
> >> >> >>> >> instance master by configuration only if all the instances need to be
> >> >> >>> >> restarted?
> >> >> >>> >>
> >> >> >>> >> I tried also a cluster with 2 masters and 2 slaves with 2 separate
> >> >> >>> >> group-name values, but for some reason the second master I started became
> >> >> >>> >> slave for the first immediately. I expected it to become a clustered
> >> >> >>> >> load-balancing parallel master. Our loads are not yet that high to require
> >> >> >>> >> more than one master, so it's just a theoretical question.
> >> >> >>> >>
> >> >> >>> >> BR!
> >> >> >>> >> Mihkel
> >> >> >>> >>
> >> >> >>> >>> On 13 October 2015 at 20:03, Clebert Suconic <clebert.suconic@gmail.com> wrote:
> >> >> >>> >>> The master needs to copy its data from the backup back to live before
> >> >> >>> >>> it's activated.
> >> >> >>> >>>
> >> >> >>> >>> Do you have a test replicating this?
> >> >> >>> >>>
> >> >> >>> >>> Did you try the snapshot build?
> >> >> >>> >>>
> >> >> >>> >>> On Tue, Oct 13, 2015 at 11:58 AM, Mihkel Nõges wrote:
> >> >> >>> >>> > Hi,
> >> >> >>> >>> >
> >> >> >>> >>> > I configured replicating HA master-slave of Artemis 1.1.0 instances on
> >> >> >>> >>> > Ubuntu 14.04.3.
> >> >> >>> >>> >
> >> >> >>> >>> > When I kill master the slave takes over as expected and starts serving as
> >> >> >>> >>> > new master. When I then start the old master, it fails with the following
> >> >> >>> >>> > errors in the log:
> >> >> >>> >>> >
> >> >> >>> >>> > 16:35:46,476 ERROR [org.apache.activemq.artemis.core.server] AMQ224008: Failed to store id: java.lang.IllegalStateException: Cannot find add info 1
> >> >> >>> >>> > at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:799) [artemis-journal-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.journal.impl.JournalBase.appendDeleteRecord(JournalBase.java:183) [artemis-journal-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:79) [artemis-journal-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.deleteID(JournalStorageManager.java:1194) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.deleteID(BatchingIDGenerator.java:152) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.persistence.impl.journal.BatchingIDGenerator.cleanup(BatchingIDGenerator.java:75) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.loadBindingJournal(JournalStorageManager.java:1784) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:1625) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:1535) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
> >> >> >>> >>> >
> >> >> >>> >>> > 16:35:46,572 WARN [org.apache.activemq.artemis.core.server] AMQ222173: Queue jms.queue.DLQ is duplicated during reload. This queue will be renamed as jms.queue.DLQ-0
> >> >> >>> >>> > 16:35:46,572 ERROR [org.apache.activemq.artemis.core.server] AMQ224000: Failure in initialisation: java.lang.IllegalStateException: Cursor 2 had already been created
> >> >> >>> >>> > at org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.createSubscription(PageCursorProviderImpl.java:97) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.server.impl.PostOfficeJournalLoader.initQueues(PostOfficeJournalLoader.java:140) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.loadJournals(ActiveMQServerImpl.java:1631) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.initialisePart2(ActiveMQServerImpl.java:1535) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:249) [artemis-server-1.1.0.jar:1.1.0]
> >> >> >>> >>> > at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_60]
> >> >> >>> >>> >
> >> >> >>> >>> > As a result both the master and the slave remain inaccessible and no further
> >> >> >>> >>> > restarts solve the situation.
> >> >> >>> >>> >
> >> >> >>> >>> > Attached also master and slave broker.xml files.
> >> >> >>> >>> >
> >> >> >>> >>> > BR!
> >> >> >>> >>> >
> >> >> >>> >>> > Mihkel Nõges
> >> >> >>> >>>
> >> >> >>> >>> --
> >> >> >>> >>> Clebert Suconic
> >> >>
> >> >> --
> >> >> Clebert Suconic
> >>
> >> --
> >> Clebert Suconic
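
For reference, the `group-name` setup discussed in this thread (a replicated live/backup pair that should only pair with each other) might look roughly like the fragments below. This is a sketch, not the poster's actual broker.xml: the XML markup was lost from the archived message and only the value `ha-cluster1` comes from the thread. The `check-for-live-server` and `allow-failback` settings are assumptions modelled on the stock replicated-failback example, and both fragments are assumed to sit inside the `<core>` element.

```xml
<!-- Live ("master") broker.xml fragment; illustrative sketch only -->
<ha-policy>
   <replication>
      <master>
         <!-- value taken from the thread; a live server pairs only with backups in the same group -->
         <group-name>ha-cluster1</group-name>
         <!-- assumed setting: on restart, look for a backup that already took over this node ID -->
         <check-for-live-server>true</check-for-live-server>
      </master>
   </replication>
</ha-policy>

<!-- Backup ("slave") broker.xml fragment; illustrative sketch only -->
<ha-policy>
   <replication>
      <slave>
         <group-name>ha-cluster1</group-name>
         <!-- assumed setting: hand control back when the original live server returns -->
         <allow-failback>true</allow-failback>
      </slave>
   </replication>
</ha-policy>
```

Following Justin's advice at the top of the thread, two independent live servers would instead leave out `ha-policy` altogether and rely on the cluster connection alone, with each broker started from its own freshly created data directory rather than a copied journal.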