activemq-issues mailing list archives

From "John Anderson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AMQ-6092) Clear Broker to Broker Connection Info At Startup
Date Thu, 17 Dec 2015 00:02:46 GMT
John Anderson created AMQ-6092:
----------------------------------

             Summary: Clear Broker to Broker Connection Info At Startup
                 Key: AMQ-6092
                 URL: https://issues.apache.org/jira/browse/AMQ-6092
             Project: ActiveMQ
          Issue Type: Bug
          Components: activemq-leveldb-store
    Affects Versions: 5.12.0
         Environment: Linux
            Reporter: John Anderson
            Priority: Minor


This is a very difficult bug to describe, and an even tougher bug to replicate, so I guess
I'll start by describing the circumstances that triggered this bug.

At each of 3 data centers I have a replicated LevelDB ActiveMQ cluster.  There are store and
forward connections between each data center. Phoenix has non-duplexed connections to Amsterdam
and Ashburn, and in turn each of those sites has connections to the others.  This makes a
mesh type topology. Within a single data center, I have 3 copies of each broker using the
replicated LevelDB feature in a kind of active/passive/passive configuration.
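
As a rough illustration of the mesh described above, the sketch below enumerates the one-way links between the three sites. It assumes every site opens its own non-duplex connection to each of the other two, which is a reading of the description and not something taken from the attached configurations. The function name mesh_links is hypothetical.

```python
from itertools import permutations

# Site names come from the report; the link list is an illustration only.
SITES = ["Phoenix", "Amsterdam", "Ashburn"]


def mesh_links(sites):
    """Return every ordered (source, target) pair, one per non-duplex link."""
    return list(permutations(sites, 2))


if __name__ == "__main__":
    for source, target in mesh_links(SITES):
        print(f"{source} -> {target}")
```

Under that assumption, three sites yield six directed connections, and each of them has to be re-established after a restart.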

This is just a PoC setup, sitting on VMware infrastructure, and it sat idle for quite some
time.  At some point, while it was sitting idle, we had storage maintenance, which caused
a storage disconnect in Ashburn and Amsterdam.  A storage disconnect is akin to just pulling
the disk out of the box.  Needless to say, AMQ didn't like this one bit.  However, surviving
a storage disconnect isn't really the point of the bug.  The bug came into play when I tried
restarting the cluster after storage was restored.

I restarted each of the VMs, and began to bring the ActiveMQ instances back online, starting
ZooKeeper, then starting ActiveMQ.  After bringing each replicated LevelDB group back up,
they refused to reconnect to each other via the store & forward connections.  I kept getting
this error:


bq. Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due to javax.jms.InvalidClientIDException:
Broker: ams1-1 - Client: ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from
vm://ams1-1#0 | org.apache.activemq.broker.TransportConnection | triggerStartAsyncNetworkBridgeCreation:
remoteBroker=unconnected, localBroker= vm://ams1-1#58408


Not a single broker would connect to another broker, and the messages imply that these connections
already existed.  However, I could see with netstat that connection attempts were being made,
and this message occurred over and over, as if the brokers were retrying.
However, the web-based admin console showed nothing under Network.  Not a single real connection
was made.
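
To make it easier to see which client ID is reported as already connected, here is a minimal sketch that pulls the broker name, client ID, and existing connection address out of the error text quoted above. The regular expression is based only on that one message, so treating it as the general log format is an assumption. The names PATTERN and parse_duplicate_client are hypothetical.

```python
import re

# Error text copied from the report; the archive wrapped it across lines.
RAW = """Failed to add Connection ams1-1->ash1-1-38769-1450213134683-58409:1 due to javax.jms.InvalidClientIDException:
Broker: ams1-1 - Client: ams1-1_ash1-1_queues_ash1-1_inbound_ams1-1 already connected from
vm://ams1-1#0 | org.apache.activemq.broker.TransportConnection | triggerStartAsyncNetworkBridgeCreation:
remoteBroker=unconnected, localBroker= vm://ams1-1#58408"""

# Assumed layout, derived from the single message above.
PATTERN = re.compile(
    r"Broker: (?P<broker>\S+) - Client: (?P<client_id>\S+) "
    r"already connected from (?P<address>\S+)"
)


def parse_duplicate_client(text):
    """Return broker, client_id and address from the error, or None."""
    flat = " ".join(text.split())  # undo the hard line wraps
    match = PATTERN.search(flat)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_duplicate_client(RAW))
```

For the message above, the existing connection is reported on a vm:// address, meaning it is local to the ams1-1 broker itself.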

After a lot of troubleshooting, especially looking into the LDAP Authentication/Authorization
settings and mechanism, I finally figured that it had to be something persisted, because this
exact same setup, without a single configuration change, had been working perfectly before
the storage disconnect.

In the end, I completely deleted the LevelDB directory and restarted ActiveMQ
on each node, and the setup is working flawlessly once again.
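
For anyone who wants to try the same workaround but keep the suspect data for later inspection, the sketch below renames the store directory instead of deleting it. This is a variation on what was actually done, not the same procedure. It assumes the broker is stopped first, and the path in the usage comment is a placeholder, not the real location of the store. Any messages persisted in the store are no longer visible to the broker after the move. The helper name set_aside_store is hypothetical.

```python
import shutil
import time
from pathlib import Path


def set_aside_store(store_dir, suffix=None):
    """Rename store_dir to a timestamped sibling and return the new path.

    Run this only while the broker is stopped.
    """
    store = Path(store_dir)
    if not store.is_dir():
        raise FileNotFoundError(f"no store directory at {store}")
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    backup = store.with_name(f"{store.name}.bak-{suffix}")
    if backup.exists():
        raise FileExistsError(f"backup already exists at {backup}")
    shutil.move(str(store), str(backup))
    return backup


# Example call (placeholder path, adjust to the local installation):
# set_aside_store("/path/to/activemq/data/leveldb")
```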

I haven't yet tried 5.13.0, and I'm pretty sure management isn't going to allow me to cause
a storage disconnect so I can test it, but I have a feeling that some information about store
& forward connections is stored in the persistent store, and some sort of short write
occurred when the storage disconnect happened.  However, since this data, whatever it may
be, wasn't cleared or reset at broker startup, the broker erroneously believed that the connections
I was trying to establish already existed.

This may be an incorrect assumption, but at startup, the broker should reset any data it has
that pertains to store and forward connections, because there's no way anything can really
be connected at that time.

I'll attach my configurations so that the environment, if not the storage disconnect, can
be replicated.

The steps to reproduce, if they were practical, would be:

1.) Set up an AMQ store & forward mesh based on the attached configurations, and on VMware
ESX infrastructure.
2.) Cause a storage interruption.
3.) Reboot the VMs running AMQ to reset the read-only state of the block devices, after the
storage interruption.
4.) Try to bring the cluster back online.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
