activemq-users mailing list archives

From "Hendley, Sam" <Sam.Hend...@sensus.com>
Subject Exceeding MemoryUsage causes Network Connector connections to stop
Date Tue, 30 Dec 2014 22:46:20 GMT
Hello ActiveMQ community:

TL;DR: I now think this is really a misconfiguration on our part, but it took quite a lot
of digging before we nailed the issue. I am reporting it to save others time in the future.

We are running a "store and forward network of brokers" where each broker is connected to
all other brokers (full mesh). Our applications connect only to their local broker. Under
load we would occasionally see a broker just "disappear" from the rest of the cluster, and
all of the work would end up on the remaining nodes. We had trouble isolating the fault
because our overall system wasn't handling this gracefully and generated additional traffic,
which made cause and effect difficult to trace.

I set out to reproduce the failure in as small a case as I could. The result is at:
https://github.com/samhendley/activemq-bug-reports where I document the experiment
more fully. I wasn't able to get a 100% reproduction; the best I could do was about 50%
of the runs on my machine failing. This makes me believe it is probably a race condition,
but I wasn't able to find any obvious smoking guns.

In short, I found that if the overall broker MemoryUsage limit is exceeded (which can happen
because producer flow control is off), the network connectors between the brokers sometimes
become stuck. If I enabled producer flow control or increased the configured max memory, the
issue was no longer reproducible.

It looks like we can reconfigure our production systems to work around this problem, but
should I file a bug for this? A silent failure like this is really not fun to diagnose on
a large-scale system.

Sam

From the GitHub page:

Bug description:

If the configured MemoryStore limit is large enough for usage to stay below 100% while the
requestor application is dumping messages into the broker network, the test passes. If,
however, the memory usage on the brokers rises above 100% (in this case peaking around
600% of 100 MB), the network connectors sometimes become "stuck". Stuck in this case means
there are messages enqueued on one or both of the "server" brokers but the messages are not
being dequeued or forwarded by the network connector back to the "client" broker.
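
The "stuck" condition described above can be expressed as a check over periodic counter
readings. The following is only a minimal sketch and is not part of the test project:
QueueSample, StallDetector, and the window parameter are hypothetical names, and the
counters are assumed to be cumulative enqueue/dequeue totals of the kind visible in JMX.

```java
import java.util.Arrays;
import java.util.List;

/** One reading of a queue's cumulative counters (hypothetical helper type). */
final class QueueSample {
    final long enqueueCount;
    final long dequeueCount;

    QueueSample(long enqueueCount, long dequeueCount) {
        this.enqueueCount = enqueueCount;
        this.dequeueCount = dequeueCount;
    }

    /** Messages enqueued but not yet dequeued at the time of this reading. */
    long backlog() {
        return enqueueCount - dequeueCount;
    }
}

/** Flags a queue as stuck when a backlog exists but nothing is being dequeued. */
final class StallDetector {
    /**
     * Returns true when each of the last `window` samples shows a positive
     * backlog and the dequeue count is identical across all of them.
     */
    static boolean isStuck(List<QueueSample> samples, int window) {
        if (window < 2 || samples.size() < window) {
            return false; // not enough readings to judge
        }
        List<QueueSample> recent =
            samples.subList(samples.size() - window, samples.size());
        long firstDequeue = recent.get(0).dequeueCount;
        for (QueueSample s : recent) {
            if (s.backlog() <= 0 || s.dequeueCount != firstDequeue) {
                return false; // queue is empty or still draining
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Illustrative readings only; the values are assumptions, not logged data.
        List<QueueSample> readings = Arrays.asList(
            new QueueSample(417_000, 363_000),
            new QueueSample(417_000, 363_000),
            new QueueSample(417_000, 363_000));
        System.out.println("stuck = " + isStuck(readings, 3));
    }
}
```

Requiring several consecutive flat readings, rather than one, is meant to avoid flagging a
connector that is merely slow; the right window depends on the polling interval.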

This issue doesn't happen on every run with a small memory size, but in my tests it failed
about 50% of the times I tried running it. You may have to run it a few times before
getting it to fail. On one failure JMX showed that 417k responses had been generated on server1
but only 363k had been dequeued for transmission to the client broker. In that test run the
other server had correctly handled the other 583k requests.
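
To make the numbers from that run concrete, the arithmetic can be sketched as below. This is
only an illustration: RunAccounting and its method names are hypothetical, and the inputs are
the rounded "k" figures quoted above, so the results are approximate.

```java
/** Back-of-the-envelope accounting for the failed run (hypothetical helper). */
final class RunAccounting {
    /** Messages produced on a broker but never dequeued for forwarding. */
    static long backlog(long enqueued, long dequeued) {
        return enqueued - dequeued;
    }

    /** Total requests across both server brokers. */
    static long totalRequests(long server1, long server2) {
        return server1 + server2;
    }

    public static void main(String[] args) {
        long stuckOnServer1 = backlog(417_000, 363_000);  // responses left on server1
        long total = totalRequests(417_000, 583_000);     // requests across both servers
        System.out.println("stuck on server1: " + stuckOnServer1);
        System.out.println("total requests:   " + total);
    }
}
```

With the rounded figures, roughly 54k responses were left on server1 out of about one million
requests in total.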

When it does fail, there is nothing in the log that indicates anything is amiss. I would have
expected to see some sort of log message to indicate that the network connector has been throttled
(if indeed that is what is happening). This same test done with a single broker always passes,
which leads me to believe it really is a problem with the network connectors.


