activemq-users mailing list archives

From Kevin Burton <bur...@spinn3r.com>
Subject Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.
Date Sun, 22 Feb 2015 22:06:23 GMT
Here’s the project with a unit test to demonstrate the bug.

https://github.com/spinn3r/activemq-gc-purge-lockup

The one thing is that I’m missing an SLF4J dependency, which keeps ActiveMQ
from logging the queue GCs to the console… with that added you can see all
the GC activity.

But it does successfully show that there is significant producer creation
latency: 60-80 seconds.

Kevin


On Sun, Feb 22, 2015 at 1:08 PM, Kevin Burton <burton@spinn3r.com> wrote:

> Btw. another way to fix this is to set the purge interval low, say 15
> seconds, and then set the max number of queues to delete each time to a low
> value.
>
> This wouldn’t be as pretty as using one lock per queue, but it would be easy
> to implement without modifying much code.
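>
> A minimal sketch of that workaround, assuming an embedded broker and the
> BrokerService purge settings (schedulePeriodForDestinationPurge /
> maxPurgedDestinationsPerSweep); the numbers are just illustrative, and the
> destination policy still has to have inactive-destination GC enabled:
>
>     import org.apache.activemq.broker.BrokerService;
>
>     public class LowPurgeIntervalExample {
>         public static void main(String[] args) throws Exception {
>             BrokerService broker = new BrokerService();
>             broker.setPersistent(false);
>             // Sweep for inactive destinations every 15 seconds...
>             broker.setSchedulePeriodForDestinationPurge(15000);
>             // ...but only GC a few per sweep, so the purge lock is never
>             // held long enough to starve new producers/consumers.
>             broker.setMaxPurgedDestinationsPerSweep(10);
>             broker.start();
>         }
>     }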
>
> Kevin
>
> On Sun, Feb 22, 2015 at 1:01 PM, Kevin Burton <burton@spinn3r.com> wrote:
>
>> OK.  I think I have a handle on what’s happening during queue
>> purges that cause GC lockups.
>>
>> Wanted to get your feedback.
>>
>> I can create a bug for this if you guys think my assessment is accurate,
>> as I think the fix is somewhat reasonable / easy.
>>
>> I have a unit test which duplicates this now but I need to do more
>> cleanup so I can put it into a public github repo for you guys to look at.
>>
>> ## Problem overview.
>>
>> ActiveMQ supports a feature where it can GC a queue that is inactive, i.e.
>> no messages and no consumers.
>>
>> However, there’s a bug where
>>
>> purgeInactiveDestinations
>>
>> in
>>
>> org.apache.activemq.broker.region.RegionBroker
>>
>> acquires the write side of a broker-wide read/write lock
>> (inactiveDestinationsPurgeLock), which is held for the entire queue GC pass.
>>
>> Each individual queue GC takes about 100ms with a disk-backed queue and
>> 10ms with a memory-backed (non-persistent) queue.  If you have thousands of
>> them to GC at once, the inactiveDestinationsPurgeLock is held the
>> entire time, which can last from 60 seconds to 5 minutes (and is essentially
>> unbounded).
>>
>> A read lock on the same lock is held in addConsumer and addProducer, so when
>> a new consumer or producer tries to connect, it is blocked until the queue GC
>> completes.
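>>
>> The shape of it is roughly this (a toy sketch of the pattern, not the real
>> RegionBroker code -- one broker-wide lock, write-held for the whole purge
>> pass, read-held on every addProducer/addConsumer):
>>
>>     import java.util.List;
>>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>>
>>     public class GlobalPurgeLockSketch {
>>         private final ReentrantReadWriteLock inactiveDestinationsPurgeLock =
>>                 new ReentrantReadWriteLock();
>>
>>         public void purgeInactiveDestinations(List<String> inactiveQueues) {
>>             inactiveDestinationsPurgeLock.writeLock().lock();
>>             try {
>>                 for (String queue : inactiveQueues) {
>>                     removeDestination(queue);  // ~10-100ms each; thousands => minutes
>>                 }
>>             } finally {
>>                 inactiveDestinationsPurgeLock.writeLock().unlock();
>>             }
>>         }
>>
>>         public void addProducer(String queue) {
>>             inactiveDestinationsPurgeLock.readLock().lock();  // blocks while a purge runs
>>             try {
>>                 // register the new producer...
>>             } finally {
>>                 inactiveDestinationsPurgeLock.readLock().unlock();
>>             }
>>         }
>>
>>         private void removeDestination(String queue) { /* delete the queue */ }
>>     }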
>>
>> Existing producers/consumers work JUST fine.
>>
>> The lock MUST be held on each queue, because if it isn’t there’s a race:
>> a queue is flagged to be GCd, then a producer comes in and writes a
>> new message, and then the background thread deletes the queue it marked
>> as GCable even though it now holds the newly produced message.  This would
>> result in data loss.
>>
>> ## Confirmed
>>
>> I have a unit test now that confirms this.  I create 7500 queues,
>> produce 1 message in each, then consume it.  I keep all consumers open.
>> Then I release all 7500 queues at once.
>>
>> I then have a consumer/producer pair that I hold open and produce and consume
>> messages on.  This works fine.
>>
>> However, I have another which creates a new producer each time.  This
>> will block for 60,000ms multiple times while queue GC is happening in the
>> background.
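>>
>> The blocked call is just the JMS createProducer(); the probe in the test is
>> basically this (sketch only -- the broker URL and queue name here are made up):
>>
>>     import javax.jms.Connection;
>>     import javax.jms.MessageProducer;
>>     import javax.jms.Session;
>>     import org.apache.activemq.ActiveMQConnectionFactory;
>>
>>     public class ProducerLatencyProbe {
>>         public static void main(String[] args) throws Exception {
>>             ActiveMQConnectionFactory factory =
>>                     new ActiveMQConnectionFactory("tcp://localhost:61616");
>>             Connection connection = factory.createConnection();
>>             connection.start();
>>             Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
>>
>>             // While the broker is GCing thousands of queues, this call stalls
>>             // behind the purge lock.
>>             long start = System.currentTimeMillis();
>>             MessageProducer producer =
>>                     session.createProducer(session.createQueue("latency-probe"));
>>             System.out.println("createProducer took "
>>                     + (System.currentTimeMillis() - start) + " ms");
>>
>>             producer.close();
>>             connection.close();
>>         }
>>     }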
>>
>> ## Proposed solution.
>>
>> Rework the read/write locks to be one lock per queue.
>>
>> So instead of using one global lock per broker, we use one lock per queue
>> name.  This way the locks are FAR more granular and new producers/consumers
>> won’t block during this time period.
>>
>> If a queue named ‘foo’ is being GCd and a new producer is created on a
>> ‘bar’ queue the bar producer will work fine and won’t block on the foo
>> queue.
>>
>> This can be accomplished by:
>>
>> creating a concurrent hash map with the queue name as the key (or
>> an ActiveMQDestination as the key) and read/write locks as the
>> values.  We then use this as the lock backing, and the purge thread and the
>> add/remove producer/consumer paths all reference the more granular per-queue lock.
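>>
>> Something along these lines (sketch only; the class and method names are
>> mine, not a patch against RegionBroker):
>>
>>     import java.util.concurrent.ConcurrentHashMap;
>>     import java.util.concurrent.ConcurrentMap;
>>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>>     import org.apache.activemq.command.ActiveMQDestination;
>>
>>     public class PerDestinationPurgeLocks {
>>         private final ConcurrentMap<ActiveMQDestination, ReentrantReadWriteLock> locks =
>>                 new ConcurrentHashMap<ActiveMQDestination, ReentrantReadWriteLock>();
>>
>>         // One lock per destination, created lazily on first use.
>>         public ReentrantReadWriteLock lockFor(ActiveMQDestination destination) {
>>             ReentrantReadWriteLock lock = locks.get(destination);
>>             if (lock == null) {
>>                 ReentrantReadWriteLock candidate = new ReentrantReadWriteLock();
>>                 ReentrantReadWriteLock existing = locks.putIfAbsent(destination, candidate);
>>                 lock = (existing != null) ? existing : candidate;
>>             }
>>             return lock;
>>         }
>>     }
>>
>> The purge thread would take lockFor(dest).writeLock() around removing just
>> that one destination, and addProducer/addConsumer would take
>> lockFor(dest).readLock(), so GCing one queue never blocks a new producer
>> or consumer on a different queue.  (The map entry can also be dropped once
>> the destination is actually removed, so it doesn’t leak.)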
>>
>> ….
>>
>> Now, initially I was thinking I would just fix this myself; however, I
>> might have a workaround for our queue design that uses fewer queues, and I
>> think this will drop our queue requirement from a few thousand to a few
>> dozen.  So at that point this won’t be as much of a priority.
>>
>> However, this is a significant scalability issue with ActiveMQ… one that
>> doesn’t need to exist.  In our situation I think our performance would be
>> fine even with 7500 queues once this bug is fixed.
>>
>> Perhaps it should just exist as an open JIRA that could be fixed at some
>> time in the future?
>>
>> I can also get time to clean up a project with a test which demonstrates
>> this problem.
>>
>> Kevin
>>
>
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
