activemq-users mailing list archives

From Kevin Burton <bur...@spinn3r.com>
Subject Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.
Date Sun, 22 Feb 2015 21:08:43 GMT
Btw, another way to mitigate this is to set the purge interval low, say 15
seconds, and then cap the maximum number of queues deleted per sweep at a
low value.

This wouldn’t be as pretty as using one lock per queue, but it would be easy
to implement without modifying much code.
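
Here’s a minimal sketch of that workaround, assuming the
setSchedulePeriodForDestinationPurge and setMaxPurgedDestinationsPerSweep
setters on BrokerService (check your broker version for those names):

    import org.apache.activemq.broker.BrokerService;

    public class ShortPurgeSweepBroker {
        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();
            // Run the purge sweep every 15 seconds...
            broker.setSchedulePeriodForDestinationPurge(15000);
            // ...but only GC a handful of queues per sweep, so the purge
            // lock is never held for long. Queue GC itself still has to be
            // enabled via destination policy (gcInactiveDestinations).
            broker.setMaxPurgedDestinationsPerSweep(10);
            broker.start();
        }
    }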

Kevin

On Sun, Feb 22, 2015 at 1:01 PM, Kevin Burton <burton@spinn3r.com> wrote:

> OK.  I think I have a handle on what’s happening during queue
> purges that cause GC lockups.
>
> Wanted to get your feedback.
>
> I can create a bug for this if you guys think my assessment is accurate, as
> I think the fix is somewhat reasonable / easy.
>
> I have a unit test which duplicates this now but I need to do more cleanup
> so I can put it into a public github repo for you guys to look at.
>
> ## Problem overview.
>
> ActiveMQ supports a feature where it can GC a queue that is inactive, i.e.
> no messages and no consumers.
>
> However, there’s a bug where
>
> purgeInactiveDestinations
>
> in
>
> org.apache.activemq.broker.region.RegionBroker
>
> takes the write side of a read/write lock (inactiveDestinationsPurgeLock)
> and holds it during the entire queue GC pass.
>
> Each individual queue GC takes about 100ms with a disk-backed queue and
> 10ms with a memory-backed (non-persistent) queue. If you have thousands of
> them to GC at once, the inactiveDestinationsPurgeLock is held the
> entire time, which can last from 60 seconds to 5 minutes (and is
> essentially unbounded).
>
> A read lock on the same lock is also taken in addConsumer and addProducer,
> so when a new consumer or producer tries to connect, it blocks until queue
> GC completes.
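>
> A paraphrased sketch of the shape of that locking (not the exact
> RegionBroker source; the removal call is illustrative):
>
>     // Purge thread: write lock held across the whole sweep.
>     inactiveDestinationsPurgeLock.writeLock().lock();
>     try {
>         for (ActiveMQDestination dest : candidates) {
>             removeDestination(dest); // ~100ms each when disk-backed
>         }
>     } finally {
>         inactiveDestinationsPurgeLock.writeLock().unlock();
>     }
>
>     // addProducer/addConsumer: read lock, so new connections queue up
>     // behind the entire sweep.
>     inactiveDestinationsPurgeLock.readLock().lock();
>     try {
>         // register the new producer or consumer
>     } finally {
>         inactiveDestinationsPurgeLock.readLock().unlock();
>     }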
>
> Existing producers/consumers work JUST fine.
>
> The lock MUST be held on each queue because if it isn’t, there’s a race:
> a queue is flagged to be GC’d, then a producer comes in and writes a new
> message, then the background thread deletes the queue it had marked as
> GCable even though it now holds the newly produced message.  This would
> result in data loss.
>
> ## Confirmed
>
> I have a unit test now that confirms this.  I create 7500 queues,
> produce 1 message in each, then consume it, keeping all consumers open.
> Then I release all 7500 queues at once.
>
> I then have a consumer/producer pair that I hold open, producing and
> consuming messages on it.  This works fine.
>
> However, I have another client which creates a new producer each time.  It
> will block for 60,000ms multiple times while queue GC is happening in the
> background.
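>
> A condensed sketch of that test (the broker URL, queue names, and timeouts
> here are placeholders, not the exact test code):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     import javax.jms.Connection;
>     import javax.jms.ConnectionFactory;
>     import javax.jms.MessageConsumer;
>     import javax.jms.MessageProducer;
>     import javax.jms.Queue;
>     import javax.jms.Session;
>
>     import org.apache.activemq.ActiveMQConnectionFactory;
>
>     public class QueueGcLockupRepro {
>         public static void main(String[] args) throws Exception {
>             ConnectionFactory factory =
>                 new ActiveMQConnectionFactory("tcp://localhost:61616");
>             Connection connection = factory.createConnection();
>             connection.start();
>             Session session =
>                 connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
>
>             // 7500 queues: produce one message on each, consume it, and
>             // keep every consumer open so the queues stay active.
>             List<MessageConsumer> consumers = new ArrayList<MessageConsumer>();
>             for (int i = 0; i < 7500; i++) {
>                 Queue queue = session.createQueue("test.gc." + i);
>                 MessageProducer producer = session.createProducer(queue);
>                 producer.send(session.createTextMessage("x"));
>                 producer.close();
>                 MessageConsumer consumer = session.createConsumer(queue);
>                 consumer.receive(5000);
>                 consumers.add(consumer);
>             }
>
>             // Release all 7500 queues at once; they become GC candidates.
>             for (MessageConsumer consumer : consumers) {
>                 consumer.close();
>             }
>
>             // While the purge sweep runs, time how long a fresh producer
>             // blocks on inactiveDestinationsPurgeLock.
>             long start = System.currentTimeMillis();
>             session.createProducer(session.createQueue("test.probe"));
>             System.out.println("createProducer blocked for "
>                 + (System.currentTimeMillis() - start) + "ms");
>         }
>     }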
>
> ## Proposed solution.
>
> Rework the read/write locks to be one lock per queue.
>
> So instead of using one global lock per broker, we use one lock per queue
> name.  This way the locks are FAR more granular and new producers/consumers
> won’t block during this time period.
>
> If a queue named ‘foo’ is being GC’d and a new producer is created on a
> ‘bar’ queue, the bar producer will work fine and won’t block on the foo
> queue.
>
> This can be accomplished by creating a concurrent hash map keyed by the
> queue name (or by ActiveMQDestination) whose values are read/write locks.
> We then use this map as the lock backing, and the purge thread and the
> addProducer/addConsumer paths all reference the more granular lock.
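>
> A minimal sketch of that registry (the class name is hypothetical;
> ActiveMQDestination already implements equals/hashCode):
>
>     import java.util.concurrent.ConcurrentHashMap;
>     import java.util.concurrent.ConcurrentMap;
>     import java.util.concurrent.locks.ReadWriteLock;
>     import java.util.concurrent.locks.ReentrantReadWriteLock;
>
>     import org.apache.activemq.command.ActiveMQDestination;
>
>     public class DestinationLockRegistry {
>
>         private final ConcurrentMap<ActiveMQDestination, ReadWriteLock> locks =
>             new ConcurrentHashMap<ActiveMQDestination, ReadWriteLock>();
>
>         // One lock per destination, created lazily; putIfAbsent resolves
>         // the race when two threads ask for the same destination at once.
>         public ReadWriteLock lockFor(ActiveMQDestination destination) {
>             ReadWriteLock lock = locks.get(destination);
>             if (lock == null) {
>                 ReadWriteLock candidate = new ReentrantReadWriteLock();
>                 ReadWriteLock existing = locks.putIfAbsent(destination, candidate);
>                 lock = existing != null ? existing : candidate;
>             }
>             return lock;
>         }
>     }
>
> The purge thread would take lockFor(dest).writeLock() around GCing just
> that one destination, and addProducer/addConsumer would take
> lockFor(dest).readLock(), so a producer on ‘bar’ never waits on the GC of
> ‘foo’.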
>
> ….
>
> Now, initially I was thinking I would just fix this myself. However, I
> might have a workaround for our queue design that uses fewer queues, and I
> think this will drop our queue requirement from a few thousand to a few
> dozen.  At that point this won’t be as much of a priority.
>
> However, this is a significant scalability issue with ActiveMQ… one that
> doesn’t need to exist.  In our situation I think our performance would be
> fine even with 7500 queues once this bug is fixed.
>
> Perhaps it should just exist as an open JIRA that could be fixed at some
> time in the future?
>
> I can also get time to clean up a project with a test which demonstrates
> this problem.
>
> Kevin
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
