activemq-users mailing list archives

From Timothy Bish <tabish...@gmail.com>
Subject Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.
Date Tue, 24 Feb 2015 18:09:17 GMT
On 02/24/2015 01:01 PM, Kevin Burton wrote:
> How long do the tests usually take?

An eternity + 1hr

> I just gave up after about 45 minutes… I think part of it was that the
> box I was testing on was virtualized and didn’t have enough resources.
>
> I tried to parallelize the tests (-T 8 with maven) but I got other errors
> so I assume the ports are singletons.
They won't be happy if you run them in parallel.

>
> On Tue, Feb 24, 2015 at 8:03 AM, Gary Tully <gary.tully@gmail.com> wrote:
>
>> If there are any test failures, try to run them individually with
>> -Dtest=a,b etc. There may be an issue with a full test run, but all of
>> the tests that are enabled should work. I know there are some issues
>> with jdbc tests that hang or fail due to previous runs not cleaning up,
>> but that should be most of it. I got a bunch of full test runs before
>> the 5.11 release, if that is any help.
>>
>> On 23 February 2015 at 20:38, Kevin Burton <burton@spinn3r.com> wrote:
>>> OK.  This is ready to go and I have a patch branch:
>>>
>>> https://issues.apache.org/jira/browse/AMQ-5609
>>>
>>> I’m stuck at the moment though because tests don’t pass.  But tests
>>> were failing before my changes, so I don’t think the failures have
>>> anything to do with them.
>>>
>>>
>>>
>>> On Sun, Feb 22, 2015 at 11:11 PM, Kevin Burton <burton@spinn3r.com>
>>> wrote:
>>>> Actually, is the lock even needed here?  Why would it be?  If we’re
>>>> *removing* a subscription, why does it care if we possibly ALSO remove
>>>> a separate / isolated queue before/after the subscription is removed?
>>>>
>>>> I think this is redundant and can be removed.  Maybe I’m wrong though.
>>>>
>>>> I looked at all the callers and none were associated with queues.
>>>>
>>>> On Sun, Feb 22, 2015 at 11:07 PM, Kevin Burton <burton@spinn3r.com>
>>>> wrote:
>>>>> So I have some working/theoretical code that should resolve this.
>>>>>
>>>>> It acquires a lock *per* ActiveMQDestination, so that there is no
>>>>> lock contention across destinations.
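For reference, a per-destination lock table along those lines might look
roughly like this (a sketch only, not the actual patch; the class and
method names are made up, and computeIfAbsent assumes Java 8):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class DestinationLocks {
        // One lock per destination name, created lazily and atomically,
        // so two threads always get the same lock for the same
        // destination and different destinations never contend.
        private final ConcurrentHashMap<String, ReadWriteLock> locks =
                new ConcurrentHashMap<>();

        ReadWriteLock lockFor(String destinationName) {
            return locks.computeIfAbsent(
                    destinationName, name -> new ReentrantReadWriteLock());
        }
    }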
>>>>>
>>>>> But here’s where I’m stuck.
>>>>>
>>>>>>      @Override
>>>>>>      public void removeSubscription(ConnectionContext context,
>>>>>>              RemoveSubscriptionInfo info) throws Exception {
>>>>>>          inactiveDestinationsPurgeLock.readLock().lock();
>>>>>>          try {
>>>>>>              topicRegion.removeSubscription(context, info);
>>>>>>          } finally {
>>>>>>              inactiveDestinationsPurgeLock.readLock().unlock();
>>>>>>          }
>>>>>>      }
>>>>>
>>>>> … this is in RegionBroker.
>>>>>
>>>>> There is no ActiveMQDestination involved here, so I’m not sure of the
>>>>> best way to resolve this.
>>>>>
>>>>> Any advice?
>>>>>
>>>>>
>>>>> On Sun, Feb 22, 2015 at 8:11 PM, Kevin Burton <burton@spinn3r.com>
>>>>> wrote:
>>>>>> Yes.  That was my thinking too… that just replacing the
>>>>>> CopyOnWriteArraySet with something more performant would solve the
>>>>>> issue.
>>>>>>
>>>>>> This would also improve queue creation time as well as queue deletion
>>>>>> time.
>>>>>>
>>>>>> What I think I’m going to do in the meantime is:
>>>>>>
>>>>>> - implement a granular lock based on queue name (see the sketch
>>>>>> after this list)… I am going to use an interface so we can replace
>>>>>> the implementation later.
>>>>>>
>>>>>> - implement timing for the purge thread so it tracks how long it
>>>>>> takes to remove a queue but also how long the entire loop takes.
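A minimal sketch of what that interface and the timing could look like;
the names here are hypothetical, not taken from the branch:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReadWriteLock;

    // Hypothetical shape of the lock interface, so the map-based
    // implementation can be swapped out later.
    interface QueueLockProvider {
        ReadWriteLock lockFor(String queueName);
    }

    class PurgeTimer {
        // Times a single queue removal; the Runnable stands in for the
        // real removal work done by the purge thread.
        static long timeMillis(Runnable removal) {
            long t0 = System.nanoTime();
            removal.run();
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);
        }
    }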
>>>>>>
>>>>>> I’ll do this on a branch so it should be easy to merge.
>>>>>>
>>>>>> On Sun, Feb 22, 2015 at 7:40 PM, Tim Bain <tbain@alumni.duke.edu>
>>>>>> wrote:
>>>>>>> A decent amount of the time is being spent calling remove() on
>>>>>>> various array-backed collections.  Those data structures might be
>>>>>>> inappropriate for the number of destinations you're running, since
>>>>>>> array-backed collections tend to have add/remove operations that
>>>>>>> are O(N); some improvement might come from something as simple as
>>>>>>> moving to a ConcurrentHashSet instead of a CopyOnWriteArraySet,
>>>>>>> for example.  (Or it might make performance worse because of other
>>>>>>> aspects of how those collections are used; people other than me
>>>>>>> would be in a better position to evaluate the full range of
>>>>>>> performance requirements for those collections.)
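One wrinkle: the JDK has no class literally named ConcurrentHashSet; the
usual stand-in is a Set view over ConcurrentHashMap. A sketch, with the
element type chosen only for illustration:

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    class ConsumerSet {
        // A Set view over ConcurrentHashMap gives roughly O(1)
        // add/remove, versus CopyOnWriteArraySet, which copies the
        // whole backing array on every mutation (O(N)).
        private final Set<Object> consumers =
                Collections.newSetFromMap(new ConcurrentHashMap<Object, Boolean>());
        // Java 8+: ConcurrentHashMap.newKeySet() is equivalent.
    }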
>>>>>>>
>>>>>>> Scheduler.cancel() also takes an alarming amount of time for what
>>>>>>> looks like a really simple method (
>>>>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-all/5.10.0/org/apache/activemq/thread/Scheduler.java#Scheduler.cancel%28java.lang.Runnable%29
>>>>>>> ).
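The likely cost there is java.util.Timer.purge(), which scans the
timer's entire task queue on every call (Timer.purge also shows up in
the profiles below). For contrast, a ScheduledThreadPoolExecutor (not
what this Scheduler uses; shown only as an illustration) can unlink
cancelled tasks immediately:

    import java.util.concurrent.ScheduledThreadPoolExecutor;

    class SchedulerAlternative {
        static ScheduledThreadPoolExecutor create() {
            ScheduledThreadPoolExecutor stpe = new ScheduledThreadPoolExecutor(1);
            // Java 7+: cancelled tasks are removed from the queue at
            // cancellation time instead of by a full-queue scan.
            stpe.setRemoveOnCancelPolicy(true);
            return stpe;
        }
    }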
>>>>>>>
>>>>>>> On Sun, Feb 22, 2015 at 7:56 PM, Kevin Burton <burton@spinn3r.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Here’s a jprofiler view with the advisory support enabled, if
>>>>>>>> you’re curious.
>>>>>>>>
>>>>>>>> http://i.imgur.com/I1jesZz.jpg
>>>>>>>>
>>>>>>>> I’m not familiar enough with the internals of ActiveMQ to have
>>>>>>>> any obvious optimization ideas.
>>>>>>>>
>>>>>>>> One other idea I had (which would require a ton of refactoring,
>>>>>>>> I think) would be to potentially bulk delete all the queues at
>>>>>>>> once.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Feb 22, 2015 at 6:42 PM, Kevin Burton <burton@spinn3r.com>
>>>>>>>> wrote:
>>>>>>>>> And spending some more time in jprofiler, it looks like 20% of
>>>>>>>>> this is due to schedulerSupport and the other 80% is due to
>>>>>>>>> advisorySupport.
>>>>>>>>>
>>>>>>>>> If I set both to false, the total runtime of my tests drops in
>>>>>>>>> half… and the latencies fall from
>>>>>>>>>
>>>>>>>>> max create producer latency: 10,176 ms
>>>>>>>>>> max create message on existing producer and consumer: 2 ms
>>>>>>>>>
>>>>>>>>> … to
>>>>>>>>>
>>>>>>>>> max create producer latency: 1 ms
>>>>>>>>>> max create message on existing producer and consumer: 1 ms
>>>>>>>>>
>>>>>>>>> and this is without even fixing the purge background lock.
>>>>>>>>>
>>>>>>>>> So the question now is: what the heck is the advisory support
>>>>>>>>> doing that can result in such massive performance overhead?
>>>>>>>>>
>>>>>>>>> … and I think advisorySupport is enabled by default, so that’s
>>>>>>>>> problematic as well.
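For reference, both flags are plain setters on BrokerService (and
advisorySupport does default to true), so the configuration for the
faster run above was presumably along these lines:

    import org.apache.activemq.broker.BrokerService;

    class NoAdvisoryBroker {
        static BrokerService create() throws Exception {
            BrokerService broker = new BrokerService();
            broker.setPersistent(false);       // matches the tests here
            broker.setAdvisorySupport(false);  // no advisory topic traffic
            broker.setSchedulerSupport(false); // no scheduler timer/store
            broker.start();
            return broker;
        }
    }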
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Feb 22, 2015 at 4:45 PM, Kevin Burton <burton@spinn3r.com>
>>>>>>>>> wrote:
>>>>>>>>>> OK.  Loaded up JProfiler and confirmed that it’s not LevelDB.
>>>>>>>>>>
>>>>>>>>>> This is a non-persistent broker I’m testing on.
>>>>>>>>>>
>>>>>>>>>> Looks like it’s spending all its time in
>>>>>>>>>> CopyOnWriteArrayList.remove and Timer.purge…
>>>>>>>>>>
>>>>>>>>>> Which is hopeful, because this is ALL due to ActiveMQ internals,
>>>>>>>>>> and in theory LevelDB should perform well if we improve the
>>>>>>>>>> performance of ActiveMQ internals and fix this lock bug.
>>>>>>>>>>
>>>>>>>>>> Which would rock!
>>>>>>>>>>
>>>>>>>>>> It should ALSO make queue creation faster.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sun, Feb 22, 2015 at 4:10 PM, Kevin Burton <burton@spinn3r.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Feb 22, 2015 at 3:30 PM, Tim Bain <tbain@alumni.duke.edu>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> So if LevelDB cleanup during removeDestination() is the
>>>>>>>>>>>> presumed culprit, can we spin off the LevelDB cleanup work
>>>>>>>>>>>> into a separate thread (better: a task object to be serviced
>>>>>>>>>>>> by a ThreadPool so you can avoid a fork bomb if we remove
>>>>>>>>>>>> many destinations at once) so the call to removeDestination()
>>>>>>>>>>>> can return quickly and LevelDB can do its record-keeping in
>>>>>>>>>>>> the background without blocking message-processing?
>>>>>>>>>>>>
>>>>>>>>>>> Would that be possible?  If the delete is pending on ActiveMQ,
>>>>>>>>>>> there is a race where a producer could re-create it unless the
>>>>>>>>>>> lock is held.  Though I guess if you dispatched to the GC
>>>>>>>>>>> thread WITH the lock still held you would be OK, but I think
>>>>>>>>>>> if we use the existing purge thread then we’re fine.
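A rough sketch of that hand-off idea, with made-up names (none of this
is the actual broker code): do the in-memory removal under the purge
write lock, so a producer can't re-create the queue mid-removal, and
push only the slow store cleanup onto a background executor.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class PurgeHandOff {
        private final ReentrantReadWriteLock purgeLock = new ReentrantReadWriteLock();
        private final ExecutorService cleanupPool = Executors.newSingleThreadExecutor();

        void purge(final String queueName) {
            purgeLock.writeLock().lock();
            try {
                removeFromMemory(queueName);        // fast, in-memory
                cleanupPool.submit(new Runnable() { // slow store I/O, off-thread
                    public void run() { removeFromStore(queueName); }
                });
            } finally {
                purgeLock.writeLock().unlock();
            }
        }

        private void removeFromMemory(String q) { /* hypothetical */ }
        private void removeFromStore(String q)  { /* hypothetical */ }
    }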
>>>>>>>>>>>
>>>>>>>>>>> OK. I think I’m wrong about LevelDB being the issue.  To be
>>>>>>>>>>> fair, I wasn’t 100% certain before, but I should have
>>>>>>>>>>> specified.
>>>>>>>>>>>
>>>>>>>>>>> Our current production broker is running with
>>>>>>>>>>> persistent=false… and I just re-ran the tests without disk
>>>>>>>>>>> persistence and it has the same problem.
>>>>>>>>>>> So the main issue now is why the heck ActiveMQ is taking SO
>>>>>>>>>>> LONG to GC a queue.  It’s taking about 100 ms, which is an
>>>>>>>>>>> insane amount of time considering this is all done in memory.
>>>>>>>>>>>
>>>>>>>>>>> Kevin
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> Founder/CEO Spinn3r.com
>>>>>>>>>>> Location: *San Francisco, CA*
>>>>>>>>>>> blog: http://burtonator.wordpress.com
>>>>>>>>>>> … or check out my Google+ profile
>>>>>>>>>>> <https://plus.google.com/102718274791889610666/posts>
>>>>>>>>>>> <http://spinn3r.com>
>>>>>>>>>>>
>>>>>>>>>>>


-- 
Tim Bish
Sr Software Engineer | RedHat Inc.
tim.bish@redhat.com | www.redhat.com
skype: tabish121 | twitter: @tabish121
blog: http://timbish.blogspot.com/

