activemq-users mailing list archives

From Hadrian Zbarcea <hzbar...@gmail.com>
Subject Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.
Date Tue, 24 Feb 2015 23:31:10 GMT
Kevin,

Go for it.

Yes, our tests need a lot of love. Actually most of the tests are not 
really unit tests, but more like integration tests.
Helping with the spring (as in season) cleanup would be immensely 
appreciated.

Hadrian


On 02/24/2015 02:18 PM, Kevin Burton wrote:
> Ah.  OK… I assume continuous integration isn’t a goal.  I’m upset if my
> testing takes more than 20 minutes :-P
>
> I bet that this would be taken down to an hour if they were parallelized.
> But of course someone needs to take the time to get that done.
>
> I might be allocating some of our internal dev resources to getting our
> build to parallelize across containers.  If this is generalizable to other
> projects I’ll OSS it.. Would be great to get better testing feedback.
>
> Right now I think my patches are ready to be merged but I’d feel more
> comfortable suggesting it if I knew the tests worked.
>
> Should I propose it now and if there are any bugs we can fix them later?
>
> Kevin
>
> On Tue, Feb 24, 2015 at 10:54 AM, Timothy Bish <tabish121@gmail.com> wrote:
>
>> On 02/24/2015 01:22 PM, Kevin Burton wrote:
>>
>>> oh  geez… seriously though.  can you give me an estimate?
>>>
>>> Assuming 30 seconds each I’m assuming about 11 hours.
>>>
>> That's probably not too far off; usually I just let them run overnight for
>> the all profile.
>>
>>
>>> http://activemq.apache.org/junit-reports.html
>>> https://hudson.apache.org/hudson/job/ActiveMQ/
>>> Hudson is down for ActiveMQ.
>>>
>>> Also, is there a disk space requirement for the tests? I have 10GB free or
>>> so and I’m getting TONS of these:
>>>
>> I don't think anyone has looked into the minimum free space required for the
>> tests to run.
>>
>>
>>
>>> java.lang.RuntimeException: Failed to start provided job scheduler store:
>>> JobSchedulerStore:activemq-data
>>>
>>> at java.io.RandomAccessFile.seek(Native Method)
>>> at org.apache.activemq.util.RecoverableRandomAccessFile.seek(RecoverableRandomAccessFile.java:384)
>>> at org.apache.activemq.store.kahadb.disk.page.PageFile.readPage(PageFile.java:877)
>>> at org.apache.activemq.store.kahadb.disk.page.Transaction.load(Transaction.java:427)
>>> at org.apache.activemq.store.kahadb.disk.page.Transaction.load(Transaction.java:377)
>>> at org.apache.activemq.store.kahadb.disk.index.BTreeIndex.load(BTreeIndex.java:159)
>>> at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl$MetaData.load(JobSchedulerStoreImpl.java:90)
>>> at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl$3.execute(JobSchedulerStoreImpl.java:277)
>>> at org.apache.activemq.store.kahadb.disk.page.Transaction.execute(Transaction.java:779)
>>> at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl.doStart(JobSchedulerStoreImpl.java:261)
>>> at org.apache.activemq.util.ServiceSupport.start(ServiceSupport.java:55)
>>> at org.apache.activemq.broker.BrokerService.setJobSchedulerStore(BrokerService.java:1891)
>>> at org.apache.activemq.transport.stomp.StompTestSupport.createBroker(StompTestSupport.java:174)
>>> at org.apache.activemq.transport.stomp.StompTestSupport.startBroker(StompTestSupport.java:112)
>>> at org.apache.activemq.transport.stomp.StompTestSupport.setUp(StompTestSupport.java:94)
>>> at org.apache.activemq.transport.stomp.Stomp12Test.setUp(Stomp12Test.java:41)
>>> at org.apache.activemq.transport.stomp.Stomp12SslAuthTest.setUp(Stomp12SslAuthTest.java:38)
>>>
>>> On Tue, Feb 24, 2015 at 10:09 AM, Timothy Bish <tabish121@gmail.com>
>>> wrote:
>>>
>>>> On 02/24/2015 01:01 PM, Kevin Burton wrote:
>>>>> How long do the tests usually take?
>>>> An eternity + 1hr
>>>>> I’m looking at 45 minutes right now before I gave up… I think part of it
>>>>> was that the box I was testing on was virtualized and didn’t have enough
>>>>> resources.
>>>>>
>>>>> I tried to parallelize the tests (-T 8 with maven) but I got other errors
>>>>> so I assume the ports are singletons.
>>>> They won't be happy if you run them parallel.
>>>>
>>>>> On Tue, Feb 24, 2015 at 8:03 AM, Gary Tully <gary.tully@gmail.com> wrote:
>>>>>
>>>>>> if there are any test failures - try to run them individually
>>>>>> -Dtest=a,b etc. There may be an issue with a full test run, but all of
>>>>>> the tests that are enabled should work. I know there are some issues
>>>>>> with jdbc tests that hang or fail due to previous runs not cleaning up,
>>>>>> but that should be the most of it. I got a bunch of full test runs
>>>>>> before the 5.11 release if that is any help.
>>>>>>
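For reference, the -Dtest switch Gary mentions is the standard Maven Surefire
property, so individual classes can be run from the module that contains them,
e.g. "mvn test -Dtest=RegionBrokerTest" run inside the activemq-unit-tests
module (the class name here is only illustrative).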
>>>>>> On 23 February 2015 at 20:38, Kevin Burton <burton@spinn3r.com> wrote:
>>>>>>
>>>>>>> OK.  This is ready to go and I have a patch branch:
>>>>>>> https://issues.apache.org/jira/browse/AMQ-5609
>>>>>>>
>>>>>>> I’m stuck at the moment though because tests don’t pass.  But it was
>>>>>>> failing tests before so I don’t think it has anything to do with my
>>>>>>> changes.
>>>>>>>
>>>>>>> On Sun, Feb 22, 2015 at 11:11 PM, Kevin Burton <burton@spinn3r.com> wrote:
>>>>>>>> Actually, is the lock even needed here?  Why would it be?  if we’re
>>>>>>>> *removing* a subscription, why does it care if we possibly ALSO remove
>>>>>>>> a separate / isolated queue before/after the subscription is removed.
>>>>>>>>
>>>>>>>> I think this is redundant and can be removed.  Maybe I’m wrong though.
>>>>>>>>
>>>>>>>> I looked at all the callers and none were associated with queues.
>>>>>>>>
>>>>>>>> On Sun, Feb 22, 2015 at 11:07 PM, Kevin Burton <burton@spinn3r.com> wrote:
>>>>>>>>> So I have some working/theoretical code that should resolve this.  It
>>>>>>>>> acquires a lock *per* ActiveMQDestination, this way there is no lock
>>>>>>>>> contention.
>>>>>>>>>
>>>>>>>>> But here’s where I’m stuck.
>>>>>>>>>
>>>>>>>>>     @Override
>>>>>>>>>     public void removeSubscription(ConnectionContext context,
>>>>>>>>>             RemoveSubscriptionInfo info) throws Exception {
>>>>>>>>>         inactiveDestinationsPurgeLock.readLock().lock();
>>>>>>>>>         try {
>>>>>>>>>             topicRegion.removeSubscription(context, info);
>>>>>>>>>         } finally {
>>>>>>>>>             inactiveDestinationsPurgeLock.readLock().unlock();
>>>>>>>>>         }
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> .. this is in RegionBroker.
>>>>>>>>>
>>>>>>>>> There is no ActiveMQDestination involved here so I’m not sure the best
>>>>>>>>> way to resolve this.
>>>>>>>>>
>>>>>>>>> Any advice?
>>>>>>>>>
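One way the per-destination locking could look, sketched outside of ActiveMQ
(the DestinationLocks class, its method names, and the idea of a broker-wide
fallback lock for calls like removeSubscription that carry no
ActiveMQDestination are all assumptions, not actual broker code):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    /**
     * Sketch of granular purge locking: one ReadWriteLock per destination name,
     * plus a global lock for operations that are not tied to a single
     * destination (e.g. a removeSubscription call that only has a
     * ConnectionContext and RemoveSubscriptionInfo in hand).
     */
    public class DestinationLocks {

        private final ConcurrentMap<String, ReadWriteLock> locks =
                new ConcurrentHashMap<String, ReadWriteLock>();
        private final ReadWriteLock global = new ReentrantReadWriteLock();

        /** Lock guarding a single destination; created lazily on first use. */
        public ReadWriteLock forDestination(String physicalName) {
            ReadWriteLock lock = locks.get(physicalName);
            if (lock == null) {
                ReadWriteLock candidate = new ReentrantReadWriteLock();
                ReadWriteLock existing = locks.putIfAbsent(physicalName, candidate);
                lock = existing != null ? existing : candidate;
            }
            return lock;
        }

        /** Fallback for callers that have no ActiveMQDestination available. */
        public ReadWriteLock global() {
            return global;
        }
    }

The intent in this sketch is that the purge thread takes the write lock only
for the destination it is actually removing, so ordinary traffic on other
queues never contends with it.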
>>>>>>>>> On Sun, Feb 22, 2015 at 8:11 PM, Kevin Burton <burton@spinn3r.com> wrote:
>>>>>>>>>> Yes.  That was my thinking too.. that just replacing the
>>>>>>>>>> CopyOnWriteArraySet with something more performant would solve the
>>>>>>>>>> issue.  This would also improve queue creation time as well as queue
>>>>>>>>>> deletion time.
>>>>>>>>>>
>>>>>>>>>> What I think I’m going to do in the mean time is:
>>>>>>>>>>
>>>>>>>>>> - implement a granular lock based on queue name… I am going to use an
>>>>>>>>>> interface so we can replace the implementation later.
>>>>>>>>>>
>>>>>>>>>> - implement timing for the purge thread so it tracks how long it takes
>>>>>>>>>> to remove a queue but also how long the entire loop takes.
>>>>>>>>>>
>>>>>>>>>> I’ll do this on a branch so it should be easy to merge.
>>>>>>>>>>
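A minimal, self-contained sketch of the purge-loop timing idea (class and
method names are invented; the real purge thread lives in the broker and would
log through its own logger):

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    public class PurgeTimingSketch {

        public static void main(String[] args) {
            // Stand-ins for the destinations the purge thread would remove.
            List<String> candidates = Arrays.asList("queue://a", "queue://b", "queue://c");

            long loopStart = System.nanoTime();
            for (String dest : candidates) {
                long start = System.nanoTime();
                removeDestination(dest);  // placeholder for the real removal
                System.out.printf("purged %s in %d ms%n",
                        dest, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
            }
            System.out.printf("purge pass over %d destinations took %d ms%n",
                    candidates.size(),
                    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - loopStart));
        }

        private static void removeDestination(String dest) {
            // The real implementation would call the broker's removeDestination(...).
        }
    }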
>>>>>>>>>> On Sun, Feb 22, 2015 at 7:40 PM, Tim Bain <tbain@alumni.duke.edu> wrote:
>>>>>>>>>>> A decent amount of the time is being spent calling remove() on
>>>>>>>>>>> various array-backed collections.  Those data structures might be
>>>>>>>>>>> inappropriate for the number of destinations you're running, since
>>>>>>>>>>> array-backed collections tend to have add/remove operations that are
>>>>>>>>>>> O(N); some improvement might come from something as simple as moving
>>>>>>>>>>> to a ConcurrentHashSet instead of a CopyOnWriteArraySet, for example.
>>>>>>>>>>> (Or it might make performance worse because of other aspects of how
>>>>>>>>>>> those collections are used; people other than me would be in a better
>>>>>>>>>>> position to evaluate the full range of performance requirements for
>>>>>>>>>>> those collections.)
>>>>>>>>>>>
>>>>>>>>>>> Scheduler.cancel() also takes an alarming amount of time for what
>>>>>>>>>>> looks like a really simple method (
>>>>>>>>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-all/5.10.0/org/apache/activemq/thread/Scheduler.java#Scheduler.cancel%28java.lang.Runnable%29
>>>>>>>>>>> ).
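To make Tim's point concrete: CopyOnWriteArraySet.remove() copies the whole
backing array on every call, so removal cost grows with the number of
destinations, whereas a hash-based concurrent set removes in roughly constant
time. A small JDK-only comparison (not ActiveMQ code; Collections.newSetFromMap
over a ConcurrentHashMap is used here as a stand-in for the "ConcurrentHashSet"
mentioned above):

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArraySet;

    public class SetRemovalSketch {

        private static final int N = 20_000;

        public static void main(String[] args) {
            // CopyOnWriteArraySet: every remove() copies the backing array -> O(N).
            Set<String> cow = new CopyOnWriteArraySet<String>();
            // Hash-based concurrent set: remove() is ~O(1) on average.
            Set<String> hashed =
                    Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

            for (int i = 0; i < N; i++) {
                String name = "queue-" + i;
                cow.add(name);
                hashed.add(name);
            }

            time("CopyOnWriteArraySet", cow);
            time("concurrent hash set", hashed);
        }

        private static void time(String label, Set<String> set) {
            long start = System.nanoTime();
            for (int i = 0; i < N; i++) {
                set.remove("queue-" + i);
            }
            System.out.printf("%s: removed %d entries in %d ms%n",
                    label, N, (System.nanoTime() - start) / 1_000_000);
        }
    }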
>>>>>>>>>>> On Sun, Feb 22, 2015 at 7:56 PM, Kevin Burton <burton@spinn3r.com> wrote:
>>>>>>>>>>>> Here’s a jprofiler view with the advisory support enabled if you’re
>>>>>>>>>>>> curious.
>>>>>>>>>>>>
>>>>>>>>>>>> http://i.imgur.com/I1jesZz.jpg
>>>>>>>>>>>>
>>>>>>>>>>>> I’m not familiar with the internals of ActiveMQ enough to have any
>>>>>>>>>>>> obvious optimization ideas.
>>>>>>>>>>>>
>>>>>>>>>>>> One other idea I had (which would require a ton of refactoring I
>>>>>>>>>>>> think) would be to potentially bulk delete all the queues at once.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, Feb 22, 2015 at 6:42 PM, Kevin Burton <burton@spinn3r.com> wrote:
>>>>>>>>>>>>> And spending some more time in jprofiler, looks like 20% of this is
>>>>>>>>>>>>> due to schedulerSupport and the other 80% of this is due to
>>>>>>>>>>>>> advisorySupport.
>>>>>>>>>>>>>
>>>>>>>>>>>>> if I set both to false the total runtime of my tests drops in half…
>>>>>>>>>>>>> and the latencies fall from
>>>>>>>>>>>>>
>>>>>>>>>>>>> max create producer latency: 10,176 ms
>>>>>>>>>>>>> max create message on existing producer and consumer: 2 ms
>>>>>>>>>>>>>
>>>>>>>>>>>>> … to
>>>>>>>>>>>>>
>>>>>>>>>>>>> max create producer latency: 1 ms
>>>>>>>>>>>>> max create message on existing producer and consumer: 1 ms
>>>>>>>>>>>>>
>>>>>>>>>>>>> and this isn’t without fixing the purge background lock.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So the question now is what the heck is the advisory support doing
>>>>>>>>>>>>> that can result in such massive performance overhead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> … and I think advisorySupport is enabled by default so that’s
>>>>>>>>>>>>> problematic as well.
>>>>>>>>>>>>>
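Both flags are ordinary BrokerService properties, so a test broker matching the
configuration described above can be started roughly like this (embedded-broker
sketch; the connector URI is only an example):

    import org.apache.activemq.broker.BrokerService;

    public class NoAdvisoryBrokerSketch {

        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();
            broker.setPersistent(false);        // matches the non-persistent setup discussed here
            broker.setAdvisorySupport(false);   // skip advisory topics (on by default)
            broker.setSchedulerSupport(false);  // skip the job scheduler store
            broker.addConnector("tcp://localhost:61616");
            broker.start();
            // ... run the producer/consumer workload, then:
            broker.stop();
        }
    }

The same switches are available as advisorySupport="false" and
schedulerSupport="false" attributes on the broker element in activemq.xml.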
>>>>>>>>>>>>> On Sun, Feb 22, 2015 at 4:45 PM, Kevin Burton <burton@spinn3r.com> wrote:
>>>>>>>>>>>>>> OK.  Loaded up JProfiler and confirmed that it’s not LevelDB.
>>>>>>>>>>>>>> This is a non-persistent broker I’m testing on.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looks like it’s spending all its time in CopyOnWriteArrayList.remove
>>>>>>>>>>>>>> and Timer.purge…
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which is hopeful because this is ALL due to ActiveMQ internals and
>>>>>>>>>>>>>> in theory LevelDB should perform well if we improve the performance
>>>>>>>>>>>>>> of ActiveMQ internals and fix this lock bug.  Which would rock!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It should ALSO make queue creation faster.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Feb 22, 2015 at 4:10 PM, Kevin Burton <burton@spinn3r.com> wrote:
>>>>>>>>>>>>>>> On Sun, Feb 22, 2015 at 3:30 PM, Tim Bain <tbain@alumni.duke.edu> wrote:
>>>>>>>>>>>>>>>> So if LevelDB cleanup during removeDestination() is the presumed
>>>>>>>>>>>>>>>> culprit, can we spin off the LevelDB cleanup work into a separate
>>>>>>>>>>>>>>>> thread (better: a task object to be serviced by a ThreadPool so
>>>>>>>>>>>>>>>> you can avoid a fork bomb if we remove many destinations at once)
>>>>>>>>>>>>>>>> so the call to removeDestination() can return quickly and LevelDB
>>>>>>>>>>>>>>>> can do its record-keeping in the background without blocking
>>>>>>>>>>>>>>>> message-processing?
>>>>>>>>>>>>>>> Would that be possible?  If the delete is pending on ActiveMQ
>>>>>>>>>>>>>>> there is a race where a producer could re-create it unless the
>>>>>>>>>>>>>>> lock is held.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Though I guess if you dispatched to the GC thread WITH the lock
>>>>>>>>>>>>>>> still held you would be ok but I think if we use the existing
>>>>>>>>>>>>>>> purge thread then we’re fine.
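A sketch of the hand-off being discussed, assuming a single shared worker
rather than a thread per destination so a burst of removals cannot turn into a
fork bomb (class and method names here are invented, not broker code):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /**
     * Sketch: removeDestination() unregisters the destination synchronously
     * (while any required lock is held, so a producer cannot re-create it
     * mid-removal) and queues the heavy store cleanup; one background worker
     * drains the queue off the message-processing path.
     */
    public class AsyncDestinationGc {

        private final ExecutorService gcWorker = Executors.newSingleThreadExecutor();

        public void removeDestination(final String destinationName) {
            // ... synchronous unregistration happens here, then:
            gcWorker.execute(new Runnable() {
                @Override
                public void run() {
                    // Let the store (KahaDB/LevelDB) do its record-keeping in
                    // the background.
                    cleanUpStore(destinationName);
                }
            });
        }

        private void cleanUpStore(String destinationName) {
            // placeholder for the persistence-adapter cleanup
        }

        public void shutdown() {
            gcWorker.shutdown();
        }
    }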
>>>>>>>>>>>>>>> OK. I think I’m wrong about LevelDB being the issue.  To be fair
>>>>>>>>>>>>>>> I wasn’t 100% certain before but I should have specified.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Our current production broker is running with persistent=false..
>>>>>>>>>>>>>>> and I just re-ran the tests without disk persistence and it has
>>>>>>>>>>>>>>> the same problem.
>>>>>>>>>>>>>>> So the main issue now is why the heck is ActiveMQ taking SO LONG
>>>>>>>>>>>>>>> to GC a queue.  It’s taking about 100ms which is an insane amount
>>>>>>>>>>>>>>> of time considering this is done all in memory.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kevin
>> --
>> Tim Bish
>> Sr Software Engineer | RedHat Inc.
>> tim.bish@redhat.com | www.redhat.com
>> skype: tabish121 | twitter: @tabish121
>> blog: http://timbish.blogspot.com/
>>
>>
>

