felix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Felix Meschberger <fmesc...@adobe.com>
Subject Re: Concurrency problems in SCR
Date Thu, 08 Mar 2012 04:39:50 GMT
Hi David,

Am 02.03.2012 um 15:07 schrieb David Jencks:

> Hi,
> 
> We've now run into the problem I found in an integration test in a "real life" server.
> 
> It would be really great if you could review and either apply or comment on these patches:
> 
> FELIX-3345-2.diff fixes the initial set of concurrency problems we found
> FELIX-3345-5.diff fixes the problem found in the integration test

Great. Thanks. I have applied them in Revs. 1298266 and 1298267. Ignored the other patches
assumed them to be not needed, correct ?

> 
> In addition I've updated the build from the truly antique pax components currently used
to the latest ones and proposed running the integration tests with the failsafe plugin.  These
patches are in
> FELIX-3371-3.diff 
> or
> FELIX-3345-6.diff
> FELIX-3371-1.diff
> FELIX-3371-2.diff

Applied FELIX-3371-3.diff in Rev. 1298268

Ignored the FELIX-3371-1.diff and FELIX-3371-2.diff as well as the missing FELIX-3345-6.diff.

> 
> I really don't know how to investigate these problems further without more advice.  
At the moment I think that the best way to fix them is with null and invalidity checks, as
done in my patches.  If anyone has other ideas I can try to check them out in the environment
where we see these problems intermittently.

I think this is really best we can do. We cannot prevent these concurrency issues at all without
risking deadlocks, which IMHO is much worse.

Thanks alot for the diligent work you put into this. Very much appreciated.

Regards
Felix

> 
> thanks!
> david jencks
> 
> 
> 
> On Feb 21, 2012, at 11:56 PM, David Jencks wrote:
> 
>> Hi Felix,
>> 
>> Sorry for the delay, I've been struggling to come up with an independent way to reproduce
this and finally have something similar in an integration test which I've attached to the
jira with a stack trace.
>> 
>> By messing with the timing I've gotten into a state where we're trying to log on
a BundleComponentActivator that has been shut down.  The activity being logged is different,
but the NPE is the same.  Admittedly I added the log statement in question....
>> 
>> The basic scenario is two threads shutting stuff down concurrently, and an unbind
method that takes a long time to return.
>> 
>> bundle A has a service s1
>> bundle B has a service s2 with a reference to s1, and an unbind method that takes
a long time.
>> 
>> Stop A in thread 1; the thread gets stuck in unbind of s2.
>> 
>> Stop B in thread 2; this shuts down B so it's BundleComponentActivator no longer
can log (the NPE).
>> 
>> If the unbind method in thread 1 now returns, it will try to log in B's BCA which
has no log service tracker.
>> 
>> At the moment I'm inclined to think that there is no way to prevent these kinds of
races between bundles shutting down and the best thing to do is null checks in many places
and catching exceptions from attempts to use stopped bundle contexts; this is what my first
two patches do.
>> 
>> BTW to get my test to work at all I started upgrading scr to more recent pax bundles.
 I think this would slightly simplify the code.  Also I think the more modern way to run integration
tests is with the maven-failsafe-plugin rather than a separate surefire plugin execution.
 Would you be interested in a patch for this?
>> 
>> thanks
>> david jencks
>> 
>> 
>> On Feb 16, 2012, at 12:41 AM, Felix Meschberger wrote:
>> 
>>> Hi David,
>>> 
>>> Thanks for reporting. I have seen the reported issue. I will look into it ASAP.
>>> 
>>> My guts feeling tells me that some issues are just a question of doing null-checks
properly while others might be more involved ...
>>> 
>>> I assume you have a scenario where you can reproduce reliably ?
>>> 
>>> Regards
>>> Felix
>>> 
>>> Am 15.02.2012 um 07:55 schrieb David Jencks:
>>> 
>>>> see FELIX-3345
>>>> 
>>>> We've been seeing intermittent exceptions from SCR which generally seem to
look like trying to unget a service on a bundle context, BundleComponentActivator, ComponentManager,
or DependencyManager that are shut down or being shut down.  I think there are 2 threads shutting
bundles down at once.  I'm not making much progress investigation exactly how this happens
so I'd really appreciate it if one of the experts could take a look at the stack traces in
the issue and attempt to guess whether there's a real concurrency bug or if the situations
we're seeing are expected when more than one thread is shutting down bundles at once, and
the "don't throw an exception" patch I provided would be appropriate.
>>>> 
>>>> To try to pique your interest here is one of the stack traces:
>>>> 
>>>> Stack Dump = org.osgi.framework.ServiceException: Exception in org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService()

>>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:287)

>>>> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistrationImpl.releaseService(ServiceRegistrationImpl.java:562)

>>>> at org.eclipse.osgi.internal.serviceregistry.ServiceRegistry.releaseServicesInUse(ServiceRegistry.java:665)

>>>> at org.eclipse.osgi.framework.internal.core.BundleContextImpl.close(BundleContextImpl.java:91)

>>>> at org.eclipse.osgi.framework.internal.core.BundleHost.stopWorker(BundleHost.java:514)

>>>> at org.eclipse.osgi.framework.internal.core.AbstractBundle.suspend(AbstractBundle.java:565)

>>>> at org.eclipse.osgi.framework.internal.core.Framework.suspendBundle(Framework.java:1161)

>>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.decFWSL(StartLevelManager.java:595)

>>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.doSetStartLevel(StartLevelManager.java:257)

>>>> at org.eclipse.osgi.framework.internal.core.StartLevelManager.shutdown(StartLevelManager.java:215)

>>>> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle.suspend(InternalSystemBundle.java:284)

>>>> at org.eclipse.osgi.framework.internal.core.Framework.shutdown(Framework.java:691)

>>>> at org.eclipse.osgi.framework.internal.core.Framework.close(Framework.java:598)

>>>> at org.eclipse.osgi.framework.internal.core.InternalSystemBundle$1.run(InternalSystemBundle.java:261)

>>>> at java.lang.Thread.run(Thread.java:680) 
>>>> Caused by: java.lang.NullPointerException 
>>>> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:614)

>>>> at org.apache.felix.scr.impl.BundleComponentActivator.log(BundleComponentActivator.java:589)

>>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager.log(AbstractComponentManager.java:633)

>>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.log(AbstractComponentManager.java:1000)

>>>> at org.apache.felix.scr.impl.manager.AbstractComponentManager$State.ungetService(AbstractComponentManager.java:964)

>>>> at org.apache.felix.scr.impl.manager.DelayedComponentManager.ungetService(DelayedComponentManager.java:114)

>>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse$3.run(ServiceUse.java:277)

>>>> at java.security.AccessController.doPrivileged(Native Method) 
>>>> at org.eclipse.osgi.internal.serviceregistry.ServiceUse.releaseService(ServiceUse.java:275)

>>>> ... 14 more 
>>>> 
>>>> I think the DelayedComponentManager.State here is Disposed but we've also
seen this trace with state Active but the bundle context stopped (so ungetting throws an exception).
>>>> 
>>>> many thanks
>>>> david jencks
>>>> 
>>> 
>> 
> 


Mime
View raw message