From "Bertrand Delacretaz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FELIX-3067) Prevent Deadlock Situation in Felix.acquireGlobalLock
Date Tue, 27 Nov 2012 10:55:58 GMT

    [ https://issues.apache.org/jira/browse/FELIX-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504527#comment-13504527 ]

Bertrand Delacretaz commented on FELIX-3067:

5.6 deadlock, stress test tool?
jenkins tests
Sling log markers

from53 test, not enough memory for 5.6, uses plain java instead of /usr/java/jdk1.6.0_35/bin/java

I can now reliably reproduce such deadlocks using my stress test tool,
https://github.com/bdelacretaz/osgi-stresser. It requires a few manual steps but generates
deadlocks after just a few seconds in my tests.

I'm using the Sling Launchpad for this, as it contains a number of bundles that can be
uninstalled, started and stopped (like crazy) to expose the problem. Frequent package refreshes
seem to expose deadlocks much more quickly.

Here's my failure scenario:
# Build Sling from http://svn.apache.org/repos/asf/sling/trunk, making sure it's using the
Felix trunk's framework and scr modules (patch follows)
# Start Sling:
## cd launchpad/builder
## rm -rf sling (if needed to remove all previous state)
## java -jar target/org.apache.sling.launchpad-7-SNAPSHOT-standalone.jar
## Optionally add -Dsling.launchpad.log.level=4 to set OSGi log level to DEBUG, use with my
FELIX-3785 patch to log locking operations
# Build the https://github.com/bdelacretaz/osgi-stresser bundle and install it from /system/console
at start level 1 (so that it doesn't stop itself)
# Connect to the tool's command line using telnet 1234

At this point the tool's stress test tasks can be started using the commands described at
https://github.com/bdelacretaz/osgi-stresser - or simply use 

* r

to start all tasks, at which point the tool should display something like

OSGI stresser> * r
sl task running - cycle time -1000 msec - levels=[3, 45, 8, 19, 30]
rp task running - cycle time 5000 msec - max wait for packages refresh=10000
ss task running - cycle time 0 msec - bundle to stop and restart=org.apache.sling.junit.core
bu task running - cycle time -1000 msec - ignored symbolic names (patterns)=[commons, org.apache.felix,
slf4j, ch.x42, log, org.osgi]
up task running - cycle time 0 msec - bundle to update=org.apache.sling.junit.core
OSGI stresser> 

The tasks then do crazy things to the OSGi framework, but (IMO) everything they do is allowed
by the spec, so it should not cause any deadlocks.

The sling/logs/error.log shows what the tasks are doing. A good way to detect a deadlock between
the global and bundle locks is to try to refresh /system/console, which will block if the locks cannot be
> Prevent Deadlock Situation in Felix.acquireGlobalLock
> -----------------------------------------------------
>                 Key: FELIX-3067
>                 URL: https://issues.apache.org/jira/browse/FELIX-3067
>             Project: Felix
>          Issue Type: Improvement
>          Components: Framework
>    Affects Versions: framework-3.0.7, framework-3.0.8, framework-3.0.9, framework-3.2.0,
framework-3.2.1, fileinstall-3.1.10
>            Reporter: Felix Meschberger
>         Attachments: FELIX-3067.patch
> Every now and then we encounter deadlock situations which involve the Felix.acquireGlobalLock
method. In our use case we have the following aspects which contribute to this:
> (a) The Apache Felix Declarative Services implementation stops components (and thus causes
service unregistration) while the bundle lock is being held because this happens in a SynchronousBundleListener
while handling the STOPPING bundle event. We have to do this to ensure the bundle is not really
stopped yet to properly stop the bundle's components.
> (b) Implementing a special class loader which involves dynamically resolving packages
which in turn uses the global lock
> (c) Eclipse Gemini Blueprint implementation which operates asynchronously
> (d) synchronization in application classes
> Oftentimes, I would assume that we can self-heal such complex deadlock situations if
we let acquireGlobalLock time out. Looking at the callers of acquireGlobalLock, there seems
to already be provision to handle this case, since acquireGlobalLock returns true only if the
global lock has actually been acquired.
> This issue is kind of a companion to FELIX-3000 where deadlocks involve sending service
registration events while holding the bundle lock.
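
The timeout idea in the issue description can be sketched roughly as follows. This is only an
illustration and not the Felix implementation or the attached patch: the class TimedGlobalLock,
its methods tryAcquire and release, and the timeout parameter are all hypothetical names. The
sketch shows a reentrant lock whose acquire method gives up after a deadline and returns true
only if the lock was actually acquired, so that a caller can back off instead of blocking forever.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; this is not Felix code. */
final class TimedGlobalLock {
    private final Object monitor = new Object();
    private Thread owner;   // thread currently holding the lock, or null if free
    private int holdCount;  // number of nested acquisitions by the owner

    /** Returns true only if the lock was acquired before the timeout elapsed. */
    boolean tryAcquire(long timeoutMillis) {
        final long deadline =
            System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (monitor) {
            // Wait while another thread owns the lock, but never past the deadline.
            while (owner != null && owner != Thread.currentThread()) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false; // timed out: the caller must handle the failure
                }
                try {
                    TimeUnit.NANOSECONDS.timedWait(monitor, remainingNanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
            owner = Thread.currentThread();
            holdCount++;
            return true;
        }
    }

    /** Releases one hold; wakes up waiting threads once the lock is free. */
    void release() {
        synchronized (monitor) {
            if (owner != Thread.currentThread()) {
                throw new IllegalStateException("current thread does not hold the lock");
            }
            if (--holdCount == 0) {
                owner = null;
                monitor.notifyAll();
            }
        }
    }
}
```

A caller would check the return value and call release() in a finally block only when
tryAcquire returned true. Whether a given caller can safely proceed or retry after a timeout
depends on the operation, which this sketch does not address.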

