felix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pierre De Rop (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FELIX-5471) Ensure that unbound services are always handled synchronously
Date Mon, 02 Jan 2017 08:20:58 GMT

    [ https://issues.apache.org/jira/browse/FELIX-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792465#comment-15792465
] 

Pierre De Rop commented on FELIX-5471:
--------------------------------------

Hello Jeroen;

The timeout I introduced last friday was because I wanted to avoid a corner case where lost
dependencies could be stopped before calling "removed" callbacks (I will explain this below).
But to be honest, I should not have added this timeout because I realize that it may generate
some deadlocks. I have added the FELIX5471_CyclicDependencyTest.java which demonstrates the
deadlock:
So, running this test with the timeout guard in the ComponentImpl.schedule method (ComponentImpl.java,
in revision 1776641) produces a deadlock, and after 30 seconds, we see this message log:

<code>
WARN - pool-3-thread-2 : [pool-3-thread-2] task could not be scheduled timely in component
org.apache.felix.dm.itest.api.FELIX5471_CyclidDependencyTest$A (exception: java.util.concurrent.TimeoutException)
WARN - pool-3-thread-1 : [pool-3-thread-1] task could not be scheduled timely in component
org.apache.felix.dm.itest.api.FELIX5471_CyclidDependencyTest$B (exception: java.util.concurrent.TimeoutException)
</code>

So, for now, I have committed a new patch (in revision 1776903.) which includes the FELIX5471_CyclicDependencyTest.java
test. The patch does not use the timeout in the ComponentImpl.schedule method anymore.

Now, actually, the real patch which (partially) resolves the issue you found is that now,
when we are
handling a REMOVED service event, then if a threadpool is used we try to not reschedule the
REMOVED event it it and we try to call the handleRemoved method synchronously. Initially,
the problem was that if a threadpool was used, then the REMOVED event was always scheduled
in it, and the event was always handled asynchronously.

So, let me try to explain deeper what was going on using the scenario you gave in the users
mailing list (that is: M depends on X, and M.remove(X) is called but at this point, X is already
stopped):

1) So, X is being unregistered from the OSGi service registry. That is: the ComponentImpl
for X is currently invoking m_registration.unregister() in the ComponentImpl.unregisterService()
method.

2) Then, the framework callbacks the tracker of the M component, which ends up calling the
ComponentImpl.handleEvent() method with the event for X being unregistered.
At this point, we are still in the call stack of the unregisterService for X (see step 1).
But the bug is that if we are using a threadpool, then the Component.handleRemoved() method
was always executed asynchronously in the threadpool.

3) So, while the Component.handleRemoved method (for M) is being scheduled in the threadpool,
then we return from the handleEvent method and then
the ComponentImpl.unregisterService() method returns (from step 1). So, then, at this point
the thread from (1) ends up calling the X.stop() method, but
at a point where M.unbind(X) method has not been yet been scheduled and executed in the threadpool.
So, in this case, X.stop() is called , and then, M.remove(X) is
called.

So, now, the Component.handleRemoved() method is not rescheduled in the threadpool (if possible).
see the schedule method which invokes the DispatchExecutor.execute method:

{code}
    // Try to execute the task from the current thread if the threadpool is not currently
running our queue.
    ((DispatchExecutor) exec).execute(task, false);
{code}

So, the patch will try to not schedule the Component.handleRemoved()  in the threadpool, but
instead will invoke the method synchronously.

Unfortunately, the patch is not perfect, because the DispatchExecutor won't invoke the task
synchronously if the queue for the Component is currently executing another task (like a concurrent
service ADD event). This may happen in a highly concurrent situations, and for now, I don't
see an easy solution in order to make sure M.remove(X) is called before X.stop().

For example, if during the unregistration of the X service, then at the same time another
thread is registering a Z service on which M depends on, then it is possible that
at the time the ComponentImpl for M is called in handleEvent for the REMOVED X event , then
the threadpool may possibly be currently handling the ADDED Z event.
In this case, since the state machine is single threaded, it won't be possible to handle the
REMOVED X event synchronously, and the Component.handleRemoved() method call will be
scheduled in the threadpool, which will be executed after the handling of the Z ADDED event,
but in this case, X.stop() will possibly be called before M.remove(X) is invoked.

Please notice that this can also happen even if no threadpool is used (antoher master thread
me be running the queue while we are handling the REMOVED X event).

So, this is for this reason that I also introduced last friday the timeout guard in the schedule
method (if the synchronous flag is true).
However, it does not work because it may introduce a nasty deadlock which is reproduced in
the FELIX5471_CyclicDependencyTest.java test.

So, for now, I have simply removed the usage of the timeout guard.

it means that for most situations (whether or not you are using a threadpool), the ordering
will be guaranteed.
But if there is a highly concurrent situation where some services are unregistered while others
are added from multiple distinct threads, then the ordering can't be guaranteed in all cases.
Always ensuring the ordering guarantee seems that we would have to rework the state machine
thread model with complex internal locking policy, and I'm not sure it's worth doing this.
Anyway, I will try to contact Marcel and see what he is thinking and will get back to this
issue in case some more rework has to really be done.

Now,  I would also be interested to have your opinion ? do you think what is committed is
reasonable ?

thanks a lot Jeroen;
/Pierre


> Ensure that unbound services are always handled synchronously
> -------------------------------------------------------------
>
>                 Key: FELIX-5471
>                 URL: https://issues.apache.org/jira/browse/FELIX-5471
>             Project: Felix
>          Issue Type: Bug
>          Components: Dependency Manager
>    Affects Versions: org.apache.felix.dependencymanager-r1
>            Reporter: Pierre De Rop
>            Assignee: Pierre De Rop
>             Fix For: org.apache.felix.dependencymanager-r9
>
>
> When a component loses a service dependency, it should handle the lost service synchronously.
For example, if service A loses a dependency on B (because B is being unregistered),  then
A.remove(B) should be called synchronously (when B is being unregistered from the service
registry), else the A.remove(B) callback could possibly be invoked while B is already unregistered
and stopped.
> Currently, unbound services may be handled asynchronously if DM is used in a concurrent
mode (using a threadpool). And even if no threadpool is used, the issue may happen if there
is a highly concurrent situation where services are registered/removed concurrently from multiple
threads.
> So, a patch should be done in order to ensure that a service dependency remove event
is always handled synchronously (especially if DM is used with a threadpool).
> I will provide a testcase soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message