Mailing-List: contact dev-help@felix.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@felix.apache.org
Date: Mon, 2 Jan 2017 08:20:58 +0000 (UTC)
From: "Pierre De Rop (JIRA)" <jira@apache.org>
To: dev@felix.apache.org
Message-ID: <JIRA.13031278.1483107051000.623305.1483345258350@Atlassian.JIRA>
In-Reply-To: <JIRA.13031278.1483107051000@Atlassian.JIRA>
References: <JIRA.13031278.1483107051000@Atlassian.JIRA> <JIRA.13031278.1483107051187@arcas>
Subject: [jira] [Commented] (FELIX-5471) Ensure that unbound services are
 always handled synchronously
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Mon, 02 Jan 2017 08:21:05 -0000


    [ https://issues.apache.org/jira/browse/FELIX-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792465#comment-15792465 ] 

Pierre De Rop commented on FELIX-5471:
--------------------------------------

Hello Jeroen;

The timeout I introduced last friday was because I wanted to avoid a corner case where lost dependencies could be stopped before calling "removed" callbacks (I will explain this below).
But to be honest, I should not have added this timeout because I realize that it may generate some deadlocks. I have added the FELIX5471_CyclicDependencyTest.java which demonstrates the deadlock:
So, running this test with the timeout guard in the ComponentImpl.schedule method (ComponentImpl.java, in revision 1776641) produces a deadlock, and after 30 seconds, we see this message log:

<code>
WARN - pool-3-thread-2 : [pool-3-thread-2] task could not be scheduled timely in component org.apache.felix.dm.itest.api.FELIX5471_CyclidDependencyTest$A (exception: java.util.concurrent.TimeoutException)
WARN - pool-3-thread-1 : [pool-3-thread-1] task could not be scheduled timely in component org.apache.felix.dm.itest.api.FELIX5471_CyclidDependencyTest$B (exception: java.util.concurrent.TimeoutException)
</code>

So, for now, I have committed a new patch (in revision 1776903.) which includes the FELIX5471_CyclicDependencyTest.java test. The patch does not use the timeout in the ComponentImpl.schedule method anymore.

Now, actually, the real patch which (partially) resolves the issue you found is that now, when we are
handling a REMOVED service event, then if a threadpool is used we try to not reschedule the REMOVED event it it and we try to call the handleRemoved method synchronously. Initially, the problem was that if a threadpool was used, then the REMOVED event was always scheduled in it, and the event was always handled asynchronously.

So, let me try to explain deeper what was going on using the scenario you gave in the users mailing list (that is: M depends on X, and M.remove(X) is called but at this point, X is already stopped):

1) So, X is being unregistered from the OSGi service registry. That is: the ComponentImpl for X is currently invoking m_registration.unregister() in the ComponentImpl.unregisterService() method.

2) Then, the framework callbacks the tracker of the M component, which ends up calling the ComponentImpl.handleEvent() method with the event for X being unregistered.
At this point, we are still in the call stack of the unregisterService for X (see step 1).
But the bug is that if we are using a threadpool, then the Component.handleRemoved() method was always executed asynchronously in the threadpool.

3) So, while the Component.handleRemoved method (for M) is being scheduled in the threadpool, then we return from the handleEvent method and then
the ComponentImpl.unregisterService() method returns (from step 1). So, then, at this point the thread from (1) ends up calling the X.stop() method, but
at a point where M.unbind(X) method has not been yet been scheduled and executed in the threadpool. So, in this case, X.stop() is called , and then, M.remove(X) is
called.

So, now, the Component.handleRemoved() method is not rescheduled in the threadpool (if possible). see the schedule method which invokes the DispatchExecutor.execute method:

{code}
    // Try to execute the task from the current thread if the threadpool is not currently running our queue.
    ((DispatchExecutor) exec).execute(task, false);
{code}

So, the patch will try to not schedule the Component.handleRemoved()  in the threadpool, but instead will invoke the method synchronously.

Unfortunately, the patch is not perfect, because the DispatchExecutor won't invoke the task synchronously if the queue for the Component is currently executing another task (like a concurrent service ADD event). This may happen in a highly concurrent situations, and for now, I don't see an easy solution in order to make sure M.remove(X) is called before X.stop().

For example, if during the unregistration of the X service, then at the same time another thread is registering a Z service on which M depends on, then it is possible that
at the time the ComponentImpl for M is called in handleEvent for the REMOVED X event , then the threadpool may possibly be currently handling the ADDED Z event.
In this case, since the state machine is single threaded, it won't be possible to handle the REMOVED X event synchronously, and the Component.handleRemoved() method call will be
scheduled in the threadpool, which will be executed after the handling of the Z ADDED event, but in this case, X.stop() will possibly be called before M.remove(X) is invoked.

Please notice that this can also happen even if no threadpool is used (antoher master thread me be running the queue while we are handling the REMOVED X event).

So, this is for this reason that I also introduced last friday the timeout guard in the schedule method (if the synchronous flag is true).
However, it does not work because it may introduce a nasty deadlock which is reproduced in the FELIX5471_CyclicDependencyTest.java test.

So, for now, I have simply removed the usage of the timeout guard.

it means that for most situations (whether or not you are using a threadpool), the ordering will be guaranteed.
But if there is a highly concurrent situation where some services are unregistered while others are added from multiple distinct threads, then the ordering can't be guaranteed in all cases.
Always ensuring the ordering guarantee seems that we would have to rework the state machine thread model with complex internal locking policy, and I'm not sure it's worth doing this.
Anyway, I will try to contact Marcel and see what he is thinking and will get back to this issue in case some more rework has to really be done.

Now,  I would also be interested to have your opinion ? do you think what is committed is reasonable ?

thanks a lot Jeroen;
/Pierre


> Ensure that unbound services are always handled synchronously
> -------------------------------------------------------------
>
>                 Key: FELIX-5471
>                 URL: https://issues.apache.org/jira/browse/FELIX-5471
>             Project: Felix
>          Issue Type: Bug
>          Components: Dependency Manager
>    Affects Versions: org.apache.felix.dependencymanager-r1
>            Reporter: Pierre De Rop
>            Assignee: Pierre De Rop
>             Fix For: org.apache.felix.dependencymanager-r9
>
>
> When a component loses a service dependency, it should handle the lost service synchronously. For example, if service A loses a dependency on B (because B is being unregistered),  then A.remove(B) should be called synchronously (when B is being unregistered from the service registry), else the A.remove(B) callback could possibly be invoked while B is already unregistered and stopped.
> Currently, unbound services may be handled asynchronously if DM is used in a concurrent mode (using a threadpool). And even if no threadpool is used, the issue may happen if there is a highly concurrent situation where services are registered/removed concurrently from multiple threads.
> So, a patch should be done in order to ensure that a service dependency remove event is always handled synchronously (especially if DM is used with a threadpool).
> I will provide a testcase soon.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)