cxf-issues mailing list archives

From "Amichai Rothman (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DOSGI-173) unregistering an exported service does not remove it from zookeeper (and remote clients)
Date Thu, 18 Apr 2013 14:51:17 GMT
Amichai Rothman created DOSGI-173:
-------------------------------------

             Summary: unregistering an exported service does not remove it from zookeeper (and remote clients)
                 Key: DOSGI-173
                 URL: https://issues.apache.org/jira/browse/DOSGI-173
             Project: CXF Distributed OSGi
          Issue Type: Bug
    Affects Versions: 1.5
            Reporter: Amichai Rothman


I have some bundles exporting and consuming services, running on two hosts. More than once
I've noticed that while stopping and starting different bundles on the two hosts (just playing
around with them manually to see how robust the distributed system is), at some point one
of the hosts doesn't notice that a service it was using from the other host has gone down.
Connecting to ZooKeeper directly, I can see that the node for that service is still there,
i.e. the service was not properly removed from ZK even though the bundle is stopped and the
service is gone.
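
Something like the following can be used to confirm the stale node directly. It is only a
rough sketch, not part of the report itself: it assumes the default discovery root
/osgi/service_registry and a ZooKeeper server on localhost:2181, both of which may differ
in a given setup.

import org.apache.zookeeper.ZooKeeper;

public class ListDosgiEndpoints {

    public static void main(String[] args) throws Exception {
        // connect string and session timeout are assumptions - adjust for your setup
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        try {
            printTree(zk, "/osgi/service_registry"); // assumed default discovery root
        } finally {
            zk.close();
        }
    }

    // Recursively print the znode tree; a stale endpoint shows up as a node
    // that is still present after its exporting bundle has been stopped.
    private static void printTree(ZooKeeper zk, String path) throws Exception {
        System.out.println(path);
        for (String child : zk.getChildren(path, false)) {
            printTree(zk, "/".equals(path) ? "/" + child : path + "/" + child);
        }
    }
}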

Investigating this is a bit tricky, since it involves various trackers, endpoint listeners
and service listeners, and there is not enough code documentation to understand what the intended
flow is... However, I've made a few related observations that may point to the solution:

1. Following the logs and some debugging, it appears that the problem is not with the discovery.zookeeper
package/bundle itself, since the endpoint removed event never gets there.

2. In EndpointListenerNotifier.notifyListenersOfRemoval(), the EndpointDescription appears
to be null, so there is never a filter match and the endpointRemoved callback is never triggered
on the EndpointListeners. This is because all of the ExportRegistrations are already closed
by the time they get there. The premature closing seems to be done by the service tracker
created in ExportRegistrationImpl.startServiceTracker(). My guess is that the order in which
the service tracker and the service listener (in TopologyManagerExport, which triggers the
EndpointListenerNotifier) receive the events is arbitrary and depends on a race condition
somewhere, which may explain why the bug reproduces so inconsistently. I would like to say
that the solution is to get rid of the service tracker altogether (it doesn't do anything
else and, as a separate bug, is never closed), but I'm not sure why it was introduced in the
first place or whether there are other scenarios in which it is necessary, so I really don't
know what the proper solution should be.
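
To make the failure mode concrete: the removal notification boils down to matching the
endpoint's properties against each listener's scope filter, so with a null EndpointDescription
there is nothing to match and the callback silently never fires. The snippet below is a
simplified, hypothetical illustration of that pattern - it is not the actual CXF DOSGi code;
only EndpointDescription and EndpointListener are real OSGi Remote Service Admin types.

import org.osgi.service.remoteserviceadmin.EndpointDescription;
import org.osgi.service.remoteserviceadmin.EndpointListener;

public final class RemovalNotificationSketch {

    // Hypothetical helper: notify a single listener about a single removed endpoint.
    static void notifyRemoval(EndpointListener listener, String scopeFilter,
                              EndpointDescription endpoint) {
        if (endpoint == null) {
            // The situation described above: the ExportRegistration is already
            // closed, its description is gone, and the removal is silently dropped.
            return;
        }
        if (endpoint.matches(scopeFilter)) {
            listener.endpointRemoved(endpoint, scopeFilter);
        }
    }
}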

3. Another element that may have been masking this bug to some degree is the local discovery
bundle, which was running; during debugging I saw it trigger some EndpointListener removal
events which were picked up by the other components. I'm not entirely sure yet what this
bundle does (I didn't find any mention of it on the website and haven't gotten to the code
yet), but for now I just leave it in the stopped state, with no visible effect on the testing,
which makes debugging easier.

4. An additional related issue, which bugged me during a previous code review, is that InterfaceMonitorManager.addInterest()
closes and recreates an InterfaceMonitor every time it is invoked with an existing scope,
even though the old and new IMs monitor the same ZK node and are practically identical - so
why not just leave the old monitor running? This replacement causes a bunch of unnecessary
extra work (including several ZK server accesses), a flurry of unnecessary filter-matching
logs, and an unnecessary gap in monitoring for ZK changes. It also relates to the bug at
hand, since InterfaceMonitor.close() sends EndpointListener notifications about the endpoints
being removed, which leaves gaps in the registration coverage (before they are re-added moments
later) and might interact in some other unpredictable (at least to me) way with the rest of
the mechanism. These IM close/start cycles sometimes occur tens of times in a row.
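
The "leave the old monitor running" idea boils down to roughly the pattern below: keep one
monitor per scope and only create and start it the first time that scope is seen. This is
just a sketch of the idea, not a patch - Monitor and createMonitor() are hypothetical
stand-ins for the actual InterfaceMonitor and its construction.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public final class MonitorReuseSketch {

    // Hypothetical stand-in for InterfaceMonitor.
    interface Monitor {
        void start();
        void close();
    }

    private final ConcurrentMap<String, Monitor> monitors = new ConcurrentHashMap<>();

    // Reuse an existing monitor for a known scope instead of closing and
    // recreating one that watches the same ZooKeeper node.
    Monitor addInterest(String scope) {
        return monitors.computeIfAbsent(scope, s -> {
            Monitor m = createMonitor(s);
            m.start();
            return m;
        });
    }

    // Close the monitor only when interest in the scope is dropped for good.
    void removeInterest(String scope) {
        Monitor m = monitors.remove(scope);
        if (m != null) {
            m.close();
        }
    }

    // Hypothetical factory; the real code would create an InterfaceMonitor
    // watching the ZooKeeper node that corresponds to the scope.
    private Monitor createMonitor(String scope) {
        return new Monitor() {
            public void start() { }
            public void close() { }
        };
    }
}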

To sum it up, there's definitely a bug here. When I tested with fixes for both potential
causes above (the IM stop/start replaced with a single start the first time a given scope
is encountered, and the close invocation in the service tracker removed), I could no longer
reproduce the bug, but I don't understand all the component interactions well enough to know
whether there are any side effects, or why they were implemented this way in the first place
(I tend to assume there was a good reason which I'm unaware of).


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
