activemq-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stirling Chow (JIRA)" <>
Subject [jira] [Reopened] (AMQ-4160) DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge failure or continued attempts to re-connect existing bridges
Date Tue, 13 Nov 2012 22:40:12 GMT


Stirling Chow reopened AMQ-4160:

    Regression:   (was: Unit Test Broken)

Reopening this as the activeEvents map was not being cleared when the network connector was
stopped; as a result, restarting the network connector would not result in the bridges being

The updated patch contains the following changes:
- Unit test updated to include verification of handleStop
- Unit test updated to allow inclusion of the previously removed test; now verifies pre and
post patch
- Improved waitForBridge/hasBridge methods added to JmsMultipleBrokersTestSupport --- the
existing waitForBridgeFormation methods are based on NetworkConnector.bridges, which is updated
prior to connections (and thus the bridge) actually being formed --- the new waitForBridge
method waits for broker info to be exchange before declaring that a bridge has foremd
> DiscoveryNetworkConnector can lose track of active bridges, resulting in permanent bridge
failure or continued attempts to re-connect existing bridges
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>                 Key: AMQ-4160
>                 URL:
>             Project: ActiveMQ
>          Issue Type: Bug
>            Reporter: Stirling Chow
>            Assignee: Timothy Bish
>            Priority: Critical
>             Fix For: 5.8.0
>         Attachments: AMQ4160_2.patch, AMQ4160.patch,
> Symptom
> =======
> {{DiscoveryNetworkConnector}} is not thread-safe, and as a result, race conditions allow
the {{bridges}} data structure to become corrupt and not accurately represent the bridges
that exist.
> Because {{bridges}} is used to control whether a discovery event will result in a bridge
creation attempt, if it is not accurate, two results are possible:
> # A discovery event will be ignored even though its corresponding bridge is not active;
as a result, the bridge will never be established
> # A discovery event will be processed even though its corresponding bridge is active;
as a result, the bridge creation will fail (because of duplicate connections), and be indefinitely
retried, generating many WARN/ERROR log messages
> Cause
> =====
> {{DiscoveryNetworkConnector}} updates a {{bridges}} hashmap whenever a bridge is created
or removed:
> {}
> public void onServiceAdd(DiscoveryEvent event) {
> ...
>         // Should we try to connect to that URI?
>         synchronized (bridges) {
>             if( bridges.containsKey(uri) ) {
>                 if (LOG.isDebugEnabled()) {
>                     LOG.debug("Discovery agent generated a duplicate onServiceAdd event
for: "+uri );
>                 }
>                 return;
>             }
>         }
> ...
>         NetworkBridge bridge = createBridge(localTransport, remoteTransport, event);
>         try {
>             bridge.start();
>             synchronized (bridges) {
>                 bridges.put(uri, bridge);
>             }
> ...
> }
> public void onServiceRemove(DiscoveryEvent event) {
>     String url = event.getServiceName();
>     if (url != null) {
>         URI uri;
>         try {
>             uri = new URI(url);
>         } catch (URISyntaxException e) {
>             LOG.warn("Could not connect to remote URI: " + url + " due to bad URI syntax:
" + e, e);
>             return;
>         }
>         synchronized (bridges) {
>             bridges.remove(uri);
>         }
>     }
> }
> {code}
> There are a number of problems:
> # The check-and-set for adding an entry {{bridges}} is not atomic.  As a result, concurrent
attempts to add a bridge to the same URL can be allowed to proceed to bridge creation.  Only
one of the bridges will be established (the other will fail when it attempts to add duplicate
connections); however, the failure of the second bridge will result in a call to {{onServiceRemove(...)}}
that will remove the single/shared {{bridges}} entry.
> # The bridge is started before an entry is added to {{bridges}}.  Since bridges are multi-threaded,
it is possible that an exception will be handled by a different thread at some time between
{{bridge.start()}} and {{bridges.put(uri, bridge}}.  In processing this exception, the thread
will call {{onServiceRemove(...}}, which will remove the (non-existent) {{bridges}} entry.
 Subseqently, {{bridges.put(uri, bridge)}} is executed, and adds the entry to {{bridges}}
even though it is now stale and represents a failed bridge.  Subsequent attempts to restore
the bridge will be ignored, and no active bridge will be created.
> The lack of thread-safety in {{DiscoveryNetworkConnector}} is exacerbated by AMQ-4159,
which can result in many concurrent attempts to establish a bridge to the same URL.  AMQ-4159
also described how multiple calls can be made to {{onServiceRemove(...)}}, which can result
in mal-behaviour similar to the second case described above (i.e., unexpected removal of a
bridge that is active).
> Solution
> ========
> The attached patch attempts to restore some thread-safety to {{DiscoveryNetworkConnector}}
by making the check-and-set atomic and adding the entry to {{bridges}} prior to starting the
bridge.  An additional structure, {{activeEvents}}, keeps track of the event that represents
the current attempt to establish a bridge to a given remote URL --- this is used to prevent
multiple calls to {{onServiceRemove(...)}} from removing a {{bridges}} entry that was *not*
added by the corresponding discovery event.  
> This patch is a band aid to the design flaws in {{DiscoveryNetworkConnector}}, and a
refactoring of the connector should be considered.  In particular, there is a tight-coupling
between {{DiscoveryNetworkConnector}} and {{SimpleDiscoveryAgent}} that is not evident through
their interfaces.  For example, {{DiscoveryNetworkConnector}} assumes that the call to {{discoveryAgent.serviceFailed(...)}}
will result in a call back to {{DiscoveryNetworkConnector.onServiceRemove(...)}}.  The call
to {{onServiceRemove(...)}} is necessary to clean up the resources used by the bridge, but
that requirement is not explicit, and therefore easily missed.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:

View raw message