activemq-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mario Lukica (JIRA)" <j...@apache.org>
Subject [jira] Created: (AMQ-1855) bridge reconnection stops because of race in SimpleDiscoveryAgent
Date Fri, 11 Jul 2008 13:32:01 GMT
bridge reconnection stops because of race in SimpleDiscoveryAgent
-----------------------------------------------------------------

                 Key: AMQ-1855
                 URL: https://issues.apache.org/activemq/browse/AMQ-1855
             Project: ActiveMQ
          Issue Type: Bug
          Components: Connector
    Affects Versions: 4.1.2
            Reporter: Mario Lukica


I believe there is a race condition in SimpleDiscoveryAgent which can cause subsequent bridge
restart to fail, without starting new thread that should restart a bridge. As a consequence,
network bridge is never restarted.

Following scenario leads to this:
1. bridge is disconnected (e.g. local error: org.apache.activemq.transport.InactivityIOException:
Channel was inactive for too long)
2. bridge is disposed in separate thread in DemandForwardingBridge.serviceLocalException
3. SimpleDiscoveryAgent.serviceFailed is called which starts up another thread which calls
DiscoveryNetworkConnector.onServiceAdd which tries to restart bridge
4. bridge startup can cause javax.jms.InvalidClientIDException: Broker: some_broker2 - Client:
NC_some_broker1_inboundlocalhost already connected (this one is caused by race condition with
thread disposing the bridge, since given client subscription should be removed by thread disposing
the bridge (step 2)
5. this causes invocation of DemandForwardingBridge.serviceLocalException (this call can be
made asynchronously, while previous bridge startup is still in progress)

As a consequence, multiple threads can end up calling SimpleDiscoveryAgent.serviceFailed simultaneously.

serviceFailed will call DiscoveryNetworkConnector.onServiceAdd which will try to reconnect
bridge. Reconnect logic is guarded by 
if( event.failed.compareAndSet(false, true) ) 

which tries to ensure that only a single thread is reconnecting bridge at some point.
{code}
    public void serviceFailed(DiscoveryEvent devent) throws IOException {
    	
        final SimpleDiscoveryEvent event = (SimpleDiscoveryEvent) devent;
        if( event.failed.compareAndSet(false, true) ) {
        	
			listener.onServiceRemove(event);
	    	Thread thread = new Thread() {
	    		public void run() {
	
	
	    			// We detect a failed connection attempt because the service fails right
	    			// away.
	    			if( event.connectTime + minConnectTime > System.currentTimeMillis()  ) {
	    				
	    				event.connectFailures++;
	    				
	    				if( maxReconnectAttempts>0 &&  event.connectFailures >= maxReconnectAttempts
) {
	    					// Don' try to re-connect
	    					return;
	    				}
	    				
		                synchronized(sleepMutex){
		                    try{
		                    	if( !running.get() )
		                    		return;
		                    	
		                        sleepMutex.wait(event.reconnectDelay);
		                    }catch(InterruptedException ie){
                                Thread.currentThread().interrupt();
		                       return;
		                    }
		                }
	
		                if (!useExponentialBackOff) {
		                    event.reconnectDelay = initialReconnectDelay;
		                } else {
		                    // Exponential increment of reconnect delay.
		                    event.reconnectDelay*=backOffMultiplier;
		                    if(event.reconnectDelay>maxReconnectDelay)
		                        event.reconnectDelay=maxReconnectDelay;
		                }
		                
	    			} else {
	    				event.connectFailures = 0;
	                    event.reconnectDelay = initialReconnectDelay;
	    			}
	    			                    			
	            	if( !running.get() )
	            		return;
	            	
	    			event.connectTime = System.currentTimeMillis();
	    			event.failed.set(false);
	    			
	    			listener.onServiceAdd(event);
	    		}
	    	};
	    	thread.setDaemon(true);
	    	thread.start();
        }
    }
{code}

Prior to calling DiscoveryNetworkConnector.onServiceAdd, event.failed is set to false (T1),
and it's possible for some other thread (T2) to enter block guarded by if( event.failed.compareAndSet(false,
true) ) , while reconnect process has already begun by first thread. T2 can satisfy condition:
if( event.connectTime + minConnectTime > System.currentTimeMillis()  )  and will enter

sleepMutex.wait(event.reconnectDelay), but still holding event.failed == true (causing all
other calls to serviceFailed not to start thread that will reconnect bridge).

If first thread (T1) fails to reconnect bridge (e.g because of InvalidClientIDException described
in step 4), it will not schedule new thread to restart broker (and call DiscoveryNetworkConnector.onServiceRemove,
and cleanup DiscoveryNetworkConnector.bridges) because of event.failed == true, and T2 still
waiting (default 5 sec). When T2 wakes up from wait, it will try to restart broker and fail
because of following condition in DiscoveryNetworkConnector:
{code}
            if (    bridges.containsKey(uri) 
                    || localURI.equals(uri) 
                    || (connectionFilter!=null && !connectionFilter.connectTo(uri))
                    )
                return;
{code}

bridges.containsKey(uri) will be true (thread T1 added it while unsuccessfully trying to reconnect
bridge), and T2 will return from DiscoveryNetworkConnector.onServiceAdd and will not start
bridge. 
No additional attempt to reconnect bridge will be made, since T2 held event.failed == true,
effectively ignoring SimpleDiscoveryAgent.serviceFailed calls from other threads processing
local or remote bridge exceptions.

End result:
- DiscoveryNetworkConnector.bridges contains bridge that is disposed and prevents all other
attempts to restart bridge (onServiceAdd always returns because bridges.containsKey(uri) ==
true) 
- SimpleDiscoveryAgent doesn't try to reconnect the bridge (T2 was a last attempt which returned
without restarting the bridge - SimpleDiscoveryAgent.serviceFailed is not called again, since
bridge is not started

I think that synchronization of threads processing bridge exceptions and entering SimpleDiscoveryAgent.serviceFailed
should be verified and/or improved.
Also, InvalidClientIDException is relatively common (at least on multicore machines, e.g.
Solaris T2000), maybe ConduitBridge.serviceLocalException (which starts another thread doing
ServiceSupport.dispose(DemandForwardingBridgeSupport.this)), should be changed to wait a bit
for bridge disposal to finish (e.g. sleep for some time) and then try to restart a bridge
- waiting for a second more to restart a bridge is better then not to start it at all

I've seen this problem in 4.1.0 and 4.1.2, but I think it can occur in 5.1 and 5.2 trunk (SimpleDiscoveryAgent.serviceFailed
and DiscoveryNetworkConnector.onServiceAdd are more or less the same, just using ASYNC_TASKS
to execute asynchronous calls, instead of starting new threads directly.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message