activemq-issues mailing list archives

From "ASF subversion and git services (JIRA)" <>
Subject [jira] [Commented] (AMQ-6248) Failover - transport connected to one broker fails due to error in connection to another broker
Date Mon, 18 Apr 2016 13:54:25 GMT


ASF subversion and git services commented on AMQ-6248:

Commit 3560d9123dbeecc97f075a6812f15b2484836275 in activemq's branch refs/heads/master from
[;h=3560d91 ]

AMQ-6248 fix logging statement to use the connected URI.

> Failover - transport connected to one broker fails due to error in connection to another broker
> -----------------------------------------------------------------------------------------------
>                 Key: AMQ-6248
>                 URL:
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Transport
>            Reporter: Petr Janata
>            Assignee: Timothy Bish
>              Labels: failover, race-condition
>         Attachments: AMQ-6248.patch.svndiff
> There is a bug in the {{FailoverTransport}} which is triggered by a race condition. The client log contains the message:
> {{WARN | ActiveMQ Transport: *URI1* \[FailoverTransport] Transport (*URI2*) failed, attempting
to automatically reconnect}}
> The exact impact on client failover differs with each setup and environment. In our case this forced the client to switch endlessly between the two available brokers.
> Assume the client is configured with a broker URL of the form
> {{failover:(URI1,URI2)?randomize=false}}.
> Assume that the broker at URI1 is down and the other broker at URI2 is running fine. This is a normal master/slave setup.
> The client tries to establish a connection and the following happens:
> 1. URI1 is tried, it fails because this broker is not reachable (down or waiting slave)
> 2. URI2 is tried, it succeeds because this broker is currently the 'master'
> 3. Exception from thread of transport to URI1 causes failure in transport to URI2
> 4. Try another transport in the list. Oh wait, it's URI1 -> go to 1.
> The impact for other configurations might not be that severe. But unfortunately in our case we were not able to avoid this bug no matter the configuration. For example, {{randomize=true}} helped a little, but the infinite loop still happens about half of the time.
> The bug is caused by a single shared {{TransportListener}} instance, {{myTransportListener}}, in the {{FailoverTransport}} class. {{doReconnect()}} tries to start the transport to URI1 and registers the listener on it. That transport fails to start, so the next transport, to URI2, is tried. But the listener is not unregistered from the failed transport URI1. Failures that happen on transport URI1 may call the listener method {{onException()}} in that transport's own thread. This call reaches {{handleTransportFailure()}}, where it waits for the {{reconnectMutex}}. The reconnect task thread continues, establishes transport URI2, sets {{connectedTransport}}=URI2, and releases the {{reconnectMutex}}. The thread of transport URI1 then unblocks in {{handleTransportFailure()}} and destroys {{connectedTransport}}=URI2.
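
The sequence described above can be replayed in a small single-threaded toy model. This is only a sketch: `ToyTransport`, `ToyFailover`, `Listener` and `tryConnect` are hypothetical stand-ins, not the real ActiveMQ classes, and the mutex and threads are left out so that the late exception from URI1 is simply delivered after URI2 has connected.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the shared-listener bug; none of these types are real ActiveMQ classes. */
class SharedListenerDemo {

    /** Hypothetical stand-in for a transport listener. */
    interface Listener {
        void onException(Exception error);
    }

    /** Hypothetical stand-in for a transport bound to one broker URI. */
    static final class ToyTransport {
        final String uri;
        Listener listener;

        ToyTransport(String uri) {
            this.uri = uri;
        }

        /** Reports a failure of this transport to whatever listener is registered. */
        void fail(String reason) {
            if (listener != null) {
                listener.onException(new Exception(uri + ": " + reason));
            }
        }
    }

    /** Hypothetical stand-in for the failover logic. */
    static final class ToyFailover {
        ToyTransport connectedTransport;
        final List<String> log = new ArrayList<>();

        // One listener instance shared by every transport: it cannot tell
        // which transport the exception came from.
        final Listener sharedListener = error -> {
            if (connectedTransport != null) {
                log.add("Transport (" + connectedTransport.uri + ") failed");
                connectedTransport = null; // tears down whatever is connected
            }
        };

        ToyTransport tryConnect(String uri, boolean reachable) {
            ToyTransport transport = new ToyTransport(uri);
            transport.listener = sharedListener;
            if (reachable) {
                connectedTransport = transport;
            }
            // Modeled bug: the listener stays registered even if the start failed.
            return transport;
        }
    }

    /** Replays steps 1 to 3: URI1 fails, URI2 connects, URI1 reports late. */
    static ToyFailover simulate() {
        ToyFailover failover = new ToyFailover();
        ToyTransport uri1 = failover.tryConnect("URI1", false);
        failover.tryConnect("URI2", true);
        uri1.fail("connection refused"); // late exception from the dead transport
        return failover;
    }

    public static void main(String[] args) {
        ToyFailover failover = simulate();
        System.out.println(failover.connectedTransport == null
                ? "not connected"
                : "connected to " + failover.connectedTransport.uri);
        failover.log.forEach(System.out::println);
    }
}
```

In this model the healthy URI2 connection is dropped even though only URI1 reported an error, which matches the URI1 vs URI2 mismatch in the WARN line quoted earlier.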
> I have created a patch against version 5.11 that deals specifically with this problem.
> The change is that instead of the single shared {{myTransportListener}} instance, a new listener is created for each new transport.
> Each new listener keeps a reference to the transport it was assigned to. The listener will cause failover only if the exception comes from the transport that is currently connected.
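
The idea of the fix can be sketched in the same kind of toy model. This is not the attached patch: all names (`ToyTransport`, `ToyFailover`, `tryConnect`, `failovers`) are hypothetical, and the locking the real code would need around the check is omitted. Each transport gets its own listener, which remembers its transport and reacts only if that transport is the connected one.

```java
/** Toy model of per-transport listeners; none of these types are real ActiveMQ classes. */
class PerTransportListenerDemo {

    /** Hypothetical stand-in for a transport listener. */
    interface Listener {
        void onException(Exception error);
    }

    /** Hypothetical stand-in for a transport bound to one broker URI. */
    static final class ToyTransport {
        final String uri;
        Listener listener;

        ToyTransport(String uri) {
            this.uri = uri;
        }

        /** Reports a failure of this transport to its own listener. */
        void fail(String reason) {
            if (listener != null) {
                listener.onException(new Exception(uri + ": " + reason));
            }
        }
    }

    /** Hypothetical stand-in for the failover logic. */
    static final class ToyFailover {
        ToyTransport connectedTransport;
        int failovers;

        ToyTransport tryConnect(String uri, boolean reachable) {
            ToyTransport transport = new ToyTransport(uri);
            // A fresh listener per transport, holding a reference to that transport.
            transport.listener = error -> {
                // React only when the failing transport is the connected one.
                if (transport == connectedTransport) {
                    connectedTransport = null;
                    failovers++;
                }
            };
            if (reachable) {
                connectedTransport = transport;
            }
            return transport;
        }
    }

    /** URI1 fails to start, URI2 connects, then URI1 reports a late exception. */
    static ToyFailover simulate() {
        ToyFailover failover = new ToyFailover();
        ToyTransport uri1 = failover.tryConnect("URI1", false);
        failover.tryConnect("URI2", true);
        uri1.fail("connection refused"); // ignored: URI1 is not the connected transport
        return failover;
    }

    public static void main(String[] args) {
        ToyFailover failover = simulate();
        System.out.println("connected to " + failover.connectedTransport.uri);
        failover.connectedTransport.fail("broken pipe"); // a genuine failure of URI2
        System.out.println("failovers triggered: " + failover.failovers);
    }
}
```

With this check the stale exception from URI1 leaves the URI2 connection alone, while a failure of URI2 itself still triggers failover.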
> I didn't care about the other methods of the listener, but these probably need the same
> This bug is present in all versions from version 4.0 (I didn't go deeper). The idea in
the patch should be applicable for all versions.
> Btw., the log message mentioned in AMQ-4986 contains the same URI1 vs URI2 problem.

This message was sent by Atlassian JIRA
