activemq-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hendley, Sam" <Sam.Hend...@sensus.com>
Subject Faster detection of failure on network connections in "cable pull" scenario
Date Fri, 20 Feb 2015 20:43:09 GMT
We are doing "break testing" on our clustered deployment and are running into an issue with
activemq being slow to notice a node that has been "cable pulled". We are seeking advice on
how to best configure the connections between the brokers to notice this error quickly.

We have a producer box and two consumer boxes connected with a network connector with the
below options. Our producer is enqueing messages to a distributed queue that is serviced by
both of the consumer boxes. Normally traffic is round-robbined between the two consumer boxes
correctly. In all of these tests we are producing messages at a constant rate.

Prodcuer configuration:
<networkConnector name="producer-to-consumer1" uri="static:tcp://consumer1:61616?soTimeout=2000&amp;soWriteTimeout=2000&amp;wireFormat.maxInactivityDuration=3000"
/>
<networkConnector name="producer-to-consumer2" uri="static:tcp://consumer2:61616?soTimeout=2000&amp;soWriteTimeout=2000&amp;wireFormat.maxInactivityDuration=3000"
/>

Consumer configuration:
<networkConnector name="consumer-to-produce" uri="static:tcp://producerbox:61616?soTimeout=2000&amp;soWriteTimeout=2000&amp;wireFormat.maxInactivityDuration=3000"
/>

If we do a "cable pull" on one of the consumer boxes we it can take many minutes before the
broker notices that our connection is down. Eventually we did get a failure from the writeTimeoutFilter
but instead of after two seconds like expected, we didn't get the failure for nearly 5 minutes.
When it finally does trip all of the messages that had been enqueued for the bad consumer
are correctly resent to the good consumer and all the future traffic is switched over to the
good consumer.

Below is the log from the producer side. We pulled the cable at 14:20:36 and we get the expected
inactivity failure error on the "reverse bridge" from the consumer broker came a few seconds
later. Our "forward bridge" doesn't fail for around 5 minutes.

2015-02-20 14:20:41,798 | WARN  | Transport Connection to: tcp://10.22.19.206:44208 failed:
org.apache.activemq.transport.InactivityIOException: Channel was inactive for too (>3000)
long: tcp://10.22.19.206:44208 | org.apache.activemq.broker.TransportConnection.Transport
| ActiveMQ InactivityMonitor Worker
2015-02-20 14:25:15,276 | WARN  | Forced write timeout for:tcp://10.22.19.206:61616 | org.apache.activemq.transport.WriteTimeoutFilter
| WriteTimeoutFilter-Timeout-1
2015-02-20 14:25:15,278 | WARN  | Caught an exception processing local command | org.apache.activemq.network.DemandForwardingBridgeSupport
| ActiveMQ BrokerService[scaleha-gw2] Task-37
java.net.SocketException: Socket closed
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:121)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.activemq.transport.tcp.TcpBufferedOutputStream.flush(TcpBufferedOutputStream.java:115)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.activemq.transport.tcp.TcpTransport.oneway(TcpTransport.java:176)
        at org.apache.activemq.transport.AbstractInactivityMonitor.doOnewaySend(AbstractInactivityMonitor.java:304)
        at org.apache.activemq.transport.AbstractInactivityMonitor.oneway(AbstractInactivityMonitor.java:286)
        at org.apache.activemq.transport.TransportFilter.oneway(TransportFilter.java:85)
        at org.apache.activemq.transport.WireFormatNegotiator.oneway(WireFormatNegotiator.java:104)
        at org.apache.activemq.transport.TransportFilter.oneway(TransportFilter.java:85)
        at org.apache.activemq.transport.WriteTimeoutFilter.oneway(WriteTimeoutFilter.java:73)
        at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
        at org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:81)
        at org.apache.activemq.network.DemandForwardingBridgeSupport.serviceLocalCommand(DemandForwardingBridgeSupport.java:997)
        at org.apache.activemq.network.DemandForwardingBridgeSupport$2.onCommand(DemandForwardingBridgeSupport.java:206)
        at org.apache.activemq.transport.ResponseCorrelator.onCommand(ResponseCorrelator.java:116)
        at org.apache.activemq.transport.MutexTransport.onCommand(MutexTransport.java:50)
        at org.apache.activemq.transport.vm.VMTransport.doDispatch(VMTransport.java:138)
        at org.apache.activemq.transport.vm.VMTransport.dispatch(VMTransport.java:130)
        at org.apache.activemq.transport.vm.VMTransport.oneway(VMTransport.java:107)
        at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
        at org.apache.activemq.transport.ResponseCorrelator.oneway(ResponseCorrelator.java:60)
        at org.apache.activemq.broker.TransportConnection.dispatch(TransportConnection.java:1370)
        at org.apache.activemq.broker.TransportConnection.processDispatch(TransportConnection.java:889)
        at org.apache.activemq.broker.TransportConnection.iterate(TransportConnection.java:935)
        at org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:133)
        at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:48)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

I believe I know what is occurring but we don't know how to fix it. What we believe is occurring
is the soWriteTimeout filter isn't firing because each socket write is actually succeeding
and returning quickly but the data is being queued into the linux socket caches. The actual
write to the socket doesn't block until we fill those buffers. We validated this by watching
netstat as we ran the test and saw that the send-q slowly filled up as we ran traffic and
then once it stopped growing the socket write timeout tripped a few seconds later.

tcp        0        0 10.22.19.205:39466          10.22.19.206:61616          ESTABLISHED
  - // normal
tcp        0     7653 10.22.19.205:39466          10.22.19.206:61616          ESTABLISHED
  - // early after cable pull
tcp        0    53400 10.22.19.205:39466          10.22.19.206:61616          ESTABLISHED
  - // two minutes after cable pull
tcp        0   129489 10.22.19.205:39466          10.22.19.206:61616          FIN_WAIT1  
  - // final state after writeTimeout fires

I tried to rerun this same experiment without any messages flowing and the detection took
15 minutes. I think this took much longer because the number of bytes being written to the
socket was much lower so we didn't fill the buffer as quickly.

I was expecting the InactivityMonitorTImeout to fire to protect us from this case but I think
it considers the connection to be active every time it dispatches a message to that consumer
so that timer never fires either.

Normally this is handled with an application level timeout. Presumabley in this case we should
be waiting for the acknowledgement of receipt of the message from the other broker. The stats
appear to show the messages as dispatched but not dequeued while we keep the consumer box
off the network. I was expecting that the inactivity timer was looking the successful completion
of a message, not just it's send.

Is there a setting somewhere I am missing? It seems like this should be a relatively common
failure mode, maybe other people have enough data flowing that they fill those buffers incredibly
quickly? We have investigated using keepAlive but in general those timeouts are still too
slow for our purposes.

Sam Hendley


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message