Subject: Re: Faster detection of failure on network connections in "cable pull" scenario
From: Tim Bain
To: ActiveMQ Users (users@activemq.apache.org)
Date: Sun, 22 Feb 2015 16:17:28 -0700

It sounds to me like what's going on is that the inactivity interval is
tripping but either 1) the inactivity exception doesn't trigger the bridge to
try to re-establish the connection, or 2) it tries to re-establish the
connection but the connection establishment attempt's send() succeeds (going
into the OS socket buffer but not onto the network card) and nothing notices
that it's been more than an acceptable amount of time without a response from
the remote broker. Can you try to determine which of those it is by stepping
through the code with a debugger?
org.apache.activemq.transport.AbstractInactivityMonitor.readCheck() is where
you would want the breakpoint, in the Runnable's call to onException().

Also, as far as I can tell, the inactivity interval algorithm correctly
considers only reads to be an indication of connection liveness, so unless
I'm missing something in the code of AbstractInactivityMonitor.readCheck(),
I don't think your theory that the inactivity monitor isn't firing is right.
(After all, the log you posted shows that it does in fact fire as expected at
2015-02-20 14:20:41,798; the problem is simply what happens after that.)
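Roughly, the read-check logic works like this (a simplified, hypothetical sketch for illustration, not the actual AbstractInactivityMonitor code; the class and method names here are made up):

```java
import java.util.function.LongSupplier;

// Simplified sketch of a read-based inactivity check: only inbound reads
// reset the timer; outbound writes deliberately do not, so a bridge that
// keeps send()ing into a dead socket still trips the inactivity interval.
// Hypothetical illustration, NOT the real ActiveMQ implementation.
class ReadCheck {
    private final long maxInactivityMs;
    private final LongSupplier clock;   // injected so the logic is testable
    private long lastReadMs;

    ReadCheck(long maxInactivityMs, LongSupplier clock) {
        this.maxInactivityMs = maxInactivityMs;
        this.clock = clock;
        this.lastReadMs = clock.getAsLong();
    }

    void onCommandRead()    { lastReadMs = clock.getAsLong(); }
    void onCommandWritten() { /* intentionally NOT counted as liveness */ }

    // The periodic check; when this returns true, the monitor's Runnable
    // would call onException() on the transport.
    boolean readCheckFailed() {
        return clock.getAsLong() - lastReadMs > maxInactivityMs;
    }
}
```

The point of the sketch is the asymmetry: writes never reset the clock, which is why the inactivity monitor fires even while the bridge is still successfully handing bytes to the OS.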
I'm not at all clear what you meant by "I was expecting that the inactivity
timer was looking the successful completion of a message"; if what you meant
wasn't addressed by what I wrote above and is still relevant, please clarify
the question.

Tim

On Fri, Feb 20, 2015 at 1:43 PM, Hendley, Sam wrote:

> We are doing "break testing" on our clustered deployment and are running
> into an issue with ActiveMQ being slow to notice a node that has been
> "cable pulled". We are seeking advice on how to best configure the
> connections between the brokers to notice this error quickly.
>
> We have a producer box and two consumer boxes connected with a network
> connector with the options below. Our producer is enqueuing messages to a
> distributed queue that is serviced by both of the consumer boxes. Normally
> traffic is round-robined between the two consumer boxes correctly. In all
> of these tests we are producing messages at a constant rate.
>
> Producer configuration:
>
> <networkConnector
>   uri="static:tcp://consumer1:61616?soTimeout=2000&soWriteTimeout=2000&wireFormat.maxInactivityDuration=3000"
> />
> <networkConnector
>   uri="static:tcp://consumer2:61616?soTimeout=2000&soWriteTimeout=2000&wireFormat.maxInactivityDuration=3000"
> />
>
> Consumer configuration:
>
> <networkConnector
>   uri="static:tcp://producerbox:61616?soTimeout=2000&soWriteTimeout=2000&wireFormat.maxInactivityDuration=3000"
> />
>
> If we do a "cable pull" on one of the consumer boxes, it can take many
> minutes before the broker notices that our connection is down. Eventually
> we did get a failure from the writeTimeoutFilter, but instead of after two
> seconds as expected, we didn't get the failure for nearly 5 minutes. When
> it finally does trip, all of the messages that had been enqueued for the
> bad consumer are correctly resent to the good consumer and all the future
> traffic is switched over to the good consumer.
>
> Below is the log from the producer side.
> We pulled the cable at 14:20:36, and the expected inactivity failure on
> the "reverse bridge" from the consumer broker came a few seconds later.
> Our "forward bridge" doesn't fail for around 5 minutes.
>
> 2015-02-20 14:20:41,798 | WARN | Transport Connection to: tcp://10.22.19.206:44208 failed:
> org.apache.activemq.transport.InactivityIOException: Channel was inactive
> for too (>3000) long: tcp://10.22.19.206:44208 |
> org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ
> InactivityMonitor Worker
> 2015-02-20 14:25:15,276 | WARN | Forced write timeout for:tcp://10.22.19.206:61616 |
> org.apache.activemq.transport.WriteTimeoutFilter |
> WriteTimeoutFilter-Timeout-1
> 2015-02-20 14:25:15,278 | WARN | Caught an exception processing local
> command | org.apache.activemq.network.DemandForwardingBridgeSupport |
> ActiveMQ BrokerService[scaleha-gw2] Task-37
> java.net.SocketException: Socket closed
>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:121)
>     at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
>     at org.apache.activemq.transport.tcp.TcpBufferedOutputStream.flush(TcpBufferedOutputStream.java:115)
>     at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>     at org.apache.activemq.transport.tcp.TcpTransport.oneway(TcpTransport.java:176)
>     at org.apache.activemq.transport.AbstractInactivityMonitor.doOnewaySend(AbstractInactivityMonitor.java:304)
>     at org.apache.activemq.transport.AbstractInactivityMonitor.oneway(AbstractInactivityMonitor.java:286)
>     at org.apache.activemq.transport.TransportFilter.oneway(TransportFilter.java:85)
>     at org.apache.activemq.transport.WireFormatNegotiator.oneway(WireFormatNegotiator.java:104)
>     at org.apache.activemq.transport.TransportFilter.oneway(TransportFilter.java:85)
>     at org.apache.activemq.transport.WriteTimeoutFilter.oneway(WriteTimeoutFilter.java:73)
>     at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
>     at org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:81)
>     at org.apache.activemq.network.DemandForwardingBridgeSupport.serviceLocalCommand(DemandForwardingBridgeSupport.java:997)
>     at org.apache.activemq.network.DemandForwardingBridgeSupport$2.onCommand(DemandForwardingBridgeSupport.java:206)
>     at org.apache.activemq.transport.ResponseCorrelator.onCommand(ResponseCorrelator.java:116)
>     at org.apache.activemq.transport.MutexTransport.onCommand(MutexTransport.java:50)
>     at org.apache.activemq.transport.vm.VMTransport.doDispatch(VMTransport.java:138)
>     at org.apache.activemq.transport.vm.VMTransport.dispatch(VMTransport.java:130)
>     at org.apache.activemq.transport.vm.VMTransport.oneway(VMTransport.java:107)
>     at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
>     at org.apache.activemq.transport.ResponseCorrelator.oneway(ResponseCorrelator.java:60)
>     at org.apache.activemq.broker.TransportConnection.dispatch(TransportConnection.java:1370)
>     at org.apache.activemq.broker.TransportConnection.processDispatch(TransportConnection.java:889)
>     at org.apache.activemq.broker.TransportConnection.iterate(TransportConnection.java:935)
>     at org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:133)
>     at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:48)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
>
> I believe I know what is occurring, but we don't know how to fix it. What
> we believe is happening is that the soWriteTimeout filter isn't firing
> because each socket write is actually succeeding and returning quickly,
> but the data is being queued into the Linux socket buffers. The actual
> write to the socket doesn't block until we fill those buffers.
> We validated this by watching netstat as we ran the test, and saw that the
> send-q slowly filled up as we ran traffic; once it stopped growing, the
> socket write timeout tripped a few seconds later.
>
> tcp    0      0  10.22.19.205:39466  10.22.19.206:61616  ESTABLISHED  -  // normal
> tcp    0   7653  10.22.19.205:39466  10.22.19.206:61616  ESTABLISHED  -  // early after cable pull
> tcp    0  53400  10.22.19.205:39466  10.22.19.206:61616  ESTABLISHED  -  // two minutes after cable pull
> tcp    0 129489  10.22.19.205:39466  10.22.19.206:61616  FIN_WAIT1    -  // final state after writeTimeout fires
>
> I tried to rerun this same experiment without any messages flowing, and
> the detection took 15 minutes. I think this took much longer because the
> number of bytes being written to the socket was much lower, so we didn't
> fill the buffer as quickly.
>
> I was expecting the InactivityMonitor timeout to fire to protect us from
> this case, but I think it considers the connection to be active every time
> it dispatches a message to that consumer, so that timer never fires either.
>
> Normally this is handled with an application-level timeout. Presumably in
> this case we should be waiting for the acknowledgement of receipt of the
> message from the other broker. The stats appear to show the messages as
> dispatched but not dequeued while we keep the consumer box off the network.
> I was expecting that the inactivity timer was looking the successful
> completion of a message, not just its send.
>
> Is there a setting somewhere I am missing? It seems like this should be a
> relatively common failure mode; maybe other people have enough data
> flowing that they fill those buffers incredibly quickly? We have
> investigated using keepAlive, but in general those timeouts are still too
> slow for our purposes.
>
> Sam Hendley
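The kernel-buffering behavior Sam describes can be reproduced outside ActiveMQ with a short standalone sketch (hypothetical demo code, not part of the broker; the class name is made up). A blocking write() returns as soon as the bytes fit in the send buffer, even though the peer never reads anything, which is exactly why a write-timeout filter sees nothing wrong until the buffer fills:

```java
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Demonstrates that a blocking socket write() returns immediately once
// the data fits in the kernel buffers, even if the peer never reads or
// acknowledges it. Hypothetical standalone demo, NOT ActiveMQ code.
public class BufferedWriteDemo {
    public static long millisForUnackedWrite() throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            // The accepted side never reads, simulating a dead peer.
            OutputStream out = client.getOutputStream();
            long start = System.nanoTime();
            out.write(new byte[8 * 1024]);   // well under the buffer size
            out.flush();                      // still returns right away
            return (System.nanoTime() - start) / 1_000_000;
        }
    }
}
```

The write completes in milliseconds with no error; only once enough unacknowledged bytes pile up (the growing send-q in the netstat output above) does a subsequent write actually block and give soWriteTimeout something to time out on.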