From: "Hendley, Sam" <Sam.Hendley@sensus.com>
To: users@activemq.apache.org
Subject: Exceeding MemoryUsage causes Network Connector connections to stop
Date: Tue, 30 Dec 2014 22:46:20 +0000

Hello ActiveMQ community:

TL;DR: I now think this is really a misconfiguration on our part, but it took quite a lot of digging before we nailed the issue; I am reporting it to save others time in the future.

We are running a "store and forward network of brokers" where each broker is connected to all other brokers (a full mesh). Our applications connect only to their local broker. Under load we would occasionally see a broker simply "disappear" from the rest of the cluster, with all of its work ending up on the remaining nodes. We had trouble isolating the fault because our overall system wasn't handling the failure gracefully and was generating extra traffic, which made cause and effect difficult to trace.
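For context, a full-mesh topology like the one described is usually wired up with one networkConnector per remote peer in each broker's activemq.xml. A minimal sketch of one broker's side follows; the broker names and URIs are placeholders, not our actual configuration:

```xml
<!-- Sketch only: brokerName, hostnames, and ports are hypothetical. -->
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="broker1">
  <networkConnectors>
    <!-- One store-and-forward bridge per remote peer; duplex="true" carries
         traffic both ways so the peer does not need a mirror-image connector. -->
    <networkConnector name="to-broker2" uri="static:(tcp://broker2:61616)" duplex="true"/>
    <networkConnector name="to-broker3" uri="static:(tcp://broker3:61616)" duplex="true"/>
  </networkConnectors>
  <transportConnectors>
    <transportConnector name="openwire" uri="tcp://0.0.0.0:61616"/>
  </transportConnectors>
</broker>
```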
I set out to reproduce the failure in as small a case as I could. The result is at https://github.com/samhendley/activemq-bug-reports, where I document the experiment more fully. I wasn't able to get a 100% reproduction; the best I could do was about 50% of the runs on my machine failing. That makes me believe it is probably a race condition, but I wasn't able to find any obvious smoking guns.

In short, I found that if the overall broker MemoryUsage is exceeded (because producer flow control is off), then sometimes the network connectors between the brokers become stuck. If I enabled producer flow control or increased the configured maximum memory, the issue was no longer reproducible.

It looks like we can reconfigure our production systems to work around this problem, but should I file a bug for this? A silent failure like this is really not fun to diagnose in a large-scale system.

Sam

From the GitHub page:

Bug description:

If the configured MemoryStore limit is large enough that usage stays below 100% while the requestor application is dumping messages into the broker network, the test passes successfully. If, however, memory usage on the brokers goes above 100% (in this case peaking around 600% of 100 MB), the network connectors sometimes become "stuck". Stuck in this case means there are messages enqueued on one or both of the "server" brokers, but the messages are not being dequeued or forwarded by the network connector back to the "client" broker.

This issue doesn't happen on every run with a small memory size, but in my tests it generally failed about 50% of the times I tried running it; you may have to run it a few times before getting it to fail. On one failure, JMX showed that 417k responses had been generated on server1 but only 363k had been dequeued for transmission to the client broker. In that test run the other server had correctly handled the other 583k requests.
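For anyone hitting the same symptoms, the two settings involved are the broker-wide memory limit and per-destination producer flow control. A minimal sketch of the relevant activemq.xml fragments, assuming the 100 MB limit from the test above (everything else is a placeholder):

```xml
<!-- Sketch only: the 100 mb limit mirrors the test; names are hypothetical. -->
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="broker1">
  <systemUsage>
    <systemUsage>
      <memoryUsage>
        <!-- Broker-wide memory budget; the "stuck connector" behavior showed
             up when usage ran well past this limit with flow control off. -->
        <memoryUsage limit="100 mb"/>
      </memoryUsage>
    </systemUsage>
  </systemUsage>
  <destinationPolicy>
    <policyMap>
      <policyEntries>
        <!-- Enabling producer flow control made the issue unreproducible
             in our tests; we had it disabled in production. -->
        <policyEntry queue=">" producerFlowControl="true"/>
      </policyEntries>
    </policyMap>
  </destinationPolicy>
</broker>
```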
When it does fail, there is nothing in the log to indicate that anything is amiss. I would have expected some sort of log message indicating that the network connector had been throttled (if indeed that is what is happening). The same test run against a single broker always passes, which leads me to believe it really is a problem with the network connectors.