From: Fraser Adams <fraser.adams@blueyonder.co.uk>
To: users@qpid.apache.org
Date: Thu, 15 Mar 2012 17:27:36 +0000
Subject: Re: C++ broker memory leak in federated set-up???

> Just btw, you can mark the queue as durable without making the message
> persistent, in which case there would be no performance penalty.

Thanks, yeah, I realise that, but you have to explicitly mark the messages as not persistent - certainly for JMS the spec says persistent should be the default, and I *thought* it was the default for the C++ clients too. It's something I'm going to look into, but the producers aren't directly under my control so I didn't want to introduce a dependency at that stage of the project. It's less of an issue now as the folks in charge of that aspect are pretty onside.
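For reference, this is roughly the shape of what I have in mind on the C++ client side - just a sketch, and the broker address and queue name are made up:

#include <qpid/messaging/Connection.h>
#include <qpid/messaging/Message.h>
#include <qpid/messaging/Sender.h>
#include <qpid/messaging/Session.h>

using namespace qpid::messaging;

int main() {
    Connection connection("localhost:5672"); // example broker address
    connection.open();
    Session session = connection.createSession();

    // Declare the queue durable, so the queue itself survives a broker restart...
    Sender sender = session.createSender(
        "example-queue; {create: always, node: {durable: True}}");

    // ...but send the message non-persistent, so it never touches the store
    // and there's no journal write on the hot path.
    Message message("some payload");
    message.setDurable(false);
    sender.send(message);

    connection.close();
    return 0;
}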
However, TBH I *really* wish the queue configuration persistence aspects and the message persistence aspects were separately configurable.

Just to make you laugh, we also just got bitten by another comedy moment: we hadn't realised there was a bug in the 0.8 Python messaging API implementation that transparently makes queues durable by default... We set up a logging client on another part of our system with a circular queue of decent size, and last weekend our operational system got totalled. What happened was that the consumer client died, and because the queue was persistent (unbeknown to us) it had the default journal settings, so it hit the threshold on the journal size rather than on the queue byte size - and that blew all our federated links. Monday wasn't a good day :-( If I didn't laugh I'd cry... Fortunately this is why queue routes were a good choice, and most of the data had backed up on the queues on the other side of the link.

>
>> So it's not all that complicated, but it's driving me nuts that when the
>> source broker is co-located with the producer we have a memory increase,
>> but when we host the source broker on a different box it seems to be
>> fine.
>
> That is very weird and my top suspicion would be that there were
> different versions of qpid on the two boxes. Can you run colocated on
> the 'other' box to rule that in or out? Or have you verified that
> they are using the same version of qpid?

Well, interestingly, that was originally the case: the producer and source broker were running 0.6 and the destination broker 0.8. I managed to persuade the folks in charge of the producer system to do a build with 0.8. One of the things I need to try is to stand up two instances of the producer and see if I get a problem if each writes to the other's broker.

As it happens, when they moved to 0.8 the throughput improved a lot, even before we fixed the network madness that forced the producer NIC into acting as 10BASE-T half duplex :-( However, with 0.8 brokers all round and a sane network we get great message throughput, but we still see the memory growing. It's a lot less pronounced than before, but as I say, if we point the producer at a broker on a different host it seems to be stable. I *assume* that there are no sneaky little optimisations going on under the hood when client and broker are located on the same host, like "let's do some memory mapping and bypass the TCP/IP stack" :-) I'm guessing not, but stranger things have happened...

>
> It's a RHEL6-only issue relating to memory allocators in glibc. If both
> boxes are 5.4 we can rule that out.

That's useful to know, though to be honest it's a pity - that explanation would at least have preserved my sanity. So presumably this really is only RHEL6? What version of glibc are we talking about - is it one that doesn't actually run on 5.4? The reason I ask is that the producer is a high-performance, near-real-time system, so they *just may* have got some up-to-date versions of things installed. They probably don't use it for any whizzy allocation, as they use their own memory-pooling mechanism to improve multithreaded performance. That has me thinking I'll check with them that they don't do any fancy LD_PRELOAD stuff to override the underlying allocator in a way that might affect other processes. I'm fairly sure they don't, but it's worth checking.
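Incidentally, the lesson we've taken from the logging queue fiasco is to spell everything out in the address rather than trusting the API defaults. Something along these lines is what I have in mind for the clients on that part of the system - again just a sketch, with a made-up queue name and byte limit:

#include <qpid/messaging/Connection.h>
#include <qpid/messaging/Sender.h>
#include <qpid/messaging/Session.h>

using namespace qpid::messaging;

int main() {
    Connection connection("localhost:5672"); // example broker address
    connection.open();
    Session session = connection.createSession();

    // Explicitly non-durable ring queue with its own byte limit, so a dead
    // consumer trips the queue policy rather than the store journal threshold.
    Sender sender = session.createSender(
        "logging-queue; {create: always, node: {durable: False, "
        "x-declare: {arguments: {'qpid.policy_type': ring, "
        "'qpid.max_size': 104857600}}}}");

    // ...then publish as before.

    connection.close();
    return 0;
}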
>
> To be honest I'm stumped, I'm afraid, and can only offer some
> suggestions on what I might do to search for any further clues...
>
> Just to confirm, you have run qpid-stat -c, qpid-stat -q and qpid-stat
> -u against a bloating broker? And everything shown there is as
> expected (not much queue depth, message counts correlating, no
> unexpected activity)?

It all *looks* as I'd expect. Clearly when we had the network problems the queue on the route was filling up and eventually circled round, but now the depth floats around one or two items.

>
> When the memory growth starts happening, if you delete and recreate
> the bridge does that have any effect on growth?

That's not something we've tried; it's worth looking into. Of course, Murphy's law has generally kicked in and made the problem most often happen at night :-/ I'm wondering if it's bad karma and I did something awful in a past life :->

>
> Is it reproducible at all with more detailed logging (ideally
> --log-enable info+ --log-enable trace+:amqp_0_10)? Obviously logs like
> that grow pretty quickly so depending on the scale of the leak that
> may not be feasible. It might give some clues though (then again it
> might not :-(). Perhaps even a short run from both co-located and
> remote cases to see if a comparison shows anything up?
>

It's worth a try. I'm still suspicious of acks, as it's the only thing I've ever seen cause qpidd to bloat in an obvious way, but as I say we're using the default route config, so these should be unreliable. I guess under the hood this *really does* use an unreliable link and not just some reasonable number for N when acknowledging?

Thanks for the pointers. Even though I'm not much closer to a solution, it's nice to have people adding their thoughts - it's going to be something really obtuse in the end. I appreciate the moral support!!

Cheers,
Frase
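P.S. One thing I might knock up to poke at the ack theory is a throwaway C++ consumer that explicitly asks for an unreliable link in the address, just to compare its memory behaviour. A sketch only - the queue name is made up, and this exercises a client-side link rather than the federation bridge itself:

#include <qpid/messaging/Connection.h>
#include <qpid/messaging/Duration.h>
#include <qpid/messaging/Message.h>
#include <qpid/messaging/Receiver.h>
#include <qpid/messaging/Session.h>

#include <iostream>

using namespace qpid::messaging;

int main() {
    Connection connection("localhost:5672"); // example broker address
    connection.open();
    Session session = connection.createSession();

    // Subscribe over an explicitly unreliable link, i.e. no per-message
    // acknowledgements flowing back to the broker.
    Receiver receiver = session.createReceiver(
        "replication-queue; {link: {reliability: unreliable}}");

    Message message;
    while (receiver.fetch(message, Duration::SECOND * 10)) {
        std::cout << message.getContent() << std::endl;
    }

    connection.close();
    return 0;
}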