From: Shan Wang
To: "users@qpid.apache.org", "dev@qpid.apache.org"
CC: "cctrieloff@redhat.com"
Date: Wed, 4 Nov 2009 15:36:58 +0000
Subject: RE: An ill borker brings down the whole cluster
In-Reply-To: <4AF18B28.8010702@redhat.com>

Hi Alan,

The whole cluster stopped responding, but qpid-tool was still able to connect to broker2 and not to broker1. Based on that, I suppose it was broker1 that became ill, and a restart of broker1 cured the whole cluster.

The full log of broker1 from 31-OCT is attached. We have now turned the log level to info+ and will apply --log-enable=debug+:cluster later.

Before the hang, many clients were sending messages to the cluster. I don't know the exact number of clients, but it is usually between 150 and 200, and the update rate was about 5-10 MB/minute. The receiver was receiving messages fine but suddenly stopped working. I believe the receiver stopped working before the sender, because after things got back to normal we could see very old messages in the receiver's log, but not the relatively recent messages committed after the problem.

The affected system carries pretty serious tasks, so I can't play with it as I wish, and I have not tried the sender/receiver example. But as my latest email said, the problem re-occurred this morning, this time with broker2.

The given link could be a similar issue, but the question is: what caused the errors in the cluster?

Regards,
Shan

-----Original Message-----
From: Alan Conway [mailto:aconway@redhat.com]
Sent: 04 November 2009 14:10
To: dev@qpid.apache.org
Cc: cctrieloff@redhat.com; users@qpid.apache.org
Subject: Re: An ill borker brings down the whole cluster

On 11/03/2009 04:41 PM, Shan Wang wrote:
> Client side we are still using 0.4, I'm not sure about the exact version, should be last version before 0.5.
> Cluster side we are using 0.5.752581-26.el5.
>
> Unfortunately I haven't got the environment to build qpid myself so I can't use latest trunk.

I'd like to try and reproduce your issue, but I need some more details:

>> On 11/03/2009 06:13 AM, Shan Wang wrote:
>>> Hi All,
>>>
>>> We have two qpid 0.5 brokers running in cluster mode on two different
>>> boxes. The cluster works fine in normal cases, ie, if broker1 is
>>> shutdown cleanly, broker2 will keep on serving clients. But today we
>>> found one broker suddenly lost response to all connected clients and
>>> admin tools. All producer and consumer clients are still connected
>>> but failed to consume any messages from the queue.

Just to clarify: did only one broker become unresponsive or did both of them become unresponsive?

>>> The command line
>>> admin tool failed with a time out error. The only error message we
>>> found is in the log of broker 1, which said this:
>>>
>>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
>>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
>>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
>>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )

Do you still have the full logs of both brokers at the time they were unresponsive?

Can you run the broker with
--log-enable=notify+ --log-enable=debug+:cluster
for future runs so we can hopefully get a bit more information about what the cluster is doing at the time of the hang?

What are your clients doing? Can you reproduce the problem using the sender and receiver examples? How many clients are running against each broker? How easy is it to reproduce the problem?

>>>
>>> After only restarted broker 1, everything starts to work again. So
>>> surprisingly it seems when one of the brokers in the cluster suffered
>>> a problem, the whole cluster just stalled, at least from the
>>> consumer's point of view ( I can't be sure if the producer was
>>> working during the down time, after back to normal, consumer did
>>> receive messages sent sometime ago ). Consumer program uses
>>> FailoverManager and AsyncSession, basically not far from the failover
>>> example in the qpid developing doc. So can anyone please tell me what
>>> the above error message means and have we seen similar problems to
>>> the cluster before?

Yes, I've seen similar problems before, but I believe them all to be fixed at this point on trunk. It might be the issue fixed by
http://svn.apache.org/viewvc?view=revision&revision=799687

If I can reproduce the problem then I can verify whether it is fixed on trunk.

Cheers,
Alan.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org
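
Since this thread asks for the full broker logs, here is a minimal sketch of how the error entries could be pulled out of such a log before posting it. It assumes every entry follows the "date time severity message" layout of the single log entry quoted in this thread; that layout is inferred from that one entry and may not hold for every line. The names LINE_RE and select_entries are hypothetical and not part of Qpid, and the second sample line is made up for illustration.

```python
import re

# Assumed layout, inferred from the one log entry quoted in this thread:
#   <date> <time> <severity> <message>
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-[A-Za-z]{3}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<severity>\w+) "
    r"(?P<message>.*)$"
)


def select_entries(lines, severities=("error",)):
    """Return (date, time, severity, message) tuples for matching severities.

    Lines that do not fit the assumed layout are skipped silently.
    """
    wanted = set(severities)
    selected = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("severity") in wanted:
            selected.append(match.group("date", "time", "severity", "message"))
    return selected


sample = [
    # Shortened copy of the entry quoted in this thread.
    "2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) "
    "channel error 157487219 on 172.27.34.201:9908-389(local): transport-busy",
    # Hypothetical entry, only here to show that other severities are dropped.
    "2009-oct-31 10:17:50 info hypothetical informational entry",
]

for entry in select_entries(sample):
    print(entry)
```

To run it against a real file, pass the open file object instead of the sample list, for example select_entries(open("broker1.log")), where broker1.log is a placeholder name for the attached log.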