qpid-dev mailing list archives

From Alan Conway <acon...@redhat.com>
Subject Re: An ill broker brings down the whole cluster
Date Wed, 04 Nov 2009 14:09:44 GMT
On 11/03/2009 04:41 PM, Shan Wang wrote:
> Client side we are still using 0.4, I'm not sure about the exact version, should be last
> version before 0.5.
> Cluster side we are using 0.5.752581-26.el5.
>
> Unfortunately I haven't got the environment to build qpid myself so I can't use latest
> trunk.

I'd like to try to reproduce your issue, but I need some more details:

>> On 11/03/2009 06:13 AM, Shan Wang wrote:
>>> Hi All,
>>>
>>> We have two qpid 0.5 brokers running in cluster mode on two different
>>> boxes. The cluster works fine in normal cases, ie, if broker1 is
>>> shutdown cleanly, broker2 will keep on serving clients. But today we
>>> found one broker suddenly lost response to all connected clients and
>>> admin tools. All producer and consumer clients are still connected
>>> but failed to consume any messages from the queue.

Just to clarify: did only one broker become unresponsive or did both of them 
become unresponsive?

>>> The command line
>>> admin tool failed with a time out error. The only error message we
>>> found is in the log of broker 1, which said this:
>>>
>>> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
>>> error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
>>> Channel 1 already attached to guest@QPID.amq.failover676a76fa-56
>>> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
>>> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )

Do you still have the full logs of both brokers at the time they were 
unresponsive? Can you run the broker with

  --log-enable=notice+ --log-enable=debug+:cluster

for future runs so we can hopefully get a bit more information about what the 
cluster is doing at the time of the hang?
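
With debug logging on, the log files get large. Here is a minimal Python sketch
for pulling out just the error entries. It assumes every entry starts with a
"date time level" prefix like the line you quoted above; the regex and the
error_entries name are only illustrative and not part of qpid:

```python
import re

# Assumed entry prefix, inferred from the one quoted log line:
#   "2009-oct-31 10:17:49 error <message>"
ENTRY = re.compile(
    r"^(?P<date>\d{4}-[A-Za-z]{3}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[a-z]+) (?P<message>.*)$"
)


def error_entries(lines):
    """Yield (date, time, message) for every entry logged at level 'error'.

    A line without the timestamp prefix is treated as a continuation of
    the previous entry, in case long messages have been wrapped.
    """
    current = None  # [level, date, time, message] of the entry being built
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            # A new entry starts here, so flush the previous one first.
            if current and current[0] == "error":
                yield current[1], current[2], current[3]
            current = [match["level"], match["date"],
                       match["time"], match["message"]]
        elif current and line.strip():
            current[3] += " " + line.strip()
    # Flush the last entry in the file.
    if current and current[0] == "error":
        yield current[1], current[2], current[3]
```

For example, iterating over error_entries(open("broker1.log")) and printing
each tuple, where broker1.log is a placeholder file name, would list only the
error entries from that log.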

What are your clients doing? Can you reproduce the problem using the sender and 
receiver examples?

How many clients are running against each broker?

How easy is it to reproduce the problem?

>>>
>>> After only restarted broker 1, everything starts to work again. So
>>> surprisingly it seems when one of the brokers in the cluster suffered
>>> a problem, the whole cluster just stalled, at least from the
>>> consumer's point of view ( I can't be sure if the producer was
>>> working during the down time, after back to normal, consumer did
>>> receive messages sent sometime ago ). Consumer program uses
>>> FailoverManager and AsyncSession, basically not far from the failover
>>> example in the qpid developing doc. So can anyone please tell me what
>>> the above error message means and have we seen similar problems to
>>> the cluster before?

Yes, I've seen similar problems before, but I believe they are all fixed on
trunk at this point. It might be the issue fixed by

http://svn.apache.org/viewvc?view=revision&revision=799687

If I can reproduce the problem, then I can verify whether it is fixed on trunk.

Cheers,
Alan.
