qpid-dev mailing list archives

From Carl Trieloff <cctriel...@redhat.com>
Subject Re: An ill borker brings down the whole cluster
Date Tue, 03 Nov 2009 13:53:03 GMT

I don't have enough info to comment on the root cause; maybe Alan can,
based on the log snippet. However, there is a plug-in module that can be
run on the nodes in a cluster that will remove any stalled node, so that
the rest of the cluster can continue to operate as normal.

For example, if you SIGSTOP one broker in a cluster, the rest of the
cluster will continue to run, but AIS will cache for the node that is
stopped. That node must be evicted at some point if it does not get a
SIGCONT within a period of time. The watchdog plugin does this for you,
at which point you can join another node to the cluster.
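
To make the idea concrete, here is a toy sketch of timeout-based
eviction. It is not the qpid watchdog plugin and does not use any qpid
or AIS API; the class name, method names, and the 10-second timeout are
all made up for illustration. It only models the rule "a member that
stays silent longer than a timeout gets removed, and the others carry
on".

```python
from dataclasses import dataclass, field


@dataclass
class ClusterWatchdog:
    """Toy model (hypothetical, not qpid code): evict silent members."""

    timeout: float  # seconds of silence tolerated; value is an assumption
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # A healthy broker reports in periodically; a stopped one cannot.
        self.last_seen[node] = now

    def evict_stalled(self, now: float) -> list:
        # Find every member that has been silent longer than the timeout.
        stalled = sorted(
            n for n, t in self.last_seen.items() if now - t > self.timeout
        )
        # Remove them so the remaining members are no longer held up.
        for node in stalled:
            del self.last_seen[node]
        return stalled

    def members(self) -> list:
        return sorted(self.last_seen)


def simulate() -> tuple:
    wd = ClusterWatchdog(timeout=10.0)
    wd.heartbeat("broker1", now=0.0)
    wd.heartbeat("broker2", now=0.0)
    # broker1 is stopped after t=0; broker2 keeps reporting.
    wd.heartbeat("broker2", now=8.0)
    wd.heartbeat("broker2", now=16.0)
    evicted = wd.evict_stalled(now=16.0)
    return evicted, wd.members()


if __name__ == "__main__":
    print(simulate())
```

In this model a stopped broker that resumes before the timeout simply
sends its next heartbeat and stays a member, which matches the
"evict only if no SIGCONT arrives in time" behaviour described above.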

In other words, running the watchdog would have removed the unresponsive
broker in your example below. The second part is to understand why it
was unresponsive.

Carl.


Shan Wang wrote:
> Hi All,
>
> We have two qpid 0.5 brokers running in cluster mode on two different
> boxes. The cluster works fine in normal cases, i.e., if broker1 is shut
> down cleanly, broker2 keeps serving clients. But today we found that
> one broker suddenly stopped responding to all connected clients and
> admin tools. All producer and consumer clients were still connected but
> failed to consume any messages from the queue. The command-line admin
> tool failed with a timeout error. The only error message we found is in
> the log of broker 1, which said this:
>
> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel error 157487219 on 172.27.34.201:9908-389(local): transport-busy: Channel 1 already attached to guest@QPID.amq.failover676a76fa-5664-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150) (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>
> After restarting only broker 1, everything started to work again. So,
> surprisingly, it seems that when one of the brokers in the cluster
> suffered a problem, the whole cluster just stalled, at least from the
> consumer's point of view (I can't be sure whether the producer was
> working during the downtime; after things went back to normal, the
> consumer did receive messages sent some time earlier). The consumer
> program uses FailoverManager and AsyncSession, basically not far from
> the failover example in the qpid developer documentation. So can anyone
> please tell me what the above error message means, and whether similar
> problems with the cluster have been seen before?
>
>
> Regards,
> Shan


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org

