qpid-users mailing list archives

From Jeff Armstrong <jarmstr...@avvasi.com>
Subject Major slowdown on Qpid broker
Date Tue, 13 Mar 2012 22:35:31 GMT
I have a situation where the Qpid broker slows down tremendously, to the point where enqueues
stop altogether for long periods of time and dequeuing is also quite slow. When I look at
htop, there are 2 qpid threads running at 100% CPU. In gdb, every backtrace I take shows
these two threads somewhere in RingQueuePolicy::find(), and further up the stack they are
in DeliveryRecord::accept().

Qpid broker/client setup:
- Ubuntu 10.04
- 0.12 C++ clients and brokers
- Exchange options: direct
- Queue options: Ring policy, max size ~400MB
- Subscriber options: autoAck = 0, acceptMode = ACCEPT_MODE_EXPLICIT, completionMode = COMPLETE_ON_ACCEPT
- Message options: Delivery mode PERSISTENT

I have 2 blades in a chassis, each with a broker running and a single sender client that enqueues
different messages to several bindings on the local broker. Each blade also has two receiver
clients that dequeue messages, but only one blade's receiver clients are "active", meaning
they connect to the brokers on both blades, whereas the receiver clients on the "standby"
blade do nothing. If the active blade goes down, the standby blade becomes active and its
receiver clients connect to the brokers and start dequeuing, while the other blade
eventually reboots into standby mode.

The two receiver clients each subscribe to their own queue, and both queues are attached to
the same binding on the same exchange. The clients' normal behaviour is to dequeue messages for
15 minutes, finish processing them, then send an accept() on the subscription for all of the
processed ids. The idea is that if the active blade goes down, any messages that were not yet
accept()ed will be lost by the clients on that blade, so the clients on the standby blade will
then connect and should get these messages redelivered. This seems to have worked in the past.
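
To make the intended behaviour concrete, here is a minimal toy model of the batch-accept
pattern described above. It is only a sketch: ToySubscription and its methods are
hypothetical names for illustration, not part of the Qpid client API, and the model ignores
sessions, queues and persistence entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy model only, NOT the Qpid API: tracks which delivered ids are still unaccepted.
class ToySubscription {
public:
    // Record that a message id was delivered to the client.
    // In this model each id is delivered at most once.
    void deliver(std::uint64_t id) {
        bool inserted = unaccepted_.insert(id).second;
        assert(inserted);
        (void)inserted;
    }

    // Accept a whole batch of processed ids in one call, as our clients do
    // at the end of each 15 minute period.
    void accept(const std::vector<std::uint64_t>& ids) {
        for (std::uint64_t id : ids) unaccepted_.erase(id);
    }

    // Ids that a client on the newly active blade would expect to see
    // redelivered, in ascending order.
    std::vector<std::uint64_t> pendingRedelivery() const {
        return std::vector<std::uint64_t>(unaccepted_.begin(), unaccepted_.end());
    }

private:
    std::set<std::uint64_t> unaccepted_;
};
```

For example, after delivering ids 1 to 5 and accepting {1, 2, 3}, pendingRedelivery() in
this model returns ids 4 and 5, which is the set we would expect the standby clients to
receive.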

The following events occurred (note that only blade 1 is actually enqueuing to its broker;
blade 2 has no enqueuing going on, which is on purpose):
- blade 1 (active) and blade 2 (standby)
- blade 1 reboots, so blade 2 becomes active then blade 1 comes up into standby
- blade 2 then reboots, so blade 1 becomes active then blade 2 goes into standby

We then made the following observations:
- When blade 2 reboots, and blade 1 becomes active, the receiver clients never output any
of the expected redelivered messages. We think that the redelivery never took place.
- When inspecting the 'unacked' queue in SemanticState (and also the queue in RingQueuePolicy)
in gdb, we noticed about 100,000 messages in each client's queue with old sequence numbers
that correspond to 2/3 of the messages that we never saw redelivered.
- The first 1/3 or so of the messages we expected to be redelivered weren't in those queues.
- When we finally stopped one of the receiver clients, it cored (aborted), the other receiver
client died, and the qpid broker also cored.
- A qpid::TransportFailure exception was logged right before all of these crashes.
Here are some of our thoughts/questions:
- We think the 1/3 of the messages that vanished might have been lost because the queue
filled and the ring policy caused them to be deleted.
- We think that the 2/3 of the messages we expected to be redelivered might not have been
redelivered because the sessions of the new clients might have started before the sessions
of the clients that went down in the reboot had ended. Is there some sort of session timeout
that must expire before the new session gets these messages redelivered? What happens in
this case?
- We think the slowdown is because of the 100,000 unaccepted messages at the front of the
RingQueuePolicy's queue. We send about 200,000 ids to accept after a 15 minute period, so
for each of these ids the broker would have to traverse the 100,000 unaccepted ids. Could
this account for such a huge slowdown and 100% CPU usage on the accept() with 200,000 ids?
- No ideas about the crashes that occurred when we tried to stop one of the receiver clients
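
As a rough sanity check of the slowdown theory, the sketch below counts the comparisons made
if each accepted id is located by a linear scan from the front of a queue that begins with a
block of stale, never-accepted ids. This is an assumption on our part: we only know the
backtraces land in RingQueuePolicy::find(), not how it searches. countScanComparisons is a
hypothetical helper, not Qpid code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Counts comparisons when each accepted id is found by scanning from the front
// of a queue whose first `stale` entries are never accepted.
// ASSUMPTION: a front-to-back linear scan; the real RingQueuePolicy::find()
// in Qpid 0.12 may behave differently.
std::uint64_t countScanComparisons(std::size_t stale, std::size_t accepted) {
    std::deque<std::uint64_t> queue;
    for (std::size_t i = 0; i < stale + accepted; ++i) queue.push_back(i);

    std::uint64_t comparisons = 0;
    for (std::size_t k = 0; k < accepted; ++k) {
        const std::uint64_t target = stale + k;  // next id being accepted
        std::size_t pos = 0;
        while (true) {
            assert(pos < queue.size());  // the target is always present
            ++comparisons;
            if (queue[pos] == target) break;
            ++pos;
        }
        // The accepted entry leaves the queue; the stale ones stay in front.
        queue.erase(queue.begin() + static_cast<std::ptrdiff_t>(pos));
    }
    return comparisons;
}
```

In this model every accept has to step over all of the stale entries first, so the total is
accepted x (stale + 1). Scaled to our numbers (100,000 stale ids, 200,000 accepted ids), that
is roughly 2 x 10^10 comparisons for a single accept() call, which would be consistent with
threads pinned at 100% CPU if the real lookup is linear.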

Any help or ideas on this issue would be much appreciated. Thank you.
