qpid-users mailing list archives

From Jeff Armstrong <jarmstr...@avvasi.com>
Subject RE: Major slowdown on Qpid broker
Date Wed, 14 Mar 2012 18:13:19 GMT

From: Gordon Sim [gsim@redhat.com]
Sent: Wednesday, March 14, 2012 11:30 AM
To: users@qpid.apache.org
Subject: Re: Major slowdown on Qpid broker

On 03/14/2012 02:20 PM, Jeff Armstrong wrote:
> Answers are inline.
> ________________________________________
> From: Gordon Sim [gsim@redhat.com]
> Sent: Wednesday, March 14, 2012 6:39 AM
> To: users@qpid.apache.org
> Subject: Re: Major slowdown on Qpid broker
> On 03/13/2012 10:35 PM, Jeff Armstrong wrote:
>> If an active blade goes down, the standby blade becomes active, the receiver clients
there will now connect to the brokers and start dequeuing,
> How do the published messages get to the other broker?
> Jeff: In this case, the unacquired messages on the broker of the blade that goes down
will be lost, which is expected at this point.

So which messages do you expect to be redelivered?

Jeff: Blade 1 is in standby and Blade 2 is active. The client on blade 2 will be acquiring
messages from the broker on blade 1. Blade 2 then goes down before it has a chance to ack
the messages. Blade 1 becomes active, so the client there should connect to the broker and
get those messages redelivered, but none of these were.

>> while the other blade will eventually reboot into standby mode.
>> The two receiver clients each subscribe to their own single queue, which are attached
to the same binding on the same exchange. The clients' normal behaviour is to dequeue messages
for 15 minutes, finish processing them, then send an accept() on the subscription of all the
processed id's. The idea is that if the active blade goes down, all of the messages that were
not accept()ed will be lost, so the clients on the standby blade will then connect and should
get these messages redelivered. This seems to have worked in the past.

How do those messages that were not accepted get to the standby broker?

Jeff: Only the messages on the broker of the blade that didn't go down should be redelivered.
Messages do not move between brokers. I hope the response above clears this up.

>> The following events occurred (note that only blade 1 is actually enqueuing to its
broker, blade 2 has no enqueuing going on, this is on purpose):
>> - blade 1 (active) and blade 2 (standby)
>> - blade 1 reboots, so blade 2 becomes active then blade 1 comes up into standby
>> - blade 2 then reboots, so blade 1 becomes active then blade 2 goes into standby
>> We then made the following observations:
>> - When blade 2 reboots, and blade 1 becomes active, the receiver clients never output
any of the expected redelivered messages. We think that the redelivery never took place.
>> - When inspecting the 'unacked' queue in SemanticState (and also the queue in RingQueuePolicy)
in gdb, we noticed about 100,000 messages in each client's queue with old sequence numbers
that correspond to 2/3 of the messages that we never saw redelivered
>> - The first 1/3 or so of the messages we expected to be redelivered weren't in those
>> - When we finally stopped one of the receiver clients, it cored (aborted), the other
receiver client died, and the qpid broker also cored
>> - There was a logged qpid::TransportFailure exception that happened right before
all of these crashed
>> Here are some of our thoughts/questions:
>> - We think the 1/3 of the messages that vanished might have been because the queue
filled, and the ring policy caused them to be deleted
>> - We think that the 2/3 of the messages we expected to be redelivered, might not
have got redelivered because the session on the new clients might have been started before
the sessions of the clients that went down with a reboot were ended. Is there some sort of
session timeout that must occur before the new session gets these redelivered? What happens
in this case?
> I'm still not quite clear on what exactly your clients do.
> Jeff: A receiver client subscribes to a single queue, and using a qpid::client::LocalQueue,
gets messages, does some processing on them, and writes the processed messages to an open
temporary file. On a 15-minute interval, the client will move the temporary file to a permanent
output directory, and then send an accept() to the broker for all the messages that were in
that file, since they have now been fully processed.
>> - We think the slowdown is because of the 100,000 unaccepted messages on the front
of the RingQueuePolicy's queue. We send about 200,000 ids to accept after a 15 minute period,
so for each of these messages, it will have to traverse over the 100,000 unaccepted ids. Could
this account for such a huge slowdown and 100% cpu usage on the accept() with 200,000 ids?
> Yes, it could, especially if some of those messages have been removed
> from the ring queue already to make room for newer messages.
> Jeff: After looking through the code and doing some debugging on the broker, it's still
not clear to me how the messages are stored. There seem to be several deques that correspond
to a single queue on the broker. Are you saying that if a message is removed from the ring
queue, the broker still maintains a reference to that message if it was unacked? If so, is
this a leak, or does a copy of the actual message still exist somewhere else?
> Can you send accepts more frequently? Batching accepts is good to some
> extent, but if you can reduce the set of in-doubt messages held by the
> broker you will likely improve the performance.
> Jeff: I can configure the time interval to be a bit lower. I think the performance is
only affected if the broker keeps a bunch of unacked messages that it also never redelivers
(which sounds like a bug).
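
One way to keep the set of in-doubt messages smaller is to cap the batch by count as well as by time. The sketch below is only an illustration: `AcceptBatcher`, `maxPending`, and the `sendAccept` callback are hypothetical names, not part of the Qpid client API, and the callback stands in for whatever the client does to accept a set of ids. In the client described above, a message only counts as processed once its temporary file has been moved to the output directory, so a smaller batch would also mean rotating that file more often.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the Qpid client API.
// Collects ids of fully processed messages and hands them to `sendAccept`
// in bounded batches, so fewer messages stay unaccepted at any one time.
class AcceptBatcher {
public:
    using AcceptFn = std::function<void(const std::vector<std::uint64_t>&)>;

    AcceptBatcher(std::size_t maxPending, AcceptFn sendAccept)
        : maxPending_(maxPending), sendAccept_(std::move(sendAccept)) {
        assert(maxPending_ > 0);  // a zero-sized batch could never be filled
    }

    // Record one processed message; flush as soon as the batch is full.
    void processed(std::uint64_t id) {
        pending_.push_back(id);
        if (pending_.size() >= maxPending_) flush();
    }

    // Send whatever is pending, for example when the interval timer fires.
    void flush() {
        if (pending_.empty()) return;
        sendAccept_(pending_);
        pending_.clear();
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t maxPending_;
    AcceptFn sendAccept_;
    std::vector<std::uint64_t> pending_;
};
```

A client would call `processed(id)` after each message is safely written out and `flush()` from its existing interval timer, so the timer becomes an upper bound on batch age rather than the only trigger.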

The broker certainly tracks delivered messages until they are acked. If
the session they were delivered to ends without the messages being acked
they will be requeued and redelivered to the next available subscriber.
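
The behaviour described here can be pictured with a small toy model. It is not the broker's code: `ToyBroker`, the integer session ids, and the single shared queue are assumptions made for illustration only. It shows that messages delivered to one session stay invisible to other subscribers until that session ends, at which point they go back on the queue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <vector>

// Toy model only; names and structure are hypothetical, not the Qpid broker.
class ToyBroker {
public:
    void enqueue(std::uint64_t id) { queue_.push_back(id); }

    // Deliver up to `max` messages to `session`. Delivered messages leave the
    // queue but are tracked as unaccepted until acceptAll() or endSession().
    std::vector<std::uint64_t> deliver(int session, std::size_t max) {
        assert(max > 0);  // a zero-sized request is a caller error in this sketch
        std::vector<std::uint64_t> out;
        while (out.size() < max && !queue_.empty()) {
            out.push_back(queue_.front());
            unacked_[session].push_back(queue_.front());
            queue_.pop_front();
        }
        return out;
    }

    // The session accepted everything delivered to it so far.
    void acceptAll(int session) { unacked_.erase(session); }

    // The session ended without accepting: put its messages back at the
    // front of the queue, in their original order, for the next subscriber.
    void endSession(int session) {
        auto it = unacked_.find(session);
        if (it == unacked_.end()) return;
        queue_.insert(queue_.begin(), it->second.begin(), it->second.end());
        unacked_.erase(it);
    }

    std::size_t queued() const { return queue_.size(); }

private:
    std::deque<std::uint64_t> queue_;
    std::map<int, std::vector<std::uint64_t>> unacked_;
};
```

In this model a second session sees nothing from the first session's unaccepted set until `endSession` is called for the first one. How and when the real broker decides that a session has ended is not covered by the sketch.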

Jeff: What happens if a session has unacked messages, the client dies (blade goes down), and
then the client on the other blade becomes active and creates a session before the broker
realizes that the old session should be gone? I'm assuming there is some sort of session timeout
and this could occur. If so, I would expect the new session not to immediately get the redeliveries
until AFTER the broker times out the old session - is there a mechanism to handle this case?
Also, when I examined the queue in RingQueuePolicy (member 'queue') and the one in SemanticState
(member 'unacked'), both appeared to have the same messages - the ones that should have been
redelivered but weren't.

As above, I'm not clear in the case of a blade failure how these
messages from the active broker get to the standby broker from where
they can be redelivered... or is the expectation simply that since they
are durable they will be redelivered once the original active broker
recovers and becomes active again?

In the latter case, the recovery should have all the unacked messages
back on the queue ready for delivery to any available subscriber.
(Obviously in the case of a ring queue only a finite number of messages
will be recovered, as once the configured size is reached the older
messages are deleted).
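
As a rough picture of the size limit mentioned above, the following sketch evicts the oldest entries until a new message fits. `RingQueue`, `maxBytes`, and the per-message byte counts are illustrative assumptions; the real policy's accounting and data structures may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Illustrative size-bounded ring queue; not the broker's RingQueuePolicy.
class RingQueue {
public:
    explicit RingQueue(std::uint64_t maxBytes) : maxBytes_(maxBytes) {}

    // Add a message, deleting the oldest ones until it fits.
    // Returns the ids that were deleted to make room.
    std::vector<std::uint64_t> push(std::uint64_t id, std::uint64_t bytes) {
        assert(bytes <= maxBytes_);  // a single message must fit on its own
        std::vector<std::uint64_t> evicted;
        while (!messages_.empty() && usedBytes_ + bytes > maxBytes_) {
            evicted.push_back(messages_.front().id);
            usedBytes_ -= messages_.front().bytes;
            messages_.pop_front();
        }
        messages_.push_back(Entry{id, bytes});
        usedBytes_ += bytes;
        return evicted;
    }

    std::size_t size() const { return messages_.size(); }
    std::uint64_t usedBytes() const { return usedBytes_; }

private:
    struct Entry {
        std::uint64_t id;
        std::uint64_t bytes;
    };
    std::uint64_t maxBytes_;
    std::uint64_t usedBytes_ = 0;
    std::deque<Entry> messages_;
};
```

With a 100-byte limit and 40-byte messages, the third push deletes the first message. This is the kind of silent loss suspected earlier in the thread for the oldest third of the expected redeliveries.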

> The other strange thing is that I have tried to simulate this same scenario by acquiring
100k messages and never accepting them, then continually acquiring and then accepting batches
of 200k messages, and the performance is still very fast. The difference was that it never
seems to call into the RingQueuePolicy, since the policy pointer is null on the queue. I guess
this means that most of the work is actually done in the RingQueuePolicy::dequeued()/RingQueuePolicy::find()
- which matches the fact that every backtrace I got was somewhere in there.

Yes, the ring queue has some undesirable inefficiencies, particularly
where you ack a message that has already been deleted on a large queue.
(That is something that we will be fixing in the not too distant future
I hope, enabled by some refactoring of the queueing code).
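
To get a feel for why a block of unaccepted messages at the front of a list could be so expensive, here is a deliberately simplified cost model. It assumes that every accept searches a single list linearly from the front. That assumption is made for illustration and is not taken from the broker source, and `comparisonsForAccept` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Simplified cost model, assuming one front-to-back linear search per accept.
// Ids [1, stale] stay unaccepted at the front of the list; ids
// [stale + 1, stale + accepted] are then accepted in order.
// Returns the number of id comparisons performed.
std::uint64_t comparisonsForAccept(std::uint64_t stale, std::uint64_t accepted) {
    std::deque<std::uint64_t> tracked;
    for (std::uint64_t id = 1; id <= stale + accepted; ++id) tracked.push_back(id);

    std::uint64_t comparisons = 0;
    for (std::uint64_t id = stale + 1; id <= stale + accepted; ++id) {
        auto it = tracked.begin();
        for (; it != tracked.end(); ++it) {
            ++comparisons;
            if (*it == id) break;
        }
        assert(it != tracked.end());  // every accepted id is in the list
        tracked.erase(it);
    }
    return comparisons;
}
```

In this model each accept has to walk past the whole stale prefix, so the total is `accepted * (stale + 1)`. Scaled to the figures in this thread (100,000 stale, 200,000 accepted) that would be about 2 x 10^10 comparisons, against 200,000 with no stale prefix. This is consistent with the slowdown only showing up when the ring policy is in the path.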

Jeff: But, if a message is deleted due to the ring queue size being exceeded, shouldn't that
message be fully deleted and removed from all lists? Otherwise the message never truly gets

> On a side note, I'm not sure why my queue on my attempt at simulating the problem didn't
have a policy, since I also set the queue options to have a ring policy. Any ideas there?

Would need more detail on the steps you used.

Jeff: I basically copied the same options/setup.
1) I create an exchange:
    qpid-config add exchange direct EXCHANGENAME
2) I start 2 clients that each create a queue and bind to the same binding and then subscribe
to it:
    QueueOptions queueOptions;
    queueOptions.setSizePolicy(RING, 524288000, 0); // 500 MB
    session.queueDeclare(arg::queue = queueName, arg::arguments = queueOptions);
    session.exchangeBind(queueName, exchangeName, bindingName);
    SubscriptionSettings subSettings;
    subSettings.autoAck = 0;
    subSettings.acceptMode = ACCEPT_MODE_EXPLICIT;
    subSettings.completionMode = COMPLETE_ON_ACCEPT;
    subscriptions.subscribe(localQueue, queueName, subSettings);
3) I start a sender client that continuously sends messages to the binding that I set up:
    Message msg;
    uint32_t messageCount = 0;
    while (true) {
        session.messageTransfer(qpid::client::arg::content=msg, qpid::client::arg::destination=exchangeName);
    }
4) The 2 receiver clients will acquire 100k messages each and never accept them. Then they
will enter a loop where they acquire 200k messages, then ack them all.
5) In gdb, I examine the queue during the accept() on the broker, and it does not have a policy,
so it doesn't call RingQueuePolicy::dequeued().

To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
For additional commands, e-mail: users-help@qpid.apache.org
