qpid-users mailing list archives

From Jakub Scholz <ja...@scholz.cz>
Subject Re: How to detect the initial synchronization of brokers in cluster
Date Thu, 21 Jul 2011 17:25:17 GMT
Hi Alan,

Thanks for your answer ...

> Yes, during update the cluster accepts connections but doesn't read from
> them until update is complete.
> Would it improve things for you if a cluster rejected connections during
> update - allowing the client to fail over to a different broker?

I don't think failover is possible here, because all running
members of the cluster are in this state. I assume this is because
they cannot synchronize the store while changing it at the same
time ... so failover within the cluster is not possible anyway.

I think that rejecting the connections would be better than
accepting them and doing nothing. There are some inconsistencies
in the behaviour of the APIs ... the Python QPID API seems to wait
forever, Java seems to time out, and QMF behaves strangely (as
mentioned). So I think that rejecting the connections or, maybe even
better, letting them time out would be a better approach.

But I don't think this is a big issue, since it doesn't solve the
initial "problem" - that the broker is not working and ready for
clients.

>> Does someone know whether ...
>> - There is some way to detect that the cluster is currently
>> synchronizing new node or that the synchronization already finished?
>
> Not presently. What would you do with that information if you could get it?

I believe it is important information for the party operating the
broker - not the fact that it is synchronizing, but the fact that it
is not ready / fully operational. With a very large store, I start
all the brokers in the cluster as daemons and all the calls to qpidd
return success. So I would expect that everything is up and running,
ready for clients to connect and send messages. But it isn't - it
is still synchronizing the store for another 10 minutes.
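
To illustrate the kind of readiness check an operator could run today, here
is a minimal sketch of an external probe. It is not part of Qpid; the
function name broker_responds and its parameters are hypothetical. It relies
on two assumptions: that a ready broker sends at least some bytes back after
receiving the AMQP 0-10 protocol header, and that a broker which is still
synchronizing accepts the TCP connection but stays silent.

```python
import socket

# AMQP 0-10 protocol header: "AMQP", class 1, instance 1, version 0-10.
AMQP_0_10_HEADER = b"AMQP\x01\x01\x00\x0a"


def broker_responds(host, port=5672, timeout=5.0):
    """Return True if the peer sends any bytes back after our protocol header.

    Assumption (not confirmed by the Qpid documentation): a broker that is
    still synchronizing accepts the TCP connection but sends nothing, so the
    read below times out and the probe reports False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(AMQP_0_10_HEADER)
            # An empty result means the peer closed without answering.
            return len(sock.recv(8)) > 0
    except OSError:
        # Covers refused connections, resets, and read timeouts.
        return False
```

A start-up script could call broker_responds in a loop after launching qpidd
and report the broker as ready only once it returns True.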

>> - In case I stopped all my brokers cleanly using qpid-cluster, is
>> there some way to start all brokers synchronously from their own
>> stores (which are clean at the moment)? It would save a lot of time
>> for the initial store synchronization.
>
> That should work. You need to configure cluster-size so the brokers can all
> synchronize their initial state. If it doesn't work raise a JIRA and mail me
> so I don't miss it.

I found the "cluster-size" option only yesterday. It works and it
seems to help with this problem. I had actually seen this option
before, but I assumed it was the minimum number of nodes the cluster
needs in order to run in general, not just during initial startup. A
name like "cluster.initial-size" would probably be more appropriate
(just a comment, it doesn't really make sense to rename it now).

>> - Is the QMF problem described above known (I didn't found anything in
>> JIRA)? Eventually is it worth entering an Issue report?
>
> I'm guessing that python client sets up its timeout after the connection is
> open, so it's not taking effect if the connection opening itself takes a
> long time (which happens because of the "half running" state.) If that the
> case it's worth a JIRA.
>

It is actually the opposite ... the timeout in the QMF API seems to be
used only for connecting to the broker. And since the broker accepts
connections even while it is still synchronizing, the client connects
without any problems. It seems to time out only later, in one of the
other QMF calls. I will try to find where exactly it times out and I
will enter a JIRA.
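
Since the timeout apparently covers only the connect step, a client-side
workaround could be to put an overall deadline around any blocking call. The
following is a generic sketch, not QMF code: call_with_deadline and
DeadlineExceeded are hypothetical names, and func stands for whatever
blocking client call is being made.

```python
import threading


class DeadlineExceeded(Exception):
    """Raised when the wrapped call does not finish in time."""


def call_with_deadline(func, *args, timeout=10.0, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread with an overall deadline.

    Returns the call's result, re-raises its exception, or raises
    DeadlineExceeded if it is still running after `timeout` seconds.
    """
    result = {}

    def worker():
        try:
            result["value"] = func(*args, **kwargs)
        except BaseException as exc:
            result["error"] = exc

    # Daemon thread: a call that never returns will not block process exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The stuck call is abandoned, not cancelled; it keeps running.
        raise DeadlineExceeded("call did not finish within %.1fs" % timeout)
    if "error" in result:
        raise result["error"]
    return result["value"]
```

The stalled call is only abandoned, so any connection it holds stays open
until the process exits. That is acceptable for short-lived command line
tools but probably not for a long-running service.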

Thanks & Regards
Jakub


On Thu, Jul 21, 2011 at 16:18, Alan Conway <aconway@redhat.com> wrote:
> On 07/15/2011 11:43 AM, Jakub Scholz wrote:
>>
>> Hi,
>>
>> We are running the Qpid/MRG broker in cluster consisting usually from
>> 2-4 nodes. Our broker configuration contains many persistent queues
>> (currently several GB, in the future even more). When starting the
>> cluster (with already existing store), the nodes are being started one
>> after another. Due to the way the clustering is designed, the first
>> broker starts and recovers the store from the disk. All the other
>> brokers which start later have to discard the existing store and get
>> all data freshly from the broker(s) which is already running. This
>> initial synchronization of the store can be quite long if the store
>> contains several GB of persistent queues (that's something to be
>> expected).
>>
>> Unfortunately, during this initial synchronization, the cluster seems
>> to be in some kind of "half running" state. Clients seem to be able to
>> connect, but they cannot do anything.
>
> Yes, during update the cluster accepts connections but doesn't read from
> them until update is complete.
> Would it improve things for you if a cluster rejected connections during
> update - allowing the client to fail over to a different broker?
>
>> Especially all python based
>> tools using QMF (e.g. qpid-config, qpid-cluster) seem to timeout
>> during this period instead of waiting for the cluster synchronization
>> to finish. The --timeout parameter for the qpid-config (which is
>> passed to the QMF API in the session->addBroker method) doesn't really
>> seem to help here.
>>
>> Does someone know whether ...
>> - There is some way to detect that the cluster is currently
>> synchronizing new node or that the synchronization already finished?
>
> Not presently. What would you do with that information if you could get it?
>
>> - In case I stopped all my brokers cleanly using qpid-cluster, is
>> there some way to start all brokers synchronously from their own
>> stores (which are clean at the moment)? It would save a lot of time
>> for the initial store synchronization.
>
> That should work. You need to configure cluster-size so the brokers can all
> synchronize their initial state. If it doesn't work raise a JIRA and mail me
> so I don't miss it.
>
>> - Is the QMF problem described above known (I didn't found anything in
>> JIRA)? Eventually is it worth entering an Issue report?
>
> I'm guessing that python client sets up its timeout after the connection is
> open, so it's not taking effect if the connection opening itself takes a
> long time (which happens because of the "half running" state.) If that the
> case it's worth a JIRA.
>

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org

