qpid-users mailing list archives

From Paul Colby <p...@colby.id.au>
Subject Re: 0.14 cluster never survives more than an hour or so.
Date Sat, 14 Apr 2012 05:42:01 GMT
Thanks for the direction Pavel!!  I've found the problem! :)

In short, it was a configuration management service (Puppet in this case)
restarting the network subsystem on all three servers in the cluster at
once.

I assume at this stage that the network restarts are the result of a Puppet
misconfiguration (I'll check that out with our ops guys on Monday).  But
effectively, as I understand it from looking at the debug logs, the
temporary loss of network connectivity causes all three brokers to think
that they are now the elder (a split-brain scenario), so when the network is
restored seconds later, they all realise something is horribly wrong and
shut down immediately (as they ought).

So, I guess in this case there are two things I should do to guard against
this happening (besides tweaking Puppet) - see the rough config sketch after
the list:
1. increase the cluster-size parameter (currently 0 for testing).
2. use cman - something I will definitely need to look into next :)
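
For the record, here's roughly what I expect to end up with in qpidd.conf on
each of the three nodes.  This is just my reading of the 0.14 cluster module
options, so treat the exact option names and values as unverified until I've
actually tested them:

# NOTE: the options below are my reading of the 0.14 cluster module - untested
# don't start serving clients until all three brokers have joined the cluster
cluster-size=3
# defer to CMAN for quorum, so a partitioned minority shuts itself down
# instead of electing its own elder
cluster-cman=yes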

Thanks again!

pc
----
http://colby.id.au


On Fri, Apr 13, 2012 at 8:48 PM, Pavel Moravec <pmoravec@redhat.com> wrote:

> Hi Paul,
> both errors occur under very similar circumstances. I recommend enabling
> debug logs of cluster component by adding:
>
> log-enable=debug+:cluster
> log-enable=notice+
>
> to qpidd.conf and post the logs to a new JIRA. (You can try enabling trace
> logs, which provide more verbose output, but running traces for half an hour
> would require nontrivial disk space.)
>
> To alleviate the consequences, I think disabling management should help (but
> some other problems can arise later on somewhere else, as this just prevents
> the consequence and not the root-cause bug). Note that some QMF-based
> services (like qpid-tool) won't work with management disabled.
>
> To disable management stuff, add to qpidd.conf:
>
> mgmt-enable=no
>
> Alternatively, one can set the frequency of management updates (which are
> processed by the periodicProcessing task) via the mgmt-pub-interval option
> (10 seconds by default). Setting it to e.g. 2 hours, your qpid cluster will
> run for at least 2 hours without hitting the error. But again, some
> QMF-based services rely on the updates.
>
>
> Kind regards,
> Pavel Moravec
>
>
> ----- Original Message -----
> > From: "Paul Colby" <paul@colby.id.au>
> > To: users@qpid.apache.org
> > Sent: Friday, April 13, 2012 11:02:14 AM
> > Subject: Re: 0.14 cluster never survives more than an hour or so.
> >
> > Alas, the patch at https://issues.apache.org/jira/browse/QPID-3369 has not
> > fixed the issue.
> >
> > Interestingly though, it did move the error to a different line, but with a
> > very similar message, e.g.
> >
> > Apr 13 17:04:17 gateway02 qpidd[32258]: 2012-04-13 17:04:17 critical Error
> > delivering frames: Cluster timer wakeup non-existent task
> > ManagementAgent::periodicProcessing (qpid/cluster/ClusterTimer.cpp:112)
> >
> > So it's moved from ClusterTimer::deliverDrop to ClusterTimer::deliverWakeup
> > instead... but with the same effective result.
> >
> > pc
> > ----
> > http://colby.id.au
> >
> >
> > On Fri, Apr 13, 2012 at 9:30 AM, Paul Colby <paul@colby.id.au> wrote:
> >
> > > Thanks Pavel and Gordon, I really appreciate you guys getting back to me
> > > so quickly :)
> > >
> > > I'm not currently using cman.  I hadn't been using it on 0.12 either.  I
> > > suspect that split-brain is not the case, since the test cluster in
> > > question is on virtual machines all within a single host, with *very*
> > > reliable virtual networking between them.  After reading your response, I
> > > did have a quick look at setting up cman to verify either way, but that's
> > > not proving to be quick and easy, so I'll come back to it shortly.
> > >
> > > The https://issues.apache.org/jira/browse/QPID-3369 issue does look
> > > interesting.  I'll apply the patch suggested there and see what difference
> > > it makes.
> > >
> > > Thanks again.  I'll let you know how it goes :)
> > >
> > > pc
> > > ----
> > > http://colby.id.au
> > >
> > >
> > >
> > > On Thu, Apr 12, 2012 at 9:39 PM, Pavel Moravec
> > > <pmoravec@redhat.com> wrote:
> > >
> > >> Hi Paul,
> > >> this usually happens as a consequence of cluster split-brain. Are you
> > >> using CMAN (Cluster Manager)?
> > >>
> > >> (Technically, when split-brain occurs, two (or more) qpid brokers think
> > >> they are the elder node (elder node = "the managing" node, usually the
> > >> oldest node in the cluster). But there can be just one elder node in a
> > >> cluster, as the elder node periodically invokes the periodicProcessing
> > >> task cluster-wide, and only one instance may run at a time. When more
> > >> elder nodes are present, each invokes the task on every cluster member,
> > >> causing multiple tasks to be executed - and the brokers shut down to
> > >> prevent that.)
> > >>
> > >> Kind regards,
> > >> Pavel Moravec
> > >>
> > >>
> > >> ----- Original Message -----
> > >> > From: "Paul Colby" <paul@colby.id.au>
> > >> > To: users@qpid.apache.org
> > >> > Sent: Thursday, April 12, 2012 5:08:01 AM
> > >> > Subject: 0.14 cluster never survives more than an hour or so.
> > >> >
> > >> > Hi guys,
> > >> >
> > >> > I'm having an issue with my new 0.14 cluster, where the same
> > >> > configuration was fine with 0.12.
> > >> >
> > >> > The cluster starts up, and all brokers are happy.  Then, with no
> > >> > client activity at all, after some seemingly random amount of time
> > >> > (usually around 30 minutes to an hour) all brokers in the cluster
> > >> > (three, in this case) report the following error:
> > >> >
> > >> > critical Error delivering frames: Cluster timer drop non-existent task
> > >> > ManagementAgent::periodicProcessing (qpid/cluster/ClusterTimer.cpp:128)
> > >> >
> > >> > Then they all shut down, leaving their respective stores dirty :(
> > >> >
> > >> > Any ideas what might be going wrong here?
> > >> >
> > >> > Thanks,
> > >> >
> > >> > pc
> > >> > ----
> > >> > http://colby.id.au
> > >> >
> > >>
> > >>
> > >>
> > >
> >
>
>
>
