qpid-users mailing list archives

From Pavel Moravec <pmora...@redhat.com>
Subject Re: 0.14 cluster never survives more than an hour or so.
Date Fri, 13 Apr 2012 10:48:47 GMT
Hi Paul,
both errors occur under very similar circumstances. I recommend enabling debug logs of the cluster
component by adding:

log-enable=debug+:cluster
log-enable=notice+

to qpidd.conf and posting the logs to a new JIRA. (You can also try enabling trace logs, which might
provide more verbose output, but running traces for half an hour would require a nontrivial amount of
disk space.)
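
If disk space allows, trace logging for the cluster component alone could presumably be
enabled with the same selector syntax (an untested sketch, assuming the trace level is
accepted in the same form as debug above):

log-enable=trace+:cluster
log-enable=notice+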

To alleviate the consequences, I think disabling management should help (though other problems
may arise later elsewhere, as this only prevents the consequence and does not fix the root-cause
bug). Also, some QMF-based services (like qpid-tool) won't work with management disabled.

To disable management, add to qpidd.conf:

mgmt-enable=no

Alternatively, one can set the frequency of management updates (which are processed by the
periodicProcessing task); see the mgmt-pub-interval option (10 seconds by default). If you set it
to e.g. 2 hours, your qpid cluster will run for at least 2 hours without the error. But again,
some QMF-based services rely on the updates.
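
For example, assuming the option takes its value in seconds (as the 10-second default
suggests), a 2-hour interval would look like this in qpidd.conf:

mgmt-pub-interval=7200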


Kind regards,
Pavel Moravec


----- Original Message -----
> From: "Paul Colby" <paul@colby.id.au>
> To: users@qpid.apache.org
> Sent: Friday, April 13, 2012 11:02:14 AM
> Subject: Re: 0.14 cluster never survives more than an hour or so.
> 
> Alas the patch at https://issues.apache.org/jira/browse/QPID-3369 has not
> fixed the issue.
> 
> Interestingly though, it did move the error to a different line, but with a
> very similar message, e.g.:
> 
> Apr 13 17:04:17 gateway02 qpidd[32258]: 2012-04-13 17:04:17 critical Error delivering frames: Cluster timer wakeup non-existent task ManagementAgent::periodicProcessing (qpid/cluster/ClusterTimer.cpp:112)
> 
> So it's moved from ClusterTimer::deliverDrop to ClusterTimer::deliverWakeup
> instead... but with the same effective result.
> 
> pc
> ----
> http://colby.id.au
> 
> 
> On Fri, Apr 13, 2012 at 9:30 AM, Paul Colby <paul@colby.id.au> wrote:
> 
> > Thanks Pavel and Gordon, I really appreciate you guys getting back to me
> > so quickly :)
> >
> > I'm not currently using cman.  I hadn't been using it on 0.12 either.  I
> > suspect that split-brain is not the case, since the test cluster in
> > question is on virtual machines all within a single host, with *very*
> > reliable virtual networking between them.  After reading your response, I
> > did have a quick look at setting up cman to verify either way, but that's
> > not proving to be quick and easy, so I'll come back to it shortly.
> >
> > The https://issues.apache.org/jira/browse/QPID-3369 issue does look
> > interesting.  I'll apply the patch suggested there and see what difference
> > it makes.
> >
> > Thanks again.  I'll let you know how it goes :)
> >
> > pc
> > ----
> > http://colby.id.au
> >
> >
> >
> > On Thu, Apr 12, 2012 at 9:39 PM, Pavel Moravec <pmoravec@redhat.com> wrote:
> >
> >> Hi Paul,
> >> this usually happens as a consequence of cluster split-brain. Are you
> >> using CMAN (Cluster Manager)?
> >>
> >> (Technically, when split-brain occurs, two (or more) qpid brokers think
> >> they are the elder node (elder node = "the managing" node, usually the
> >> node that is oldest in the cluster). But there can be just one elder node
> >> in a cluster, as the elder node periodically invokes the periodicProcessing
> >> task cluster-wide, and only one instance of it can run at a time. When more
> >> elder nodes are present, each of them invokes the task on every cluster
> >> member, causing more tasks to be executed - that is prevented by broker
> >> shutdown.)
> >>
> >> Kind regards,
> >> Pavel Moravec
> >>
> >>
> >> ----- Original Message -----
> >> > From: "Paul Colby" <paul@colby.id.au>
> >> > To: users@qpid.apache.org
> >> > Sent: Thursday, April 12, 2012 5:08:01 AM
> >> > Subject: 0.14 cluster never survives more than an hour or so.
> >> >
> >> > Hi guys,
> >> >
> >> > I'm having an issue with my new 0.14 cluster, where the same configuration
> >> > was fine with 0.12.
> >> >
> >> > The cluster starts up, and all brokers are happy.  Then, with no client
> >> > activity at all, after some seemingly random amount of time (usually
> >> > around 30 minutes to an hour) all brokers in the cluster (three, in this
> >> > case) report the following error:
> >> >
> >> > critical Error delivering frames: Cluster timer drop non-existent task
> >> > ManagementAgent::periodicProcessing (qpid/cluster/ClusterTimer.cpp:128)
> >> >
> >> > Then they all shut down, leaving their respective stores dirty :(
> >> >
> >> > Any ideas what might be going wrong here?
> >> >
> >> > Thanks,
> >> >
> >> > pc
> >> > ----
> >> > http://colby.id.au
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
> >> For additional commands, e-mail: users-help@qpid.apache.org
> >>
> >>
> >
> 

