couchdb-user mailing list archives

From Ciprian Trusca <CTru...@totalsoft.ro>
Subject RE: repeated compaction timeouts cause the server to shut down temporarily when replication is broken
Date Fri, 05 Dec 2014 07:51:56 GMT
We have turned on debugging for this test and it looks like the cause of this error is the
_replicator database.  

After the list of fragmented databases we see no evidence in the log that compaction for this database is ever started (although its fragmentation is reported and is above the 70% threshold), and then the compaction loop dies after approximately 5 seconds. So I am guessing that CouchDB fails to spawn the compaction process.
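
In case it helps anyone reproduce this, here is a rough sketch (mine, for illustration only) of how to check the _replicator database's fragmentation the same way the compaction daemon does and kick off a compaction by hand over the HTTP API. The localhost:5984 URL and the admin:secret credentials are placeholders; adjust for your setup.

#!/usr/bin/env python3
# Rough sketch only: compute the _replicator database's fragmentation the same
# way the compaction daemon does, and trigger a compaction by hand.
# The URL and the admin:secret credentials are placeholders for this example.
import base64
import json
import urllib.request

COUCH = "http://localhost:5984"
AUTH = "Basic " + base64.b64encode(b"admin:secret").decode()
HEADERS = {"Authorization": AUTH, "Content-Type": "application/json"}

def couch_request(method, path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(COUCH + path, data=data, headers=HEADERS,
                                 method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())

info = couch_request("GET", "/_replicator")
disk_size = info["disk_size"]
data_size = info["data_size"] or 0
# db_fragmentation as the compaction daemon computes it:
# (disk_size - data_size) / disk_size * 100
fragmentation = (disk_size - data_size) * 100.0 / disk_size
print("_replicator fragmentation: %.1f%%" % fragmentation)

if fragmentation > 70:
    # Ask CouchDB to compact the database; the call returns {"ok": true}
    # immediately and compaction runs in the background.
    print(couch_request("POST", "/_replicator/_compact", body={}))
    print("compact_running:", couch_request("GET", "/_replicator")["compact_running"])

If the manual POST to /_replicator/_compact also hangs or times out, I would take that as pointing at the same problem the daemon's gen_server call is timing out on.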

I forgot to mention in the first post that we are running CouchDB 1.6.1 on a CentOS 6.4 server.

Thanks for your time; any help will be appreciated.

-----Original Message-----
From: Ciprian Trusca [mailto:CTrusca@totalsoft.ro] 
Sent: Thursday, November 27, 2014 10:17 AM
To: user@couchdb.apache.org
Subject: repeated compaction timeouts cause the server to shut down temporarily when replication
is broken

Hello all,
we have encountered the following situation during an overnight load test.

We get the following message repeatedly in the couch logs:

** Reason for termination ==
** {compaction_loop_died,
       {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}



At one point we get it three times within an interval of 5 seconds, and I am guessing this is what
causes the supervisor to shut down temporarily:


[Thu, 20 Nov 2014 05:58:33 GMT] [error] [<0.93.0>] {error_report,<0.30.0>,
                       {<0.93.0>,supervisor_report,
                        [{supervisor,{local,couch_secondary_services}},
                         {errorContext,shutdown},
                         {reason,reached_max_restart_intensity},
                         {offender,
                             [{pid,<0.10114.14>},
                              {name,compaction_daemon},
                              {mfargs,{couch_compaction_daemon,start_link,[]}},
                              {restart_type,permanent},
                              {shutdown,brutal_kill},
                              {child_type,worker}]}]}}
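
For anyone who wants to check whether the errors cluster like this in their own logs, something along these lines should work. It is a rough sketch; the log path and the three-errors-in-5-seconds window are my assumptions for illustration, not CouchDB defaults I have verified.

# Rough sketch: scan couch.log for compaction_loop_died reports and flag any
# burst of three or more within 5 seconds. The log path and the 3-in-5-seconds
# window are assumptions for illustration, not verified CouchDB defaults.
import re
from datetime import datetime, timedelta

LOG_PATH = "/var/log/couchdb/couch.log"   # adjust for your install
STAMP = re.compile(r"^\[(\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}) GMT\]")

events = []
last_stamp = None
with open(LOG_PATH) as log:
    for line in log:
        match = STAMP.match(line)
        if match:
            last_stamp = datetime.strptime(match.group(1),
                                           "%a, %d %b %Y %H:%M:%S")
        # The termination reason line has no timestamp of its own, so reuse
        # the most recent one seen in the log.
        if "compaction_loop_died" in line and last_stamp is not None:
            events.append(last_stamp)

window = timedelta(seconds=5)
for i, start in enumerate(events):
    burst = [t for t in events[i:] if t <= start + window]
    if len(burst) >= 3:
        print("%d compaction_loop_died errors starting at %s"
              % (len(burst), start))
        break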



In this particular component load test the CouchDB peer is shut down, so replication is broken.
This means there are a lot of background processes that try to replicate and die, and we have a
thread that removes the failed replications and re-enables them (which is probably no longer a
good idea, since CouchDB now detects on its own when the peer comes back online).  I suspect
that this might be related.
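
For context, that cleanup thread does roughly the equivalent of the sketch below. This is an approximation of the idea, not our actual code, and it reuses the couch_request() helper and placeholder credentials from the earlier sketch: it looks for _replicator docs stuck in the "error" state, deletes them, and re-creates them without the server-added _replication_* fields so they are picked up again.

# Approximation of the cleanup job (reuses couch_request() and the credentials
# from the earlier sketch): delete _replicator docs stuck in the "error" state
# and re-create them without the server-added fields so the replications are
# picked up again.
def restart_failed_replications():
    rows = couch_request("GET", "/_replicator/_all_docs?include_docs=true")["rows"]
    for row in rows:
        doc = row["doc"]
        if doc["_id"].startswith("_design/"):
            continue
        if doc.get("_replication_state") != "error":
            continue
        # Remove the failed replication document ...
        couch_request("DELETE",
                      "/_replicator/%s?rev=%s" % (doc["_id"], doc["_rev"]))
        # ... and re-insert it stripped of _rev and the _replication_* fields.
        clean = dict((k, v) for k, v in doc.items()
                     if k != "_rev" and not k.startswith("_replication_"))
        couch_request("PUT", "/_replicator/" + doc["_id"], clean)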



In the Zenoss graphs we see a very significant spike in I/O reads/writes at that moment.



Thank you very much for your time, and any hint will be appreciated.
