couchdb-dev mailing list archives

From "Ciprian Trusca (JIRA)" <j...@apache.org>
Subject [jira] [Created] (COUCHDB-2496) compaction repeated timeouts causes the server to shutdown temporary when replication is broken
Date Mon, 08 Dec 2014 13:02:12 GMT
Ciprian Trusca created COUCHDB-2496:
---------------------------------------

             Summary: compaction repeated timeouts causes the server to shutdown temporary
when replication is broken
                 Key: COUCHDB-2496
                 URL: https://issues.apache.org/jira/browse/COUCHDB-2496
             Project: CouchDB
          Issue Type: Bug
      Security Level: public (Regular issues)
          Components: Database Core
            Reporter: Ciprian Trusca


We have the following setup:

* two CouchDB machines with replication enabled between them 
* a watchdog running every 5 minutes, which verifies the status of the _replicator documents.
If one of those documents has _replication_state = error, the watchdog deletes it and creates
a new one with the exact same parameters.
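The watchdog logic described above can be sketched roughly as follows. This is only an illustration, not the actual watchdog: the function name `plan_watchdog_actions` is hypothetical, and it assumes that "the exact same parameters" means reusing the same `_id` and every user-supplied field while dropping `_rev` and the server-set `_replication_*` fields.

```python
def plan_watchdog_actions(docs):
    """Return (to_delete, to_create) for _replicator docs in the error state.

    Hypothetical helper: `docs` is a list of dicts as they would be read
    from the _replicator database.
    """
    to_delete, to_create = [], []
    for doc in docs:
        if doc.get("_replication_state") != "error":
            continue  # healthy or still-running replication, leave it alone
        # A delete needs both the id and the current revision.
        to_delete.append((doc["_id"], doc["_rev"]))
        # Assumption: recreate with the same _id and all non-underscore fields.
        fresh = {k: v for k, v in doc.items()
                 if k == "_id" or not k.startswith("_")}
        to_create.append(fresh)
    return to_delete, to_create
```

Every pass over a failing document writes a deletion and a new document, so with one peer down the _replicator file keeps growing on each 5-minute run.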

For this test, one CouchDB machine is shut down so the watchdog will continuously recreate
the _replicator documents, and that will cause the _replicator database to get fragmented.
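As a rough guide, fragmentation here means the share of the database file that no longer holds live data. The sketch below estimates it from the `disk_size` and `data_size` fields of the database info document. The function names are hypothetical, the 70% default is the threshold reported in couch.log for this setup, and treating the threshold as a strict "greater than" comparison is an assumption.

```python
def fragmentation_percent(disk_size, data_size):
    """Estimate fragmentation as the percentage of the file not holding live data."""
    if disk_size <= 0:
        return 0
    return round((disk_size - data_size) / disk_size * 100)


def over_threshold(db_info, threshold_percent=70):
    """True if the estimated fragmentation exceeds the threshold (assumed strict)."""
    frag = fragmentation_percent(db_info["disk_size"], db_info["data_size"])
    return frag > threshold_percent
```

For example, a 1000-byte file with 250 bytes of live data is about 75% fragmented and would be flagged for compaction under these assumptions.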


Several times couch.log states that this database is fragmented over the 70% threshold,
but there is no evidence that compaction of the _replicator database ever starts.
Instead, after some time we get the following error:
{code}
** Reason for termination ==
 ** {compaction_loop_died, 
       {timeout,{gen_server,call,[<0.117.0>,start_compact]}}}
{code}

The worst part is that, from time to time, the error appears several times within a short interval
(e.g. 3 times in 60 seconds), and this causes the whole CouchDB server to crash with:
{code}
[error] [<0.93.0>] {error_report,<0.30.0>,
                       {<0.93.0>,supervisor_report,
                        [{supervisor,{local,couch_secondary_services}},
                         {errorContext,shutdown},
                         {reason,reached_max_restart_intensity},
                         {offender,
                             [{pid,<0.10114.14>},
                              {name,compaction_daemon},
                              {mfargs,{couch_compaction_daemon,start_link,[]}},
                              {restart_type,permanent},
                              {shutdown,brutal_kill},
                              {child_type,worker}]}]}}
{code}
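The `reached_max_restart_intensity` reason means the supervisor saw more restarts of the child than it tolerates within its configured period, so it gave up and shut down. A simplified sliding-window sketch of that rule is shown below. The function name is hypothetical, the limits used in the example are illustrative and are not the values configured for `couch_secondary_services`, and the exact boundary handling in OTP may differ.

```python
from collections import deque


def restart_limit_hit_at(crash_times, max_restarts, period_seconds):
    """Return the crash time at which the supervisor would give up, or None.

    Simplified model: the supervisor gives up once more than `max_restarts`
    restarts fall within any window of `period_seconds`.
    """
    window = deque()
    for t in sorted(crash_times):
        window.append(t)
        # Forget restarts that are older than the period.
        while t - window[0] > period_seconds:
            window.popleft()
        if len(window) > max_restarts:
            return t
    return None
```

With an illustrative limit of 2 restarts per 60 seconds, crashes at 0, 20 and 40 seconds trip the limit on the third crash, while the same three crashes spread 100 seconds apart do not.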
 
All subsequent requests to CouchDB are then refused for a period of time (we measured
between 3 and 50 minutes).

Because this is a heavy load test, we isolated CouchDB on a ramdisk to make sure that
this is not a disk usage problem, but the error persists.

We are running CouchDB 1.6.1 on a CentOS 6.4 machine.
Please let me know if additional information is required. 
Thank you.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
