incubator-couchdb-user mailing list archives

From Robert Newson
Subject Re: Entire CouchDB cluster crashes simultaneously
Date Fri, 05 Mar 2010 18:13:50 GMT
FWIW: I use a cron job to establish continuous replication precisely
because continuous replications are not persistent. POSTing to
_replicate with the same source and target is idempotent, so a cron job
that mindlessly resubmits all your replication tasks is harmless.

I go further: since I use pairs of servers, each one reads _all_dbs from
the other side and kicks off a continuous pull replication task for each
database. This runs every 5 minutes.
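
The approach above can be sketched roughly as follows. This is only a minimal illustration, not the actual script: the URLs in LOCAL and PEER, the function names, and the choice to skip databases whose names start with "_" are all assumptions, and it presumes each target database already exists on the local node. It relies only on GET /_all_dbs and POST /_replicate with "continuous": true.

```python
import json
import urllib.parse
import urllib.request

LOCAL = "http://localhost:5984"    # assumed URL of the local node
PEER = "http://peer.example:5984"  # hypothetical URL of the paired node


def build_pull_tasks(peer_url, db_names):
    """Build one _replicate request body per database found on the peer."""
    tasks = []
    for name in db_names:
        if name.startswith("_"):
            continue  # assumption: leave system databases alone
        # Database names may contain "/", so escape them in the URL.
        source = peer_url.rstrip("/") + "/" + urllib.parse.quote(name, safe="")
        tasks.append({"source": source, "target": name, "continuous": True})
    return tasks


def fetch_all_dbs(peer_url):
    """Return the list of database names reported by the peer."""
    with urllib.request.urlopen(peer_url.rstrip("/") + "/_all_dbs") as resp:
        return json.load(resp)


def submit(local_url, task):
    """POST one replication task to the local node's _replicate endpoint."""
    req = urllib.request.Request(
        local_url.rstrip("/") + "/_replicate",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def main():
    # Resubmitting the same source/target pair is harmless, so no state
    # is kept between runs.
    for task in build_pull_tasks(PEER, fetch_all_dbs(PEER)):
        submit(LOCAL, task)
```

To use it, call main() at the end of the script and schedule it from cron, for example with a "*/5 * * * *" entry for the 5 minute interval. Running the mirror-image script on the other server of the pair gives pull replication in both directions.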


On Fri, Mar 5, 2010 at 12:29 PM, Peter Bengtson wrote:
> After conferring with our sysadmins, I found out that there indeed was a backup task
> running nightly at approximately the time of the crashes. They have turned it off now. I'll
> let you know after the weekend how this affects the replication setup. Keeping my fingers
> crossed until then. Thanks!
>        / Peter
> On 5 Mar 2010, at 18:24, Adam Kocoloski wrote:
>> That would be my guess, too.
>> On Mar 5, 2010, at 12:22 PM, Randall Leeds wrote:
>>> Could there be a cron job that's causing a lot of disk contention at the
>>> same time every night?
>>> On Mar 5, 2010 7:24 AM, "Peter Bengtson" wrote:
>>> Adam, that's interesting. These crashes occur every night with alarming
>>> regularity, but the staging system on which this runs is under no load to
>>> speak of. And there are only two DBs in the system at this point, both of
>>> which were opened at least 12 hours earlier. I'll ask our sysadmins to
>>> double-check the load, but I'd like to know one thing:
>>> Why do these crashes occur system-wide? On three nodes and six servers? And
>>> at the same time? Somehow, we didn't quite expect that CouchDB should go
>>> quite so far as to replicate the crashes... ;-)
>>>      / Peter
>>> On 5 Mar 2010, at 15:57, Adam Kocoloski wrote:
>>>> From that log we can tell that CouchDB crashed completely on node0-couch2
>>> (because of the "Apache...
