Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Date: Fri, 5 Mar 2010 13:13:50 -0500
Message-ID: <46aeb24f1003051013s2826663ck2cb6b934aebfdde6@mail.gmail.com>
Subject: Re: Entire CouchDB cluster crashes simultaneously
From: Robert Newson <robert.newson@gmail.com>
To: user@couchdb.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

fwiw: I use a cron job to establish continuous replications precisely
because they are not persistent. POSTing to _replicate with the same
source and target is idempotent, so a cron job that mindlessly
resubmits all your replication tasks is harmless.

I go further: since I use pairs of servers, the job reads _all_dbs from
the other side and kicks off a continuous pull replication task for each
database. It runs every 5 minutes.
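
For anyone who wants to try the same thing, here is a rough sketch of what
such a job might look like in Python. It is not the actual script; the two
server addresses, the function names, and the choice to skip databases whose
names start with an underscore are all assumptions for illustration. It also
assumes each target database already exists on the local side.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical addresses; substitute your own pair of servers.
LOCAL = "http://localhost:5984"
PEER = "http://peer.example.com:5984"


def list_dbs(server):
    """Return the database names reported by GET /_all_dbs."""
    with urllib.request.urlopen(server + "/_all_dbs") as resp:
        return json.load(resp)


def pull_task(peer, db):
    """Build the _replicate body for a continuous pull of one database."""
    # Database names may contain "/", so escape the name in the source URL.
    source = peer + "/" + urllib.parse.quote(db, safe="")
    return {"source": source, "target": db, "continuous": True}


def start_replication(local, task):
    """POST one replication task to the local _replicate endpoint."""
    req = urllib.request.Request(
        local + "/_replicate",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def main():
    """Resubmit a continuous pull for every database on the peer."""
    for db in list_dbs(PEER):
        if db.startswith("_"):
            continue  # assumption: leave system databases alone
        start_replication(LOCAL, pull_task(PEER, db))
```

To use it, add a call to main() at the bottom of the file and run the file
from cron every 5 minutes. Since resubmitting the same source and target is
idempotent, the script does not need to track what is already running.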

B.

On Fri, Mar 5, 2010 at 12:29 PM, Peter Bengtson <peter@peterbengtson.com> wrote:
> After conferring with our sysadmins, I found out that there indeed was a
> backup task running nightly at approximately the time of the crashes. They
> have turned it off now. I'll let you know after the weekend how this affects
> the replication setup. Keeping my fingers crossed until then. Thanks!
>
>        / Peter
>
>
> On 5 Mar 2010, at 18:24, Adam Kocoloski wrote:
>
>> That would be my guess, too.
>>
>> On Mar 5, 2010, at 12:22 PM, Randall Leeds wrote:
>>
>>> Could there be a cron job that's causing a lot of disk contention at the
>>> same time every night?
>>>
>>> On Mar 5, 2010 7:24 AM, "Peter Bengtson" <peter@peterbengtson.com> wrote:
>>>
>>> Adam, that's interesting. These crashes occur every night with alarming
>>> regularity, but the staging system on which this runs is under no load to
>>> speak of. And there are only two DBs in the system at this point, both of
>>> which were opened at least 12 hours earlier. I'll ask our sysadmins to
>>> double-check the load, but I'd like to know one thing:
>>>
>>> Why do these crashes occur system-wide? On three nodes and six servers? And
>>> at the same time? Somehow, we didn't quite expect that CouchDB should go
>>> quite so far as to replicate the crashes... ;-)
>>>
>>>      / Peter
>>>
>>>
>>> On 5 Mar 2010, at 15:57, Adam Kocoloski wrote:
>>>
>>>
>>>> From that log we can tell that CouchDB crashed completely on node0-couch2
>>> (because of the "Apache...
>>
>
>