From: Robert Newson
To: user@couchdb.apache.org
Date: Fri, 5 Mar 2010 13:13:50 -0500
Subject: Re: Entire CouchDB cluster crashes simultaneously

fwiw: I use a cron job to establish continuous replication precisely
because continuous replications are not persistent. POST'ing to
_replicate with the same source and target is idempotent, so a cron job
that mindlessly resubmits all your replication tasks is harmless.

I go further, since I use pairs of servers: I read _all_dbs from the
other side and kick off a continuous pull replication task for each
database, and this runs every 5 minutes.

B.

On Fri, Mar 5, 2010 at 12:29 PM, Peter Bengtson wrote:
> After conferring with our sysadmins, I found out that there indeed was a
> backup task running nightly at approximately the time of the crashes. They
> have turned it off now. I'll let you know after the weekend how this
> affects the replication setup. Keeping my fingers crossed until then.
> Thanks!
>
>        / Peter
>
>
> On 5 Mar 2010, at 18:24, Adam Kocoloski wrote:
>
>> That would be my guess, too.
>>
>> On Mar 5, 2010, at 12:22 PM, Randall Leeds wrote:
>>
>>> Could there be a cron job that's causing a lot of disk contention at the
>>> same time every night?
>>>
>>> On Mar 5, 2010 7:24 AM, "Peter Bengtson" wrote:
>>>
>>> Adam, that's interesting. These crashes occur every night with alarming
>>> regularity, but the staging system on which this runs is under no load to
>>> speak of. And there are only two DBs in the system at this point, both of
>>> which were opened at least 12 hours earlier. I'll ask our sysadmins to
>>> double-check the load, but I'd like to know one thing:
>>>
>>> Why do these crashes occur system-wide? On three nodes and six servers? And
>>> at the same time? Somehow, we didn't quite expect that CouchDB should go
>>> quite so far as to replicate the crashes... ;-)
>>>
>>>      / Peter
>>>
>>>
>>> On 5 Mar 2010, at 15:57, Adam Kocoloski wrote:
>>>
>>>
>>>> From that log we can tell that CouchDB crashed completely on node0-couch2
>>> (because of the "Apache...
>>
>
>
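
For anyone who wants to try the cron approach described at the top of this message, here is a minimal sketch of what such a job might look like. It is not the actual script in use; the host names, the helper function names, and the choice to skip databases whose names start with "_" are all assumptions made for illustration. It relies only on the _all_dbs and _replicate endpoints mentioned above, and it assumes each target database already exists locally.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumed addresses (hypothetical); replace with your own pair of servers.
LOCAL = "http://127.0.0.1:5984"
PEER = "http://peer.example.com:5984"


def replication_body(peer_url, db_name):
    """Build the JSON body for a continuous pull replication of one database."""
    return {
        "source": f"{peer_url}/{quote(db_name, safe='')}",  # remote database URL
        "target": db_name,                                   # local database name
        "continuous": True,
    }


def list_peer_dbs(peer_url):
    """Fetch the list of database names from the peer's _all_dbs."""
    with urllib.request.urlopen(f"{peer_url}/_all_dbs") as resp:
        return json.load(resp)


def start_replication(local_url, body):
    """POST one replication request to the local server's _replicate."""
    req = urllib.request.Request(
        f"{local_url}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def main():
    for name in list_peer_dbs(PEER):
        if name.startswith("_"):
            continue  # assumption: leave system databases alone
        # Resubmitting the same source/target is harmless, as noted above.
        start_replication(LOCAL, replication_body(PEER, name))


if __name__ == "__main__":
    main()
```

Saved under some path of your choosing and run from cron every 5 minutes, this simply resubmits every pull replication each time; because the request is idempotent, tasks that are already running are unaffected and any that were lost after a restart are re-established.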