From: Robert Newson
To: user@couchdb.apache.org
Date: Fri, 5 Mar 2010 07:44:23 -0500
Subject: Re: Entire CouchDB cluster crashes simultaneously
Message-ID: <46aeb24f1003050444g6a2b29f8v2ed6abfa56a4de43@mail.gmail.com>
In-Reply-To: <8020EF80-7148-41DD-B96A-34C4F35B6A39@peterbengtson.com>

Can you include some of the log output? A coordinated failure like this points to external factors but log output will help in any case.

B.

On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson wrote:
> We have a cluster of servers. At the moment there are three servers, each having two separate instances of CouchDB, like this:
>
>         node0-couch1
>         node0-couch2
>
>         node1-couch1
>         node1-couch2
>
>         node2-couch1
>         node2-couch2
>
> All couch1 instances are set up to replicate continuously using bidirectional pull replication. That is:
>
>         node0-couch1    pulls from node1-couch1 and node2-couch1
>         node1-couch1    pulls from node0-couch1 and node2-couch1
>         node2-couch1    pulls from node0-couch1 and node1-couch1
>
> On each node, couch1 and couch2 are set up to replicate each other continuously, again using pull replication.
> Thus, the full replication topology is:
>
>         node0-couch1    pulls from node1-couch1, node2-couch1, and node0-couch2
>         node0-couch2    pulls from node0-couch1
>
>         node1-couch1    pulls from node0-couch1, node2-couch1, and node1-couch2
>         node1-couch2    pulls from node1-couch1
>
>         node2-couch1    pulls from node0-couch1, node1-couch1, and node2-couch2
>         node2-couch2    pulls from node2-couch1
>
> No proxies are involved. In our staging system, all servers are on the same subnet.
>
> The problem is that every night, the entire cluster dies. All instances of CouchDB crash, and moreover they crash exactly simultaneously.
>
> The data being replicated is very minimal at the moment - simple log text lines, no attachments. The entire database being replicated is no more than a few megabytes in size.
>
> The syslogs give no clue. The CouchDB logs are difficult to interpret unless you are an Erlang programmer. If anyone would care to look at them, just let me know.
>
> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>
> We are planning to build quite sophisticated transcluster job queue functionality on top of CouchDB, but of course a situation like this suggests that CouchDB replication currently is too unreliable to be of practical use, unless this is a known bug and/or a fixed one.
>
> Any pointers or ideas are most welcome.
>
>         / Peter Bengtson
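
For reference, the pull topology quoted above can be written out as data, which makes it easier to check that every instance has the replications it should. This is only a sketch under assumptions: the database name "logs", port 5984, and the use of the instance names as hostnames are placeholders that do not come from the original message. The request bodies use the source, target, and continuous fields of CouchDB's _replicate endpoint; nothing is actually sent.

```python
import json

NODES = ["node0", "node1", "node2"]  # node names as given in the quoted message


def pull_sources(node, instance):
    """Return the instances that the given node/instance pulls from."""
    if instance == "couch2":
        # couch2 only pulls from couch1 on the same node
        return [f"{node}-couch1"]
    # couch1 pulls from couch1 on every other node, plus its local couch2
    peers = [f"{other}-couch1" for other in NODES if other != node]
    return peers + [f"{node}-couch2"]


def replication_requests(db="logs", port=5984):
    """Build one (target instance, JSON body) pair per pull replication.

    db, port, and the hostname scheme are hypothetical placeholders;
    the original message specifies none of them.
    """
    requests = []
    for node in NODES:
        for instance in ("couch1", "couch2"):
            for source in pull_sources(node, instance):
                body = {
                    "source": f"http://{source}:{port}/{db}",  # remote side
                    "target": db,                              # local database
                    "continuous": True,
                }
                requests.append((f"{node}-{instance}", body))
    return requests


if __name__ == "__main__":
    for target, body in replication_requests():
        print(f"POST to {target} /_replicate: {json.dumps(body)}")
```

Each pair names the instance that would receive the POST, since in pull replication the request goes to the target and names the remote source.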