From: Peter Bengtson
Subject: Entire CouchDB cluster crashes simultaneously
Date: Fri, 5 Mar 2010 13:18:18 +0100
To: user@couchdb.apache.org, dev@couchdb.apache.org

We have a cluster of servers. At the moment there are three servers, each running two separate instances of CouchDB, like this:

    node0-couch1
    node0-couch2
    node1-couch1
    node1-couch2
    node2-couch1
    node2-couch2

All couch1 instances are set up to replicate continuously using bidirectional pull replication. That is:

    node0-couch1 pulls from node1-couch1 and node2-couch1
    node1-couch1 pulls from node0-couch1 and node2-couch1
    node2-couch1 pulls from node0-couch1 and node1-couch1

On each node, couch1 and couch2 are also set up to replicate each other continuously, again using pull replication. Thus, the full replication topology is:

    node0-couch1 pulls from node1-couch1, node2-couch1, and node0-couch2
    node0-couch2 pulls from node0-couch1
    node1-couch1 pulls from node0-couch1, node2-couch1, and node1-couch2
    node1-couch2 pulls from node1-couch1
    node2-couch1 pulls from node0-couch1, node1-couch1, and node2-couch2
    node2-couch2 pulls from node2-couch1

No proxies are involved. In our staging system, all servers are on the same subnet.

The problem is that every night the entire cluster dies. All instances of CouchDB crash, and moreover they crash exactly simultaneously.

The data being replicated is very minimal at the moment: simple log text lines, no attachments. The entire database being replicated is no more than a few megabytes in size.

The syslogs give no clue. The CouchDB logs are difficult to interpret unless you are an Erlang programmer; if anyone would care to look at them, just let me know.

Any clues as to why this is happening? We're using CouchDB 0.10.1 on Debian.
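For reference, each of the pull replications above is started with a POST to _replicate on the pulling instance, along these lines (the database name "jobs" and the hostname are just placeholders here, not our actual configuration):

    curl -X POST http://localhost:5984/_replicate \
         -H "Content-Type: application/json" \
         -d '{"source":"http://node1-couch1:5984/jobs","target":"jobs","continuous":true}'

node0-couch1, for example, runs three such continuous replications, one per source listed above.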
We are planning to build fairly sophisticated cross-cluster job queue functionality on top of CouchDB, but a situation like this suggests that CouchDB replication is currently too unreliable to be of practical use, unless this is a known and/or already fixed bug. Any pointers or ideas are most welcome.

/ Peter Bengtson