From: Peter Bengtson
Subject: Entire CouchDB cluster crashes simultaneously
Date: Fri, 5 Mar 2010 13:18:18 +0100
To: user@couchdb.apache.org, dev@couchdb.apache.org

We have a cluster of servers. At the moment there are three servers, each running two separate instances of CouchDB, like this:

    node0-couch1
    node0-couch2
    node1-couch1
    node1-couch2
    node2-couch1
    node2-couch2

All couch1 instances are set up to replicate continuously using bidirectional pull replication. That is:

    node0-couch1 pulls from node1-couch1 and node2-couch1
    node1-couch1 pulls from node0-couch1 and node2-couch1
    node2-couch1 pulls from node0-couch1 and node1-couch1

On each node, couch1 and couch2 are also set up to replicate each other continuously, again using pull replication. Thus, the full replication topology is:

    node0-couch1 pulls from node1-couch1, node2-couch1, and node0-couch2
    node0-couch2 pulls from node0-couch1
    node1-couch1 pulls from node0-couch1, node2-couch1, and node1-couch2
    node1-couch2 pulls from node1-couch1
    node2-couch1 pulls from node0-couch1, node1-couch1, and node2-couch2
    node2-couch2 pulls from node2-couch1

No proxies are involved. In our staging system, all servers are on the same subnet.

The problem is that every night the entire cluster dies. All instances of CouchDB crash, and moreover they crash exactly simultaneously.

The data being replicated is very minimal at the moment: simple log text lines, no attachments. The entire database being replicated is no more than a few megabytes in size.

The syslogs give no clue. The CouchDB logs are difficult to interpret unless you are an Erlang programmer; if anyone would care to look at them, just let me know.

Any clues as to why this is happening? We're using CouchDB 0.10.1 on Debian.
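For reference, each of the pull replications above is started with a POST to _replicate on the pulling instance, along these lines (the database name "jobs" and the hostname are just placeholders here, not our actual configuration):

    curl -X POST http://localhost:5984/_replicate \
         -H "Content-Type: application/json" \
         -d '{"source":"http://node1-couch1:5984/jobs","target":"jobs","continuous":true}'

node0-couch1, for example, runs three such continuous replications, one per source listed above.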
We are planning to build fairly sophisticated cross-cluster job queue functionality on top of CouchDB, but a situation like this suggests that CouchDB replication is currently too unreliable to be of practical use, unless this is a known and/or already fixed bug. Any pointers or ideas are most welcome.

/ Peter Bengtson