incubator-couchdb-dev mailing list archives

From Peter Bengtson <pe...@peterbengtson.com>
Subject Entire CouchDB cluster crashes simultaneously
Date Fri, 05 Mar 2010 12:18:18 GMT
We have a cluster of servers. At the moment there are three servers, each having two separate
instances of CouchDB, like this:

	node0-couch1
	node0-couch2

	node1-couch1
	node1-couch2

	node2-couch1
	node2-couch2

All couch1 instances are set up to replicate continuously using bidirectional pull replication.
That is:

	node0-couch1	pulls from node1-couch1 and node2-couch1
	node1-couch1	pulls from node0-couch1 and node2-couch1
	node2-couch1	pulls from node0-couch1 and node1-couch1

On each node, couch1 and couch2 are set up to replicate each other continuously, again using
pull replication. Thus, the full replication topology is:

	node0-couch1	pulls from node1-couch1, node2-couch1, and node0-couch2
	node0-couch2	pulls from node0-couch1

	node1-couch1	pulls from node0-couch1, node2-couch1, and node1-couch2
	node1-couch2	pulls from node1-couch1

	node2-couch1	pulls from node0-couch1, node1-couch1, and node2-couch2
	node2-couch2	pulls from node2-couch1
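
For anyone who wants to reproduce the setup, the topology above can be sketched in Python as follows. This is only an illustration, not our actual configuration: the database name "logs", the host names, and the port numbers are assumptions made up for the example. The request bodies use the source/target/continuous fields of CouchDB's _replicate resource, on the assumption that each body is POSTed to the pulling instance so that the target is its local database.

```python
import json

NODES = ["node0", "node1", "node2"]
DB_NAME = "logs"                            # hypothetical database name
PORTS = {"couch1": 5984, "couch2": 5985}    # hypothetical ports per instance


def pull_sources(node, instance):
    """Return the instances that `node`-`instance` pulls from."""
    if instance == "couch2":
        # couch2 only pulls from couch1 on the same node.
        return [f"{node}-couch1"]
    # couch1 pulls from couch1 on every other node, plus the local couch2.
    peers = [f"{other}-couch1" for other in NODES if other != node]
    return peers + [f"{node}-couch2"]


def topology():
    """Map every instance name to the list of instances it pulls from."""
    return {
        f"{node}-{inst}": pull_sources(node, inst)
        for node in NODES
        for inst in ("couch1", "couch2")
    }


def db_url(instance_name):
    """Build a (hypothetical) database URL for an instance such as 'node0-couch1'."""
    node, inst = instance_name.split("-")
    return f"http://{node}:{PORTS[inst]}/{DB_NAME}"


def replication_requests():
    """For each pulling instance, list the JSON bodies to POST to its /_replicate."""
    return {
        target: [
            {"source": db_url(src), "target": DB_NAME, "continuous": True}
            for src in sources
        ]
        for target, sources in topology().items()
    }


if __name__ == "__main__":
    for target, bodies in replication_requests().items():
        for body in bodies:
            print(target, json.dumps(body))
```

This gives twelve continuous pull replications in total: three on each couch1 instance and one on each couch2 instance.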

No proxies are involved. In our staging system, all servers are on the same subnet.

The problem is that every night, the entire cluster dies. All instances of CouchDB crash,
and moreover they crash exactly simultaneously.

The data being replicated is very minimal at the moment - simple log text lines, no attachments.
The entire database being replicated is no more than a few megabytes in size.

The syslogs give no clue. The CouchDB logs are difficult to interpret unless you are an Erlang
programmer. If anyone would care to look at them, just let me know.

Any clues as to why this is happening? We're using 0.10.1 on Debian.

We are planning to build quite sophisticated cross-cluster job queue functionality on top of
CouchDB, but a situation like this suggests that CouchDB replication is currently too
unreliable to be of practical use, unless this is a known bug that has already been fixed.

Any pointers or ideas are most welcome.

	/ Peter Bengtson


