Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
MIME-Version: 1.0
In-Reply-To: <8020EF80-7148-41DD-B96A-34C4F35B6A39@peterbengtson.com>
References: <8020EF80-7148-41DD-B96A-34C4F35B6A39@peterbengtson.com>
Date: Fri, 5 Mar 2010 07:44:23 -0500
Message-ID: <46aeb24f1003050444g6a2b29f8v2ed6abfa56a4de43@mail.gmail.com>
Subject: Re: Entire CouchDB cluster crashes simultaneously
From: Robert Newson <robert.newson@gmail.com>
To: user@couchdb.apache.org
Content-Type: text/plain; charset=ISO-8859-1

Can you include some of the log output?

A coordinated failure like this points to external factors, but log
output will help in any case.

B.

On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <peter@peterbengtson.com> wrote:
> We have a cluster of servers. At the moment there are three servers, each having two separate instances of CouchDB, like this:
>
>         node0-couch1
>         node0-couch2
>
>         node1-couch1
>         node1-couch2
>
>         node2-couch1
>         node2-couch2
>
> All couch1 instances are set up to replicate continuously using bidirectional pull replication. That is:
>
>         node0-couch1    pulls from node1-couch1 and node2-couch1
>         node1-couch1    pulls from node0-couch1 and node2-couch1
>         node2-couch1    pulls from node0-couch1 and node1-couch1
>
> On each node, couch1 and couch2 are set up to replicate each other continuously, again using pull replication. Thus, the full replication topology is:
>
>         node0-couch1    pulls from node1-couch1, node2-couch1, and node0-couch2
>         node0-couch2    pulls from node0-couch1
>
>         node1-couch1    pulls from node0-couch1, node2-couch1, and node1-couch2
>         node1-couch2    pulls from node1-couch1
>
>         node2-couch1    pulls from node0-couch1, node1-couch1, and node2-couch2
>         node2-couch2    pulls from node2-couch1
>
> No proxies are involved. In our staging system, all servers are on the same subnet.
>
> The problem is that every night, the entire cluster dies. All instances of CouchDB crash, and moreover they crash exactly simultaneously.
>
> The data being replicated is very minimal at the moment - simple log text lines, no attachments. The entire database being replicated is no more than a few megabytes in size.
>
> The syslogs give no clue. The CouchDB logs are difficult to interpret unless you are an Erlang programmer. If anyone would care to look at them, just let me know.
>
> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>
> We are planning to build quite sophisticated transcluster job queue functionality on top of CouchDB, but of course a situation like this suggests that CouchDB replication currently is too unreliable to be of practical use, unless this is a known bug and/or a fixed one.
>
> Any pointers or ideas are most welcome.
>
>         / Peter Bengtson
>
>
>
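
For reference, the pull topology described in the quoted message could be written down roughly as follows. This is only a sketch, not the poster's actual setup: the function names, the base URL, and the database name "logs" are hypothetical. The request body assumes CouchDB's POST /_replicate endpoint with the "continuous" flag, sent to the instance doing the pulling.

```python
import json

NODES = ["node0", "node1", "node2"]  # node names as given in the quoted message


def pull_sources(node, instance, nodes=NODES):
    """Return the instances that `<node>-<instance>` pulls from."""
    if instance == "couch1":
        # couch1 pulls from every other node's couch1, plus its local couch2.
        peers = [f"{n}-couch1" for n in nodes if n != node]
        return peers + [f"{node}-couch2"]
    # couch2 only pulls from the couch1 on the same node.
    return [f"{node}-couch1"]


def topology(nodes=NODES):
    """Map each instance name to the list of instances it pulls from."""
    return {
        f"{n}-{i}": pull_sources(n, i, nodes)
        for n in nodes
        for i in ("couch1", "couch2")
    }


def replication_body(source_base_url, db):
    """JSON body for POST /_replicate on the pulling (target) instance.

    `source_base_url` and `db` are placeholders, not values from the thread.
    """
    return json.dumps(
        {"source": f"{source_base_url}/{db}", "target": db, "continuous": True}
    )


if __name__ == "__main__":
    for target, sources in topology().items():
        print(target, "pulls from", ", ".join(sources))
```

Every couch1 instance ends up with three pull sources and every couch2 with one, so the cluster runs twelve continuous replications in total.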