Subject: Re: Entire CouchDB cluster crashes simultaneously
From: Peter Bengtson
Date: Fri, 5 Mar 2010 14:15:02 +0100
To: user@couchdb.apache.org

The amount of logged data on the six servers is vast, but this is the crash message on node0-couch1. It's perhaps easier if I make the full log files available (give me a shout).
Here's the snippet:

[Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server <0.2092.0> terminating
** Last message in was {ibrowse_async_response,
                           {1267,713465,777255},
                           {error,connection_closed}}
** When Server state == {state,nil,nil,
                            [<0.2077.0>,
                             {http_db,
                                 "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
                                 [{"User-Agent","CouchDB/0.10.1"},
                                  {"Accept","application/json"},
                                  {"Accept-Encoding","gzip"}],
                                 [],get,nil,
                                 [{response_format,binary},
                                  {inactivity_timeout,30000}],
                                 10,500,nil},
                             251,
                             [{<<"continuous">>,true},
                              {<<"source">>,
                               <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
                              {<<"target">>,
                               <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
                            251,<0.2093.0>,
                            {1267,713465,777255},
                            false,0,<<>>,
                            {<0.2095.0>,#Ref<0.0.0.131534>},
** Reason for termination ==
** {error,connection_closed}

[Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server <0.2130.0> terminating
** Last message in was {ibrowse_async_response,
                           {1267,713465,843079},
                           {error,connection_closed}}
** When Server state == {state,nil,nil,
                            [<0.2106.0>,
                             {http_db,
                                 "http://couch2.staging.diino.com:5984/laplace_log_staging/",
                                 [{"User-Agent","CouchDB/0.10.1"},
                                  {"Accept","application/json"},
                                  {"Accept-Encoding","gzip"}],
                                 [],get,nil,
                                 [{response_format,binary},
                                  {inactivity_timeout,30000}],
                                 10,500,nil},
                             28136,
                             [{<<"continuous">>,true},
                              {<<"source">>,
                               <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
                              {<<"target">>,
                               <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
                            29086,<0.2131.0>,
                            {1267,713465,843079},
                            false,0,<<>>,
                            {<0.2133.0>,#Ref<0.0.5.183681>},
** Reason for termination ==
** {error,connection_closed}

On 5 Mar 2010, at 13:44, Robert Newson wrote:

> Can you include some of the log output?
>
> A coordinated failure like this points to external factors but log
> output will help in any case.
>
> B.
>
> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson wrote:
>> We have a cluster of servers. At the moment there are three servers, each having two separate instances of CouchDB, like this:
>>
>>     node0-couch1
>>     node0-couch2
>>
>>     node1-couch1
>>     node1-couch2
>>
>>     node2-couch1
>>     node2-couch2
>>
>> All couch1 instances are set up to replicate continuously using bidirectional pull replication. That is:
>>
>>     node0-couch1 pulls from node1-couch1 and node2-couch1
>>     node1-couch1 pulls from node0-couch1 and node2-couch1
>>     node2-couch1 pulls from node0-couch1 and node1-couch1
>>
>> On each node, couch1 and couch2 are set up to replicate each other continuously, again using pull replication. Thus, the full replication topology is:
>>
>>     node0-couch1 pulls from node1-couch1, node2-couch1, and node0-couch2
>>     node0-couch2 pulls from node0-couch1
>>
>>     node1-couch1 pulls from node0-couch1, node2-couch1, and node1-couch2
>>     node1-couch2 pulls from node1-couch1
>>
>>     node2-couch1 pulls from node0-couch1, node1-couch1, and node2-couch2
>>     node2-couch2 pulls from node2-couch1
>>
>> No proxies are involved. In our staging system, all servers are on the same subnet.
>>
>> The problem is that every night the entire cluster dies. All instances of CouchDB crash, and moreover they crash exactly simultaneously.
>>
>> The data being replicated is very minimal at the moment - simple log text lines, no attachments. The entire database being replicated is no more than a few megabytes in size.
>>
>> The syslogs give no clue.
>> The CouchDB logs are difficult to interpret unless you are an Erlang programmer. If anyone would care to look at them, just let me know.
>>
>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>
>> We are planning to build quite sophisticated cross-cluster job queue functionality on top of CouchDB, but of course a situation like this suggests that CouchDB replication is currently too unreliable to be of practical use, unless this is a known and/or already fixed bug.
>>
>> Any pointers or ideas are most welcome.
>>
>> / Peter Bengtson
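
For completeness, each "pulls from" arrow in the topology above is simply a continuous replication started on the node doing the pulling, via a POST to its /_replicate endpoint. Below is a minimal sketch in Python of one such call; the hostnames and database name are taken from the log snippet above, and credentials, retries and error handling are omitted, so treat it as illustrative only, not as our exact setup script.

#!/usr/bin/env python3
# Sketch: start one continuous pull replication on couch1 (the pulling node),
# pulling laplace_conf_staging from couch2. Hostnames and database name are
# taken from the crash logs above; adjust for your own environment.
import json
import urllib.request

PULLING_NODE = "http://couch1.staging.diino.com:5984"
SOURCE_DB = "http://couch2.staging.diino.com:5984/laplace_conf_staging"

body = json.dumps({
    "source": SOURCE_DB,               # remote database to pull from
    "target": "laplace_conf_staging",  # local database on the pulling node
    "continuous": True,                # keep replicating as changes arrive
}).encode("utf-8")

req = urllib.request.Request(
    PULLING_NODE + "/_replicate",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))

One such call is issued per database and per "pulls from" arrow, always on the pulling node; the generic servers terminating in the logs above appear to be the replication processes created by calls like this.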