Subject: Re: Entire CouchDB cluster crashes simultaneously
From: Peter Bengtson
Date: Fri, 5 Mar 2010 14:15:02 +0100
To: user@couchdb.apache.org

The amount of logged data on the six servers is vast, but this is the crash message on node0-couch1. It's perhaps easier if I make the full log files available (give me a shout).
Here's the snippet:

[Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server <0.2092.0> terminating
** Last message in was {ibrowse_async_response,
                           {1267,713465,777255},
                           {error,connection_closed}}
** When Server state == {state,nil,nil,
                            [<0.2077.0>,
                             {http_db,
                                 "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
                                 [{"User-Agent","CouchDB/0.10.1"},
                                  {"Accept","application/json"},
                                  {"Accept-Encoding","gzip"}],
                                 [],get,nil,
                                 [{response_format,binary},
                                  {inactivity_timeout,30000}],
                                 10,500,nil},
                             251,
                             [{<<"continuous">>,true},
                              {<<"source">>,
                               <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
                              {<<"target">>,
                               <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
                            251,<0.2093.0>,
                            {1267,713465,777255},
                            false,0,<<>>,
                            {<0.2095.0>,#Ref<0.0.0.131534>},
** Reason for termination ==
** {error,connection_closed}

[Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server <0.2130.0> terminating
** Last message in was {ibrowse_async_response,
                           {1267,713465,843079},
                           {error,connection_closed}}
** When Server state == {state,nil,nil,
                            [<0.2106.0>,
                             {http_db,
                                 "http://couch2.staging.diino.com:5984/laplace_log_staging/",
                                 [{"User-Agent","CouchDB/0.10.1"},
                                  {"Accept","application/json"},
                                  {"Accept-Encoding","gzip"}],
                                 [],get,nil,
                                 [{response_format,binary},
                                  {inactivity_timeout,30000}],
                                 10,500,nil},
                             28136,
                             [{<<"continuous">>,true},
                              {<<"source">>,
                               <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
                              {<<"target">>,
                               <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
                            29086,<0.2131.0>,
                            {1267,713465,843079},
                            false,0,<<>>,
                            {<0.2133.0>,#Ref<0.0.5.183681>},
** Reason for termination ==
** {error,connection_closed}

On 5 Mar 2010, at 13:44, Robert Newson wrote:

> Can you include some of the log output?
>
> A coordinated failure like this points to external factors but log
> output will help in any case.
>
> B.
>
> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson wrote:
>> We have a cluster of servers. At the moment there are three servers, each having two separate instances of CouchDB, like this:
>>
>>     node0-couch1
>>     node0-couch2
>>
>>     node1-couch1
>>     node1-couch2
>>
>>     node2-couch1
>>     node2-couch2
>>
>> All couch1 instances are set up to replicate continuously using bidirectional pull replication. That is:
>>
>>     node0-couch1 pulls from node1-couch1 and node2-couch1
>>     node1-couch1 pulls from node0-couch1 and node2-couch1
>>     node2-couch1 pulls from node0-couch1 and node1-couch1
>>
>> On each node, couch1 and couch2 are set up to replicate each other continuously, again using pull replication. Thus, the full replication topology is:
>>
>>     node0-couch1 pulls from node1-couch1, node2-couch1, and node0-couch2
>>     node0-couch2 pulls from node0-couch1
>>
>>     node1-couch1 pulls from node0-couch1, node2-couch1, and node1-couch2
>>     node1-couch2 pulls from node1-couch1
>>
>>     node2-couch1 pulls from node0-couch1, node1-couch1, and node2-couch2
>>     node2-couch2 pulls from node2-couch1
>>
>> No proxies are involved. In our staging system, all servers are on the same subnet.
>>
>> The problem is that every night the entire cluster dies. All instances of CouchDB crash, and moreover they crash exactly simultaneously.
>>
>> The data being replicated is very minimal at the moment - simple log text lines, no attachments. The entire database being replicated is no more than a few megabytes in size.
>>
>> The syslogs give no clue.
>> The CouchDB logs are difficult to interpret unless you are an Erlang programmer. If anyone would care to look at them, just let me know.
>>
>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>
>> We are planning to build quite sophisticated cross-cluster job queue functionality on top of CouchDB, but of course a situation like this suggests that CouchDB replication is currently too unreliable to be of practical use, unless this is a known and/or already fixed bug.
>>
>> Any pointers or ideas are most welcome.
>>
>> / Peter Bengtson
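
For completeness, each "pulls from" arrow in the topology above is simply a continuous replication started on the node doing the pulling, via a POST to its /_replicate endpoint. Below is a minimal sketch in Python of one such call; the hostnames and database name are taken from the log snippet above, and credentials, retries and error handling are omitted, so treat it as illustrative only, not as our exact setup script.

#!/usr/bin/env python3
# Sketch: start one continuous pull replication on couch1 (the pulling node),
# pulling laplace_conf_staging from couch2. Hostnames and database name are
# taken from the crash logs above; adjust for your own environment.
import json
import urllib.request

PULLING_NODE = "http://couch1.staging.diino.com:5984"
SOURCE_DB = "http://couch2.staging.diino.com:5984/laplace_conf_staging"

body = json.dumps({
    "source": SOURCE_DB,               # remote database to pull from
    "target": "laplace_conf_staging",  # local database on the pulling node
    "continuous": True,                # keep replicating as changes arrive
}).encode("utf-8")

req = urllib.request.Request(
    PULLING_NODE + "/_replicate",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))

One such call is issued per database and per "pulls from" arrow, always on the pulling node; the generic servers terminating in the logs above appear to be the replication processes created by calls like this.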