From: Robert Newson
To: user@couchdb.apache.org
Date: Fri, 5 Mar 2010 08:22:52 -0500
Subject: Re: Entire CouchDB cluster crashes simultaneously
Message-ID: <46aeb24f1003050522p39b87c9fo55fac528b125e3f7@mail.gmail.com>
In-Reply-To: <3621FFB3-FA15-459A-8FA0-7845CE14DA0D@peterbengtson.com>

Is couchdb crashing or just the replication tasks?

On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson wrote:
> The amount of logged data on the six servers is vast, but this is the crash message on node0-couch1. It's perhaps easier if I make the full log files available (give me a shout).
> Here's the snippet:
>
> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server <0.2092.0> terminating
> ** Last message in was {ibrowse_async_response,
>                           {1267,713465,777255},
>                           {error,connection_closed}}
> ** When Server state == {state,nil,nil,
>                            [<0.2077.0>,
>                             {http_db,
>                                 "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>                                 [{"User-Agent","CouchDB/0.10.1"},
>                                  {"Accept","application/json"},
>                                  {"Accept-Encoding","gzip"}],
>                                 [],get,nil,
>                                 [{response_format,binary},
>                                  {inactivity_timeout,30000}],
>                                 10,500,nil},
>                             251,
>                             [{<<"continuous">>,true},
>                              {<<"source">>,
>                               <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>                              {<<"target">>,
>                               <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>                            251,<0.2093.0>,
>                            {1267,713465,777255},
>                            false,0,<<>>,
>                            {<0.2095.0>,#Ref<0.0.0.131534>},
> ** Reason for termination ==
> ** {error,connection_closed}
> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server <0.2130.0> terminating
> ** Last message in was {ibrowse_async_response,
>                           {1267,713465,843079},
>                           {error,connection_closed}}
> ** When Server state == {state,nil,nil,
>                            [<0.2106.0>,
>                             {http_db,
>                                 "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>                                 [{"User-Agent","CouchDB/0.10.1"},
>                                  {"Accept","application/json"},
>                                  {"Accept-Encoding","gzip"}],
>                                 [],get,nil,
>                                 [{response_format,binary},
>                                  {inactivity_timeout,30000}],
>                                 10,500,nil},
>                             28136,
>                             [{<<"continuous">>,true},
>                              {<<"source">>,
>                               <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>                              {<<"target">>,
>                               <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>                            29086,<0.2131.0>,
>                            {1267,713465,843079},
>                            false,0,<<>>,
>                            {<0.2133.0>,#Ref<0.0.5.183681>},
> ** Reason for termination ==
> ** {error,connection_closed}
>
>
>
> On 5 mar 2010, at 13.44, Robert Newson wrote:
>
>> Can you include some of the log output?
>>
>> A coordinated failure like this points to external factors but log
>> output will help in any case.
>>
>> B.
>>
>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson wrote:
>>> We have a cluster of servers. At the moment there are three servers, each having two separate instances of CouchDB, like this:
>>>
>>>        node0-couch1
>>>        node0-couch2
>>>
>>>        node1-couch1
>>>        node1-couch2
>>>
>>>        node2-couch1
>>>        node2-couch2
>>>
>>> All couch1 instances are set up to replicate continuously using bidirectional pull replication. That is:
>>>
>>>        node0-couch1    pulls from node1-couch1 and node2-couch1
>>>        node1-couch1    pulls from node0-couch1 and node2-couch1
>>>        node2-couch1    pulls from node0-couch1 and node1-couch1
>>>
>>> On each node, couch1 and couch2 are set up to replicate each other continuously, again using pull replication.
>>> Thus, the full replication topology is:
>>>
>>>        node0-couch1    pulls from node1-couch1, node2-couch1, and node0-couch2
>>>        node0-couch2    pulls from node0-couch1
>>>
>>>        node1-couch1    pulls from node0-couch1, node2-couch1, and node1-couch2
>>>        node1-couch2    pulls from node1-couch1
>>>
>>>        node2-couch1    pulls from node0-couch1, node1-couch1, and node2-couch2
>>>        node2-couch2    pulls from node2-couch1
>>>
>>> No proxies are involved. In our staging system, all servers are on the same subnet.
>>>
>>> The problem is that every night, the entire cluster dies. All instances of CouchDB crash, and moreover they crash exactly simultaneously.
>>>
>>> The data being replicated is very minimal at the moment - simple log text lines, no attachments. The entire database being replicated is no more than a few megabytes in size.
>>>
>>> The syslogs give no clue. The CouchDB logs are difficult to interpret unless you are an Erlang programmer. If anyone would care to look at them, just let me know.
>>>
>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>
>>> We are planning to build quite sophisticated transcluster job queue functionality on top of CouchDB, but of course a situation like this suggests that CouchDB replication currently is too unreliable to be of practical use, unless this is a known bug and/or a fixed one.
>>>
>>> Any pointers or ideas are most welcome.
>>>
>>>        / Peter Bengtson
>>>
>>>
>>>
>
>
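
For anyone trying to reproduce the setup, the pull topology described in the quoted message can be sketched roughly as below. This is only an illustration under assumptions: the host names (node0-couch1.example.com and so on), the port, and the helper names are hypothetical and do not come from the thread. The source/target/continuous fields mirror the replication properties visible in the quoted log, and the sketch assumes each pulling instance receives a POST to its own /_replicate resource. It only builds the requests, it does not send them.

```python
import json

# Node labels from the thread; the database name appears in the quoted log.
NODES = ["node0", "node1", "node2"]
DB = "laplace_log_staging"


def base_url(node, instance):
    # Hypothetical host naming and port, for illustration only.
    return f"http://{node}-{instance}.example.com:5984"


def pull_topology(nodes):
    """Return {target: [sources]} for the topology described in the thread."""
    topology = {}
    for node in nodes:
        # Each couch1 pulls from the couch1 on every other node...
        peers = [f"{other}-couch1" for other in nodes if other != node]
        # ...and from the couch2 on its own node.
        topology[f"{node}-couch1"] = peers + [f"{node}-couch2"]
        # Each couch2 pulls only from the couch1 on its own node.
        topology[f"{node}-couch2"] = [f"{node}-couch1"]
    return topology


def replication_requests(nodes, db):
    """Build (url, json_body) pairs, one per continuous pull replication."""
    requests = []
    for target, sources in pull_topology(nodes).items():
        t_node, t_inst = target.split("-")
        for source in sources:
            s_node, s_inst = source.split("-")
            body = {
                "source": f"{base_url(s_node, s_inst)}/{db}",
                "target": f"{base_url(t_node, t_inst)}/{db}",
                "continuous": True,
            }
            # The pulling (target) instance is the one asked to replicate.
            url = f"{base_url(t_node, t_inst)}/_replicate"
            requests.append((url, json.dumps(body)))
    return requests


if __name__ == "__main__":
    for url, body in replication_requests(NODES, DB):
        print("POST", url, body)
```

With three nodes this yields twelve continuous replications in total: three per couch1 instance and one per couch2 instance.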