couchdb-dev mailing list archives

From Randall Leeds <>
Subject Re: Entire CouchDB cluster crashes simultaneously
Date Fri, 05 Mar 2010 18:52:23 GMT

The couch_server:open call failing makes me wonder if the timeout there
should be configurable. I plan to patch the replicator's changes
timeout to be configurable and maybe I'll do this while I'm at it
unless someone gets there first.

The "Uncaught error in HTTP request" is actually caught (ha!) at the
place where this message is printed and sent back as a JSON object to
the client. Now, I don't entirely understand the supervision tree with
the [daemons] config and whatnot, but it seems to me like a socket
error could cause the httpd process to try to send the error response
over a socket that is already in an error or closed state.

I'm moving this over to dev@ as well.
Can someone comment on the following one-line patch? I think it should
be harmless and might keep the server from restarting in these kinds
of situations.

From 4383cd7af22e56e287f027aceb130363a8aac940 Mon Sep 17 00:00:00 2001
From: Randall Leeds <>
Date: Fri, 5 Mar 2010 10:25:25 -0800
Subject: [PATCH] Catch socket errors on error responses.

 src/couchdb/couch_httpd.erl |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/src/couchdb/couch_httpd.erl b/src/couchdb/couch_httpd.erl
index b25242f..3dcf699 100644
--- a/src/couchdb/couch_httpd.erl
+++ b/src/couchdb/couch_httpd.erl
@@ -676,7 +676,7 @@ send_error(Req, Code, ErrorStr, ReasonStr) ->
     send_error(Req, Code, [], ErrorStr, ReasonStr).

 send_error(Req, Code, Headers, ErrorStr, ReasonStr) ->
-    send_json(Req, Code, Headers,
+    catch send_json(Req, Code, Headers,
         {[{<<"error">>,  ErrorStr},
          {<<"reason">>, ReasonStr}]}).
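
For readers less familiar with Erlang's `catch`, here is a minimal sketch of the effect the patch is after. It is not CouchDB code: the module name `catch_sketch`, the one-argument `send_json/1`, and the atoms `open_socket` and `closed_socket` are hypothetical stand-ins, and the assumption that a failed socket write surfaces as an exit in the calling process is only an assumption about how the real send path behaves.

```erlang
-module(catch_sketch).
-export([send_error/1, demo/0]).

%% Hypothetical stand-in for couch_httpd:send_json/4 (not the real function).
%% Assumption: writing to a dead socket makes the calling process exit.
send_json(open_socket) ->
    {ok, sent};
send_json(closed_socket) ->
    exit(closed).

%% Same shape as the patched send_error/5: wrapping the call in `catch`
%% turns the exit into an ordinary return value instead of letting it
%% terminate the calling process.
send_error(Socket) ->
    catch send_json(Socket).

demo() ->
    %% Normal case: the result passes through `catch` unchanged.
    {ok, sent} = send_error(open_socket),
    %% Failure case: exit(closed) is returned as {'EXIT', closed}.
    {'EXIT', closed} = send_error(closed_socket),
    ok.
```

One consequence of this approach is that the function can now return `{'EXIT', Reason}` where it previously returned only the normal response value, so any caller that pattern-matches on the result would need to tolerate that.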


On Fri, Mar 5, 2010 at 09:29, Peter Bengtson <> wrote:
> After conferring with our sysadmins, I found out that there indeed was a
> backup task running nightly at approximately the time of the crashes. They
> have turned it off now. I'll let you know after the weekend how this affects
> the replication setup. Keeping my fingers crossed until then. Thanks!
>        / Peter
> 5 mar 2010 kl. 18.24 skrev Adam Kocoloski:
>> That would be my guess, too.
>> On Mar 5, 2010, at 12:22 PM, Randall Leeds wrote:
>>> Could there be a cron job that's causing a lot of disk contention at the
>>> same time every night?
>>> On Mar 5, 2010 7:24 AM, "Peter Bengtson" <> wrote:
>>> Adam, that's interesting. These crashes occur every night with alarming
>>> regularity, but the staging system on which this runs is under no load to
>>> speak about. And there are only two DBs in the system at this point, both of
>>> which were opened at least 12 hours earlier. I'll ask our sysadmins to
>>> double-check the load, but I'd like to know one thing:
>>> Why do these crashes occur system-wide? On three nodes and six servers? And
>>> at the same time? Somehow, we didn't quite expect that CouchDB should go
>>> quite so far as to replicate the crashes... ;-)
>>>      / Peter
>>> 5 mar 2010 kl. 15.57 skrev Adam Kocoloski:
>>>> From that log we can tell that CouchDB crashed completely on node0-couch2
>>> (because of the "Apache...
