incubator-couchdb-user mailing list archives

From Robert Newson <robert.new...@gmail.com>
Subject Re: Entire CouchDB cluster crashes simultaneously
Date Fri, 05 Mar 2010 13:22:52 GMT
Is couchdb crashing or just the replication tasks?
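
One quick way to tell those two cases apart the next time it happens is to hit each
instance's root URL and its /_active_tasks resource: a CouchDB that is still up answers
GET / with its welcome JSON, and /_active_tasks shows whether any replications are still
running. A minimal sketch in Python; the host list is illustrative, modelled on the
staging hostnames in the log below:

    import json
    import urllib.request

    # Illustrative host list, modelled on the staging hostnames in the log below.
    HOSTS = [
        "http://couch1.staging.diino.com:5984",
        "http://couch2.staging.diino.com:5984",
    ]

    def check(base_url):
        """Report whether the server answers at all, and what tasks it is running."""
        try:
            with urllib.request.urlopen(base_url + "/", timeout=5) as resp:
                info = json.load(resp)
        except Exception as exc:
            print(f"{base_url}: server unreachable ({exc})")
            return
        print(f"{base_url}: up, version {info.get('version')}")
        with urllib.request.urlopen(base_url + "/_active_tasks", timeout=5) as resp:
            tasks = json.load(resp)
        print(f"  {len(tasks)} active task(s)")
        for task in tasks:
            print(f"    {task}")

    for host in HOSTS:
        check(host)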

On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson <peter@peterbengtson.com> wrote:
> The amount of logged data on the six servers is vast, but this is the crash message on
> node0-couch1. It's perhaps easier if I make the full log files available (give me a shout).
> Here's the snippet:
>
> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server <0.2092.0> terminating
> ** Last message in was {ibrowse_async_response,
>                           {1267,713465,777255},
>                           {error,connection_closed}}
> ** When Server state == {state,nil,nil,
>                            [<0.2077.0>,
>                             {http_db,
>                                 "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>                                 [{"User-Agent","CouchDB/0.10.1"},
>                                  {"Accept","application/json"},
>                                  {"Accept-Encoding","gzip"}],
>                                 [],get,nil,
>                                 [{response_format,binary},
>                                  {inactivity_timeout,30000}],
>                                 10,500,nil},
>                             251,
>                             [{<<"continuous">>,true},
>                              {<<"source">>,
>                               <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>                              {<<"target">>,
>                               <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>                            251,<0.2093.0>,
>                            {1267,713465,777255},
>                            false,0,<<>>,
>                            {<0.2095.0>,#Ref<0.0.0.131534>},
> ** Reason for termination ==
> ** {error,connection_closed}
> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server <0.2130.0> terminating
> ** Last message in was {ibrowse_async_response,
>                           {1267,713465,843079},
>                           {error,connection_closed}}
> ** When Server state == {state,nil,nil,
>                            [<0.2106.0>,
>                             {http_db,
>                                 "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>                                 [{"User-Agent","CouchDB/0.10.1"},
>                                  {"Accept","application/json"},
>                                  {"Accept-Encoding","gzip"}],
>                                 [],get,nil,
>                                 [{response_format,binary},
>                                  {inactivity_timeout,30000}],
>                                 10,500,nil},
>                             28136,
>                             [{<<"continuous">>,true},
>                              {<<"source">>,
>                               <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>                              {<<"target">>,
>                               <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>                            29086,<0.2131.0>,
>                            {1267,713465,843079},
>                            false,0,<<>>,
>                            {<0.2133.0>,#Ref<0.0.5.183681>},
> ** Reason for termination ==
> ** {error,connection_closed}
>
>
>
> On 5 mar 2010, at 13.44, Robert Newson wrote:
>
>> Can you include some of the log output?
>>
>> A coordinated failure like this points to external factors but log
>> output will help in any case.
>>
>> B.
>>
>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <peter@peterbengtson.com> wrote:
>>> We have a cluster of servers. At the moment there are three servers, each having
>>> two separate instances of CouchDB, like this:
>>>
>>>        node0-couch1
>>>        node0-couch2
>>>
>>>        node1-couch1
>>>        node1-couch2
>>>
>>>        node2-couch1
>>>        node2-couch2
>>>
>>> All couch1 instances are set up to replicate continuously using bidirectional
>>> pull replication. That is:
>>>
>>>        node0-couch1    pulls from node1-couch1 and node2-couch1
>>>        node1-couch1    pulls from node0-couch1 and node2-couch1
>>>        node2-couch1    pulls from node0-couch1 and node1-couch1
>>>
>>> On each node, couch1 and couch2 are set up to replicate each other continuously,
>>> again using pull replication. Thus, the full replication topology is:
>>>
>>>        node0-couch1    pulls from node1-couch1, node2-couch1, and node0-couch2
>>>        node0-couch2    pulls from node0-couch1
>>>
>>>        node1-couch1    pulls from node0-couch1, node2-couch1, and node1-couch2
>>>        node1-couch2    pulls from node1-couch1
>>>
>>>        node2-couch1    pulls from node0-couch1, node1-couch1, and node2-couch2
>>>        node2-couch2    pulls from node2-couch1
>>>
>>> No proxies are involved. In our staging system, all servers are on the same subnet.
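
For context on the setup described above: each of those pull relationships is normally
started by POSTing to the /_replicate resource of the instance doing the pulling, with
the remote database as "source", the local database as "target", and "continuous" set to
true. A minimal sketch in Python; the hostnames and database name are illustrative,
modelled on the log snippet above rather than taken from the real configuration:

    import json
    import urllib.request

    # One instance pulling continuously from another; hostnames and the
    # database name are illustrative, modelled on the log snippet above.
    PULLING_INSTANCE = "http://couch1.staging.diino.com:5984"
    REMOTE_SOURCE = "http://couch2.staging.diino.com:5984/laplace_log_staging"

    body = json.dumps({
        "source": REMOTE_SOURCE,           # remote database to pull from
        "target": "laplace_log_staging",   # local database on the pulling instance
        "continuous": True,                # keep replicating after the initial catch-up
    }).encode("utf-8")

    req = urllib.request.Request(
        PULLING_INSTANCE + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(json.load(resp))

In 0.10.x a replication started this way exists only as processes inside the Erlang VM of
the pulling node, so it has to be re-posted after a restart of that node.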
>>>
>>> The problem is that every night, the entire cluster dies. All instances of CouchDB
>>> crash, and moreover they crash exactly simultaneously.
>>>
>>> The data being replicated is very minimal at the moment - simple log text lines,
>>> no attachments. The entire database being replicated is no more than a few megabytes in size.
>>>
>>> The syslogs give no clue. The CouchDB logs are difficult to interpret unless
>>> you are an Erlang programmer. If anyone would care to look at them, just let me know.
>>>
>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>
>>> We are planning to build quite sophisticated trans-cluster job queue functionality on top
>>> of CouchDB, but of course a situation like this suggests that CouchDB replication is currently
>>> too unreliable to be of practical use, unless this is a known and/or already-fixed bug.
>>>
>>> Any pointers or ideas are most welcome.
>>>
>>>        / Peter Bengtson
>>>
>>>
>>>
>
>
