incubator-couchdb-user mailing list archives

From: Peter Bengtson <pe...@peterbengtson.com>
Subject: Re: Entire CouchDB cluster crashes simultaneously
Date: Fri, 05 Mar 2010 15:24:14 GMT
Adam, that's interesting. These crashes occur every night with alarming regularity, but the
staging system on which this runs is under no load to speak of. And there are only two
DBs in the system at this point, both of which were opened at least 12 hours earlier. I'll
ask our sysadmins to double-check the load, but I'd like to know one thing:

Why do these crashes occur system-wide? On three nodes and six CouchDB instances? And at
the same time? Somehow, we didn't quite expect that CouchDB would go quite so far as to
replicate the crashes... ;-)

	/ Peter


On 5 Mar 2010, at 15:57, Adam Kocoloski wrote:

> From that log we can tell that CouchDB crashed completely on node0-couch2 (because of
> the "Apache CouchDB has started .." message). The crashes indicating a timeout on
> couch_server:open are troubling. I've usually only seen that when a system is way
> overloaded, although it could also happen if you try to open a large number of
> previously-unopened DBs simultaneously.
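> 
> One cheap way to rule the second case out is to pre-open every database once after
> startup, so later requests never pile up behind a cold couch_server:open. A minimal
> sketch in Python 2 (the bind address and port are the defaults from the logs; the
> rest is illustrative):
> 
>     import json, urllib2
> 
>     COUCH = "http://127.0.0.1:5984"  # assumption: default bind address/port
> 
>     # GET /_all_dbs lists the databases; GET /<db> forces couch_server:open
>     # for that database, so one pass here pre-opens everything.
>     for db in json.load(urllib2.urlopen(COUCH + "/_all_dbs")):
>         info = json.load(urllib2.urlopen(COUCH + "/" + db))
>         print db, info["doc_count"]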
> 
> Adam
> 
> On Mar 5, 2010, at 8:29 AM, Peter Bengtson wrote:
> 
>> It seems as if only the replication tasks crash; the rest of CouchDB still appears to
>> be online, or alternatively is restarted quickly enough that it looks that way.
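>> 
>> A quick way to check which of the two it is, as a minimal sketch in Python 2 (the host
>> name is one of ours; the rest is illustrative): _active_tasks lists the running
>> replications, so if the server answers but the replication entries are gone, only the
>> replicator processes died.
>> 
>>     import json, urllib2
>> 
>>     node = "http://couch2.staging.diino.com:5984"
>> 
>>     # GET / proves the HTTP layer is up; /_active_tasks shows whether the
>>     # continuous replications are still registered.
>>     print urllib2.urlopen(node + "/").read()
>>     for task in json.load(urllib2.urlopen(node + "/_active_tasks")):
>>         print task.get("type"), task.get("task")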
>> 
>> This is what happens on node0-couch2 at the time of the error. There seem to be a lot
>> of disconnected sockets:
>> 
>> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
>>   {<0.63.0>,std_error,
>>    {mochiweb_socket_server,235,
>>        {child_error,{case_clause,{error,enotconn}}}}}}
>> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.22982.2>] {error_report,<0.24.0>,
>>   {<0.22982.2>,crash_report,
>>    [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
>>      {pid,<0.22982.2>},
>>      {registered_name,[]},
>>      {error_info,
>>          {error,
>>              {case_clause,{error,enotconn}},
>>              [{mochiweb_request,get,2},
>>               {couch_httpd,handle_request,5},
>>               {mochiweb_http,headers,5},
>>               {proc_lib,init_p_do_apply,3}]}},
>>      {ancestors,
>>          [couch_httpd,couch_secondary_services,couch_server_sup,<0.2.0>]},
>>      {messages,[]},
>>      {links,[<0.63.0>,#Port<0.34758>]},
>>      {dictionary,[{mochiweb_request_qs,[]},{jsonp,undefined}]},
>>      {trap_exit,false},
>>      {status,running},
>>      {heap_size,2584},
>>      {stack_size,24},
>>      {reductions,2164}],
>> [Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
>>   {<0.63.0>,std_error,
>>    {mochiweb_socket_server,235,
>>        {child_error,{case_clause,{error,enotconn}}}}}}
>> [Fri, 05 Mar 2010 04:55:32 GMT] [info] [<0.2.0>] Apache CouchDB has started on http://0.0.0.0:5984/
>> [Fri, 05 Mar 2010 04:55:50 GMT] [error] [<0.82.0>] Uncaught error in HTTP request: {exit,
>>                                {timeout,
>>                                 {gen_server,call,
>>                                  [couch_server,
>>                                   {open,<<"laplace_log_staging">>,
>>                                    [{user_ctx,
>>                                      {user_ctx,null,[<<"_admin">>]}}]}]}}}
>> [Fri, 05 Mar 2010 04:55:50 GMT] [info] [<0.82.0>] Stacktrace: [{gen_server,call,2},
>>            {couch_server,open,2},
>>            {couch_httpd_db,do_db_req,2},
>>            {couch_httpd,handle_request,5},
>>            {mochiweb_http,headers,5},
>>            {proc_lib,init_p_do_apply,3}]
>> [Fri, 05 Mar 2010 04:56:24 GMT] [info] [<0.2.0>] Apache CouchDB has started on http://0.0.0.0:5984/
>> [Fri, 05 Mar 2010 04:56:26 GMT] [error] [<0.66.0>] Uncaught error in HTTP request: {exit,normal}
>> [Fri, 05 Mar 2010 04:56:26 GMT] [info] [<0.66.0>] Stacktrace: [{mochiweb_request,send,2},
>>            {mochiweb_request,respond,2},
>>            {couch_httpd,send_response,4},
>>            {couch_httpd,handle_request,5},
>>            {mochiweb_http,headers,5},
>>            {proc_lib,init_p_do_apply,3}]
>> [Fri, 05 Mar 2010 05:25:37 GMT] [error] [<0.2694.0>] Uncaught error in HTTP request: {exit,
>>                                {timeout,
>>                                 {gen_server,call,
>>                                  [couch_server,
>>                                   {open,<<"laplace_log_staging">>,
>>                                    [{user_ctx,
>>                                      {user_ctx,null,[<<"_admin">>]}}]}]}}}
>> [Fri, 05 Mar 2010 05:26:00 GMT] [info] [<0.2.0>] Apache CouchDB has started on http://0.0.0.0:5984/
>> 
>> 
>> 
>> On 5 Mar 2010, at 14:22, Robert Newson wrote:
>> 
>>> Is couchdb crashing or just the replication tasks?
>>> 
>>> On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson <peter@peterbengtson.com> wrote:
>>>> The amount of logged data on the six instances is vast, but this is the crash
>>>> message on node0-couch1. It's perhaps easier if I make the full log files available
>>>> (give me a shout). Here's the snippet:
>>>> 
>>>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server <0.2092.0> terminating
>>>> ** Last message in was {ibrowse_async_response,
>>>>                         {1267,713465,777255},
>>>>                         {error,connection_closed}}
>>>> ** When Server state == {state,nil,nil,
>>>>                          [<0.2077.0>,
>>>>                           {http_db,
>>>>                               "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>>>>                               [{"User-Agent","CouchDB/0.10.1"},
>>>>                                {"Accept","application/json"},
>>>>                                {"Accept-Encoding","gzip"}],
>>>>                               [],get,nil,
>>>>                               [{response_format,binary},
>>>>                                {inactivity_timeout,30000}],
>>>>                               10,500,nil},
>>>>                           251,
>>>>                           [{<<"continuous">>,true},
>>>>                            {<<"source">>,
>>>>                             <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>>>>                            {<<"target">>,
>>>>                             <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>>>>                          251,<0.2093.0>,
>>>>                          {1267,713465,777255},
>>>>                          false,0,<<>>,
>>>>                          {<0.2095.0>,#Ref<0.0.0.131534>},
>>>> ** Reason for termination ==
>>>> ** {error,connection_closed}
>>>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server <0.2130.0> terminating
>>>> ** Last message in was {ibrowse_async_response,
>>>>                         {1267,713465,843079},
>>>>                         {error,connection_closed}}
>>>> ** When Server state == {state,nil,nil,
>>>>                          [<0.2106.0>,
>>>>                           {http_db,
>>>>                               "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>>>>                               [{"User-Agent","CouchDB/0.10.1"},
>>>>                                {"Accept","application/json"},
>>>>                                {"Accept-Encoding","gzip"}],
>>>>                               [],get,nil,
>>>>                               [{response_format,binary},
>>>>                                {inactivity_timeout,30000}],
>>>>                               10,500,nil},
>>>>                           28136,
>>>>                           [{<<"continuous">>,true},
>>>>                            {<<"source">>,
>>>>                             <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>>>>                            {<<"target">>,
>>>>                             <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>>>>                          29086,<0.2131.0>,
>>>>                          {1267,713465,843079},
>>>>                          false,0,<<>>,
>>>>                          {<0.2133.0>,#Ref<0.0.5.183681>},
>>>> ** Reason for termination ==
>>>> ** {error,connection_closed}
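>>>> 
>>>> For what it's worth, the http_db record above shows {inactivity_timeout,30000}, and
>>>> {error,connection_closed} means the replicator's ibrowse connection was dropped,
>>>> after which the gen_server simply dies; as far as I know, 0.10 does not restart
>>>> continuous replications on its own. Re-establishing a pull is a single POST to
>>>> /_replicate on the target, roughly like this (Python 2; the URLs are taken from the
>>>> state dump above):
>>>> 
>>>>     import json, urllib2
>>>> 
>>>>     # re-post the continuous pull that died with connection_closed
>>>>     body = json.dumps({
>>>>         "source": "http://couch2.staging.diino.com:5984/laplace_log_staging",
>>>>         "target": "laplace_log_staging",  # local db on couch1
>>>>         "continuous": True,
>>>>     })
>>>>     req = urllib2.Request("http://couch1.staging.diino.com:5984/_replicate",
>>>>                           body, {"Content-Type": "application/json"})
>>>>     print urllib2.urlopen(req).read()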
>>>> 
>>>> 
>>>> 
>>>> On 5 Mar 2010, at 13:44, Robert Newson wrote:
>>>> 
>>>>> Can you include some of the log output?
>>>>> 
>>>>> A coordinated failure like this points to external factors but log
>>>>> output will help in any case.
>>>>> 
>>>>> B.
>>>>> 
>>>>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <peter@peterbengtson.com> wrote:
>>>>>> We have a cluster of servers. At the moment there are three servers, each having
>>>>>> two separate instances of CouchDB, like this:
>>>>>> 
>>>>>>      node0-couch1
>>>>>>      node0-couch2
>>>>>> 
>>>>>>      node1-couch1
>>>>>>      node1-couch2
>>>>>> 
>>>>>>      node2-couch1
>>>>>>      node2-couch2
>>>>>> 
>>>>>> All couch1 instances are set up to replicate continuously using bidirectional
>>>>>> pull replication. That is:
>>>>>> 
>>>>>>      node0-couch1    pulls from node1-couch1 and node2-couch1
>>>>>>      node1-couch1    pulls from node0-couch1 and node2-couch1
>>>>>>      node2-couch1    pulls from node0-couch1 and node1-couch1
>>>>>> 
>>>>>> On each node, couch1 and couch2 are set up to replicate each other continuously,
>>>>>> again using pull replication. Thus, the full replication topology is as follows
>>>>>> (see the sketch after the list):
>>>>>> 
>>>>>>      node0-couch1    pulls from node1-couch1, node2-couch1, and node0-couch2
>>>>>>      node0-couch2    pulls from node0-couch1
>>>>>> 
>>>>>>      node1-couch1    pulls from node0-couch1, node2-couch1, and node1-couch2
>>>>>>      node1-couch2    pulls from node1-couch1
>>>>>> 
>>>>>>      node2-couch1    pulls from node0-couch1, node1-couch1, and node2-couch2
>>>>>>      node2-couch2    pulls from node2-couch1
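>>>>>> 
>>>>>> For concreteness, a minimal sketch of driving that topology in Python 2 (the host
>>>>>> names and map are illustrative; re-posting an already-running continuous
>>>>>> replication is assumed to be harmless here, which is worth verifying on 0.10):
>>>>>> each continuous pull is just a POST to /_replicate on the pulling instance.
>>>>>> 
>>>>>>     import json, urllib2
>>>>>> 
>>>>>>     DB = "laplace_log_staging"
>>>>>> 
>>>>>>     # puller base URL -> base URLs it pulls DB from (illustrative names)
>>>>>>     PULLS = {
>>>>>>         "http://node0-couch1:5984": ["http://node1-couch1:5984",
>>>>>>                                      "http://node2-couch1:5984",
>>>>>>                                      "http://node0-couch2:5984"],
>>>>>>         # ... same pattern for the other five instances
>>>>>>     }
>>>>>> 
>>>>>>     for puller, sources in PULLS.items():
>>>>>>         for src in sources:
>>>>>>             body = json.dumps({"source": src + "/" + DB,
>>>>>>                                "target": DB, "continuous": True})
>>>>>>             req = urllib2.Request(puller + "/_replicate", body,
>>>>>>                                   {"Content-Type": "application/json"})
>>>>>>             urllib2.urlopen(req).read()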
>>>>>> 
>>>>>> No proxies are involved. In our staging system, all servers are on the same subnet.
>>>>>> 
>>>>>> The problem is that every night the entire cluster dies. All instances of CouchDB
>>>>>> crash, and moreover they crash exactly simultaneously.
>>>>>> 
>>>>>> The data being replicated is very minimal at the moment - simple log text lines,
>>>>>> no attachments. The entire database being replicated is no more than a few
>>>>>> megabytes in size.
>>>>>> 
>>>>>> The syslogs give no clue. The CouchDB logs are difficult to interpret unless you
>>>>>> are an Erlang programmer. If anyone would care to look at them, just let me know.
>>>>>> 
>>>>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>>>> 
>>>>>> We are planning to build fairly sophisticated cross-cluster job queue
>>>>>> functionality on top of CouchDB, but a situation like this suggests that CouchDB
>>>>>> replication is currently too unreliable for practical use, unless this is a known
>>>>>> and/or already-fixed bug.
>>>>>> 
>>>>>> Any pointers or ideas are most welcome.
>>>>>> 
>>>>>>      / Peter Bengtson
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
> 

