incubator-couchdb-user mailing list archives

From Peter Bengtson <pe...@peterbengtson.com>
Subject Re: Entire CouchDB cluster crashes simultaneously
Date Fri, 05 Mar 2010 13:29:30 GMT
It seems as if only the replication tasks crash; the rest of CouchDB still appears to be online,
or is restarted so quickly that it only appears to be.
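
One quick way to check is to poll /_active_tasks on each instance and see whether the continuous
replications are still listed. A minimal Python sketch, assuming the two hostnames visible in the
logs and guessing at the exact value of the task "type" field (not verified against 0.10.1):

    import requests

    # Instances to check -- hostnames/port assumed from the log snippets below;
    # add the remaining instances of the cluster as needed.
    NODES = [
        "http://couch1.staging.diino.com:5984",
        "http://couch2.staging.diino.com:5984",
    ]

    def replication_tasks(base_url):
        """Return whatever /_active_tasks reports as replication-type tasks."""
        tasks = requests.get(base_url + "/_active_tasks").json()
        # The exact spelling of the "type" field may vary between versions,
        # so compare case-insensitively.
        return [t for t in tasks if t.get("type", "").lower() == "replication"]

    for node in NODES:
        print(node, replication_tasks(node))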

This is what happens on node0-couch2 at the time of the error. There seem to be a lot
of disconnected sockets:

[Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
    {<0.63.0>,std_error,
     {mochiweb_socket_server,235,
         {child_error,{case_clause,{error,enotconn}}}}}}
[Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.22982.2>] {error_report,<0.24.0>,
    {<0.22982.2>,crash_report,
     [[{initial_call,{mochiweb_socket_server,acceptor_loop,['Argument__1']}},
       {pid,<0.22982.2>},
       {registered_name,[]},
       {error_info,
           {error,
               {case_clause,{error,enotconn}},
               [{mochiweb_request,get,2},
                {couch_httpd,handle_request,5},
                {mochiweb_http,headers,5},
                {proc_lib,init_p_do_apply,3}]}},
       {ancestors,
           [couch_httpd,couch_secondary_services,couch_server_sup,<0.2.0>]},
       {messages,[]},
       {links,[<0.63.0>,#Port<0.34758>]},
       {dictionary,[{mochiweb_request_qs,[]},{jsonp,undefined}]},
       {trap_exit,false},
       {status,running},
       {heap_size,2584},
       {stack_size,24},
       {reductions,2164}],
[Fri, 05 Mar 2010 04:55:12 GMT] [error] [<0.63.0>] {error_report,<0.24.0>,
    {<0.63.0>,std_error,
     {mochiweb_socket_server,235,
         {child_error,{case_clause,{error,enotconn}}}}}}
[Fri, 05 Mar 2010 04:55:32 GMT] [info] [<0.2.0>] Apache CouchDB has started on http://0.0.0.0:5984/
[Fri, 05 Mar 2010 04:55:50 GMT] [error] [<0.82.0>] Uncaught error in HTTP request: {exit,
                                 {timeout,
                                  {gen_server,call,
                                   [couch_server,
                                    {open,<<"laplace_log_staging">>,
                                     [{user_ctx,
                                       {user_ctx,null,[<<"_admin">>]}}]}]}}}
[Fri, 05 Mar 2010 04:55:50 GMT] [info] [<0.82.0>] Stacktrace: [{gen_server,call,2},
             {couch_server,open,2},
             {couch_httpd_db,do_db_req,2},
             {couch_httpd,handle_request,5},
             {mochiweb_http,headers,5},
             {proc_lib,init_p_do_apply,3}]
[Fri, 05 Mar 2010 04:56:24 GMT] [info] [<0.2.0>] Apache CouchDB has started on http://0.0.0.0:5984/
[Fri, 05 Mar 2010 04:56:26 GMT] [error] [<0.66.0>] Uncaught error in HTTP request: {exit,normal}
[Fri, 05 Mar 2010 04:56:26 GMT] [info] [<0.66.0>] Stacktrace: [{mochiweb_request,send,2},
             {mochiweb_request,respond,2},
             {couch_httpd,send_response,4},
             {couch_httpd,handle_request,5},
             {mochiweb_http,headers,5},
             {proc_lib,init_p_do_apply,3}]
[Fri, 05 Mar 2010 05:25:37 GMT] [error] [<0.2694.0>] Uncaught error in HTTP request: {exit,
                                 {timeout,
                                  {gen_server,call,
                                   [couch_server,
                                    {open,<<"laplace_log_staging">>,
                                     [{user_ctx,
                                       {user_ctx,null,[<<"_admin">>]}}]}]}}}
[Fri, 05 Mar 2010 05:26:00 GMT] [info] [<0.2.0>] Apache CouchDB has started on http://0.0.0.0:5984/



On 5 mar 2010, at 14.22, Robert Newson wrote:

> Is couchdb crashing or just the replication tasks?
> 
> On Fri, Mar 5, 2010 at 8:15 AM, Peter Bengtson <peter@peterbengtson.com> wrote:
>> The amount of logged data on the six servers is vast, but this is the crash message
>> on node0-couch1. It's perhaps easier if I make the full log files available (give me a shout).
>> Here's the snippet:
>> 
>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2092.0>] ** Generic server <0.2092.0> terminating
>> ** Last message in was {ibrowse_async_response,
>>                           {1267,713465,777255},
>>                           {error,connection_closed}}
>> ** When Server state == {state,nil,nil,
>>                            [<0.2077.0>,
>>                             {http_db,
>>                                 "http://couch2.staging.diino.com:5984/laplace_conf_staging/",
>>                                 [{"User-Agent","CouchDB/0.10.1"},
>>                                  {"Accept","application/json"},
>>                                  {"Accept-Encoding","gzip"}],
>>                                 [],get,nil,
>>                                 [{response_format,binary},
>>                                  {inactivity_timeout,30000}],
>>                                 10,500,nil},
>>                             251,
>>                             [{<<"continuous">>,true},
>>                              {<<"source">>,
>>                               <<"http://couch2.staging.diino.com:5984/laplace_conf_staging">>},
>>                              {<<"target">>,
>>                               <<"http://couch1.staging.diino.com:5984/laplace_conf_staging">>}]],
>>                            251,<0.2093.0>,
>>                            {1267,713465,777255},
>>                            false,0,<<>>,
>>                            {<0.2095.0>,#Ref<0.0.0.131534>},
>> ** Reason for termination ==
>> ** {error,connection_closed}
>> [Fri, 05 Mar 2010 04:55:55 GMT] [error] [<0.2130.0>] ** Generic server <0.2130.0> terminating
>> ** Last message in was {ibrowse_async_response,
>>                           {1267,713465,843079},
>>                           {error,connection_closed}}
>> ** When Server state == {state,nil,nil,
>>                            [<0.2106.0>,
>>                             {http_db,
>>                                 "http://couch2.staging.diino.com:5984/laplace_log_staging/",
>>                                 [{"User-Agent","CouchDB/0.10.1"},
>>                                  {"Accept","application/json"},
>>                                  {"Accept-Encoding","gzip"}],
>>                                 [],get,nil,
>>                                 [{response_format,binary},
>>                                  {inactivity_timeout,30000}],
>>                                 10,500,nil},
>>                             28136,
>>                             [{<<"continuous">>,true},
>>                              {<<"source">>,
>>                               <<"http://couch2.staging.diino.com:5984/laplace_log_staging">>},
>>                              {<<"target">>,
>>                               <<"http://couch1.staging.diino.com:5984/laplace_log_staging">>}]],
>>                            29086,<0.2131.0>,
>>                            {1267,713465,843079},
>>                            false,0,<<>>,
>>                            {<0.2133.0>,#Ref<0.0.5.183681>},
>> ** Reason for termination ==
>> ** {error,connection_closed}
>> 
>> 
>> 
>> On 5 mar 2010, at 13.44, Robert Newson wrote:
>> 
>>> Can you include some of the log output?
>>> 
>>> A coordinated failure like this points to external factors but log
>>> output will help in any case.
>>> 
>>> B.
>>> 
>>> On Fri, Mar 5, 2010 at 7:18 AM, Peter Bengtson <peter@peterbengtson.com> wrote:
>>>> We have a cluster of servers. At the moment there are three servers, each
>>>> having two separate instances of CouchDB, like this:
>>>> 
>>>>        node0-couch1
>>>>        node0-couch2
>>>> 
>>>>        node1-couch1
>>>>        node1-couch2
>>>> 
>>>>        node2-couch1
>>>>        node2-couch2
>>>> 
>>>> All couch1 instances are set up to replicate continuously using bidirectional
>>>> pull replication. That is:
>>>> 
>>>>        node0-couch1    pulls from node1-couch1 and node2-couch1
>>>>        node1-couch1    pulls from node0-couch1 and node2-couch1
>>>>        node2-couch1    pulls from node0-couch1 and node1-couch1
>>>> 
>>>> On each node, couch1 and couch2 are set up to replicate each other continuously,
>>>> again using pull replication. Thus, the full replication topology is as follows
>>>> (a rough sketch of how each of these replications is started appears after the list):
>>>> 
>>>>        node0-couch1    pulls from node1-couch1, node2-couch1, and node0-couch2
>>>>        node0-couch2    pulls from node0-couch1
>>>> 
>>>>        node1-couch1    pulls from node0-couch1, node2-couch1, and node1-couch2
>>>>        node1-couch2    pulls from node1-couch1
>>>> 
>>>>        node2-couch1    pulls from node0-couch1, node1-couch1, and node2-couch2
>>>>        node2-couch2    pulls from node2-couch1
>>>> 
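>>>> Each of these is an ordinary continuous pull replication, started by POSTing to the
>>>> pulling instance's /_replicate endpoint. A minimal sketch of one such pair, using one
>>>> database and one pair of hosts from our setup as an example (error handling and any
>>>> credentials omitted):
>>>> 
>>>>     import json
>>>>     import requests
>>>> 
>>>>     # couch1 pulling laplace_conf_staging from couch2 (continuous pull replication).
>>>>     # POSTing to couch1 makes couch1 the instance that runs the replication.
>>>>     body = {
>>>>         "source": "http://couch2.staging.diino.com:5984/laplace_conf_staging",
>>>>         "target": "http://couch1.staging.diino.com:5984/laplace_conf_staging",
>>>>         "continuous": True,
>>>>     }
>>>>     resp = requests.post("http://couch1.staging.diino.com:5984/_replicate",
>>>>                          data=json.dumps(body),
>>>>                          headers={"Content-Type": "application/json"})
>>>>     resp.raise_for_status()
>>>>     print(resp.json())
>>>> 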
>>>> No proxies are involved. In our staging system, all servers are on the same
>>>> subnet.
>>>> 
>>>> The problem is that every night, the entire cluster dies. All instances of
>>>> CouchDB crash, and moreover they crash exactly simultaneously.
>>>> 
>>>> The data being replicated is minimal at the moment - simple log text
>>>> lines, no attachments. The entire database being replicated is no more than a few
>>>> megabytes in size.
>>>> 
>>>> The syslogs give no clue. The CouchDB logs are difficult to interpret unless
>>>> you are an Erlang programmer. If anyone would care to look at them, just let me know.
>>>> 
>>>> Any clues as to why this is happening? We're using 0.10.1 on Debian.
>>>> 
>>>> We are planning to build quite sophisticated cross-cluster job queue functionality
>>>> on top of CouchDB, but a situation like this suggests that CouchDB replication is
>>>> currently too unreliable for practical use, unless this is a known and/or already fixed bug.
>>>> 
>>>> Any pointers or ideas are most welcome.
>>>> 
>>>>        / Peter Bengtson
>>>> 
>>>> 
>>>> 
>> 
>> 

