couchdb-dev mailing list archives

From "Darren Gibbard (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COUCHDB-2070) [1.4.0] CouchDB Replication Crashes
Date Wed, 19 Feb 2014 13:56:21 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905458#comment-13905458 ]

Darren Gibbard commented on COUCHDB-2070:
-----------------------------------------

Another example of a timeout issue, probably related: the following shows the compaction daemon dying.
The really concerning thing with this one, though, is the complaints that the replicator/replication
databases disappeared! They did at least return afterwards.
{noformat}
[Wed, 19 Feb 2014 09:24:16 GMT] [error] [<0.27372.16>] {error_report,<0.30.0>,
                        {<0.27372.16>,supervisor_report,
                         [{supervisor,{local,couch_secondary_services}},
                          {errorContext,child_terminated},
                          {reason,
                           {compaction_loop_died,
                            {timeout,
                             {gen_server,call,[couch_server,get_server]}}}},
                          {offender,
                           [{pid,<0.7610.20>},
                            {name,compaction_daemon},
                            {mfargs,{couch_compaction_daemon,start_link,[]}},
                            {restart_type,permanent},
                            {shutdown,brutal_kill},
                            {child_type,worker}]}]}}
[Wed, 19 Feb 2014 09:24:17 GMT] [error] [<0.23575.19>] Replicator, request GET to "http://admin:*****@192.168.24.92:5984/pim/_changes?feed=continuous&style=all_docs&since=47931809&heartbeat=10000" failed due to error {error,req_timedout}
[Wed, 19 Feb 2014 09:24:17 GMT] [error] [<0.5660.20>] Uncaught error in HTTP request:
{exit,
                                                       {timeout,
                                                        {gen_server,call,
                                                         [couch_server,
                                                          get_server]}}}
[Wed, 19 Feb 2014 09:24:18 GMT] [error] [<0.5660.20>] httpd 500 error response:
 {"error":"timeout","reason":"{gen_server,call,[couch_server,get_server]}"}

[Wed, 19 Feb 2014 09:24:21 GMT] [error] [<0.7793.20>] ** Generic server couch_compaction_daemon terminating
** Last message in was {'EXIT',<0.7773.20>,
                           {timeout,
                               {gen_server,call,[couch_server,get_server]}}}
** When Server state == {state,<0.7773.20>}
** Reason for termination == 
** {compaction_loop_died,
       {timeout,{gen_server,call,[couch_server,get_server]}}}

[Wed, 19 Feb 2014 09:24:21 GMT] [error] [<0.7793.20>] {error_report,<0.30.0>,
                       {<0.7793.20>,crash_report,
                        [[{initial_call,
                           {couch_compaction_daemon,init,['Argument__1']}},
                          {pid,<0.7793.20>},
                          {registered_name,couch_compaction_daemon},
                          {error_info,
                           {exit,
                            {compaction_loop_died,
                             {timeout,
                              {gen_server,call,[couch_server,get_server]}}},
                            [{gen_server,terminate,6,
                              [{file,"gen_server.erl"},{line,744}]},
                             {proc_lib,init_p_do_apply,3,
                              [{file,"proc_lib.erl"},{line,239}]}]}},
                          {ancestors,
                           [couch_secondary_services,couch_server_sup,
                            <0.31.0>]},
                          {messages,[]},
                          {links,[<0.27372.16>]},
                          {dictionary,[]},
                          {trap_exit,true},
                          {status,running},
                          {heap_size,610},
                          {stack_size,27},
                          {reductions,3109}],
                         []]}}
[Wed, 19 Feb 2014 09:24:21 GMT] [error] [<0.27372.16>] {error_report,<0.30.0>,
                        {<0.27372.16>,supervisor_report,
                         [{supervisor,{local,couch_secondary_services}},
                          {errorContext,child_terminated},
                          {reason,
                           {compaction_loop_died,
                            {timeout,
                             {gen_server,call,[couch_server,get_server]}}}},
                          {offender,
                           [{pid,<0.7793.20>},
                            {name,compaction_daemon},
                            {mfargs,{couch_compaction_daemon,start_link,[]}},
                            {restart_type,permanent},
                            {shutdown,brutal_kill},
                            {child_type,worker}]}]}}
[Wed, 19 Feb 2014 09:24:21 GMT] [error] [<0.27372.16>] {error_report,<0.30.0>,
                           {<0.27372.16>,supervisor_report,
                            [{supervisor,{local,couch_secondary_services}},
                             {errorContext,shutdown},
                             {reason,reached_max_restart_intensity},
                             {offender,
                                 [{pid,<0.7793.20>},
                                  {name,compaction_daemon},
                                  {mfargs,
                                      {couch_compaction_daemon,start_link,[]}},
                                  {restart_type,permanent},
                                  {shutdown,brutal_kill},
                                  {child_type,worker}]}]}}
[Wed, 19 Feb 2014 09:24:21 GMT] [error] [<0.23575.19>] Replicator, request GET to "http://admin:*****@192.168.24.92:5984/pim/_changes?feed=continuous&style=all_docs&since=47931809&heartbeat=10000" failed due to error {error,connection_closed}
[Wed, 19 Feb 2014 09:24:21 GMT] [error] [<0.83.0>] {error_report,<0.30.0>,
                       {<0.83.0>,supervisor_report,
                        [{supervisor,{local,couch_server_sup}},
                         {errorContext,child_terminated},
                         {reason,shutdown},
                         {offender,
                             [{pid,<0.27372.16>},
                              {name,couch_secondary_services},
                              {mfargs,{couch_secondary_sup,start_link,[]}},
                              {restart_type,permanent},
                              {shutdown,infinity},
                              {child_type,supervisor}]}]}}
[Wed, 19 Feb 2014 09:25:13 GMT] [error] [<0.11534.20>] Could not open file /opt/couchdb/dbs/replicator.couch: no such file or directory
[Wed, 19 Feb 2014 09:25:19 GMT] [error] [<0.11791.20>] Could not open file /opt/couchdb/dbs/replication.couch: no such file or directory
{noformat}
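
For anyone wanting to tally these failures across a full log rather than reading a 3000+ line dump by eye, here is a rough Python sketch. It assumes every entry starts with the "[timestamp] [level] [<pid>]" header seen in the excerpt above and that wrapped or indented lines belong to the preceding entry. The names ENTRY_RE, SIGNATURES, split_entries and summarize are hypothetical, and the signature list only covers the errors shown in this comment.

```python
import re
from collections import Counter

# Entry header as seen in the excerpt: [timestamp] [level] [<pid>] message
ENTRY_RE = re.compile(r"^\[([^\]]+)\] \[(\w+)\] \[(<[\d.]+>)\] ?(.*)$")

# Label -> substring to look for once all whitespace is removed from an entry.
# These signatures come from the excerpt only; this is not an exhaustive list.
SIGNATURES = {
    "couch_server_timeout": "{timeout,{gen_server,call,[couch_server,get_server]}}",
    "req_timedout": "req_timedout",
    "max_restart_intensity": "reached_max_restart_intensity",
    "missing_db_file": "Couldnotopenfile",
}


def split_entries(lines):
    """Group raw log lines into entries; lines without a header are
    treated as continuations of the previous entry."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY_RE.match(line)
        if match:
            time, level, pid, text = match.groups()
            entries.append({"time": time, "level": level, "pid": pid, "text": text})
        elif entries:
            entries[-1]["text"] += "\n" + line
    return entries


def summarize(lines):
    """Count how many entries contain each known error signature."""
    counts = Counter()
    for entry in split_entries(lines):
        # Erlang terms are pretty-printed across several lines, so strip
        # all whitespace before searching for a signature.
        squashed = re.sub(r"\s+", "", entry["text"])
        for label, needle in SIGNATURES.items():
            if needle in squashed:
                counts[label] += 1
    return counts
```

Calling summarize() on an open log file object (or any iterable of lines) returns a Counter keyed by the labels above. Comparing the counts per node might help show whether the couch_server timeouts always precede the replicator req_timedout errors, though that correlation is only a guess at this point.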

> [1.4.0] CouchDB Replication Crashes
> -----------------------------------
>
>                 Key: COUCHDB-2070
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2070
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues) 
>          Components: Replication
>            Reporter: Darren Gibbard
>
> Hi all,
> I have an issue at the moment that appears to have followed me from v1.2.1 with Erlang
> R14 through to an upgrade to v1.4.0 with R16B01.
> I have 20 "remote" nodes and one "central" node; each of the remote instances is
> configured with bidirectional replication (i.e. no replication is defined on the Central node directly).
> There is a single main database of ~600,000 documents, ~11GB in size.
> On the remote nodes, and more frequently on the Central node, I get *huge* (3000+ lines)
> errors in the logs, seemingly intermittently; I have yet to track down the root cause. Open
> file handles and ERL_MAX_PORTS are set to values upwards of 16k.
> Other stats:
> {noformat}
> $ sudo su - couchdb -c "lsof | grep -c ."
> 1511
> $ sudo netstat -npla | grep "ESTAB" | grep -c .
> 310
> $ ps -ef | grep -c "^couchdb" 
> 19
> {noformat}
> An example log from a Remote node is: http://dgunix.com/cdblog/couchdb_v1.4.0_erl16B01.20140218.log
> An example log from the Central node is: http://dgunix.com/cdblog/couchdb_v1.4.0_erl16B01_central.20140218.log
> The main error line appears to be "{error,{error,req_timedout}}}}", for either "_bulk_docs" on the
> remote nodes or "_revs_diff" on the central node.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
