couchdb-dev mailing list archives

From "Gunther Gruber (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COUCHDB-2484) replication crashes
Date Thu, 12 Mar 2015 10:39:38 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358470#comment-14358470
] 

Gunther Gruber commented on COUCHDB-2484:
-----------------------------------------

I have some additional information about this problem. I think it is related to a large number
of reads on small files from the 3 TB database. The system has 4 cores and we set ERL_FLAGS="+A 16",
which is probably much too high. Looking at the processes with strace, it appears that
the couchjs processes spend much of their time waiting for a lock to be released. This is only
an assumption, though; I am not 100% sure.
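
A minimal sketch of the comparison described above: it reads the async thread count from an ERL_FLAGS string and relates it to the number of cores. The helper names `async_threads` and `threads_per_core` are hypothetical and only for illustration, and the sketch does not decide what ratio counts as too high, since that is not established in this thread.

```python
import os
import re


def async_threads(erl_flags):
    """Return the value of the +A flag in an ERL_FLAGS string, or None if absent."""
    match = re.search(r"(?:^|\s)\+A\s*(\d+)", erl_flags)
    return int(match.group(1)) if match else None


def threads_per_core(erl_flags, cores):
    """Return async threads per core, or None if +A is not set or cores is invalid."""
    threads = async_threads(erl_flags)
    if threads is None or cores < 1:
        return None
    return threads / cores


if __name__ == "__main__":
    # Assumption: ERL_FLAGS is exported in the environment of this script,
    # as it is in /etc/default/couchdb for the CouchDB init script.
    flags = os.environ.get("ERL_FLAGS", "")
    cores = os.cpu_count() or 1
    print("async threads:", async_threads(flags))
    print("cores:", cores)
    print("threads per core:", threads_per_core(flags, cores))
```

For the values mentioned in this comment, `threads_per_core('+A 16', 4)` gives 4 async threads per core.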

> replication crashes
> -------------------
>
>                 Key: COUCHDB-2484
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2484
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues) 
>          Components: Database Core
>    Affects Versions: 1.x.x
>            Reporter: Gunther Gruber
>
> We are using CouchDB version 1.2.0 with 8.3T of data; the biggest database is 2.1T.
> We are currently switching to new hardware with more storage space. We copied the
> files with rsync and started the replication.
> One system is already in sync, the other is doing the replication.
> I note that, despite the errors in the log, the first system is now in sync.
> The log looks like this:
> Retrying POST request to http://replication:XXXX/database/_revs_diff in 0.5 seconds due to error req_timedout
> and then
> [Mon, 01 Dec 2014 13:00:28 GMT] [error] [<0.27044.1>] ** Generic server <0.27044.1> terminating
> ** Last message in was {'EXIT',<0.26965.1>,killed}
> ** When Server state == {state,<0.26965.1>,<0.27045.1>,40,
>                             {httpdb,
>                                 "http://replication:XXX@XXX.5984/sm_chemie/",
>                                 nil,
>                                 [{"Accept","application/json"},
>                                  {"User-Agent","CouchDB/1.2.0"}],
>                                 30000,
>                                 [{socket_options,
>                                      [{recbuf,262144},
>                                       {sndbuf,262144},
>                                       {nodelay,true},
>                                       {keepalive,true}]}],
>                                 10,250,<0.26966.1>,40},
>                             {httpdb,
>                                 "http://replication:XXX@XXX:5984/sm_chemie/",
>                                 nil,
>                                 [{"Accept","application/json"},
>                                  {"User-Agent","CouchDB/1.2.0"}],
>                                 30000,
>                                 [{socket_options,
>                                      [{recbuf,262144},
>                                       {sndbuf,262144},
>                                       {nodelay,true},
>                                       {keepalive,true}]}],
>                                 10,250,<0.26968.1>,40},
>                             [],nil,nil,nil,
>                             {rep_stats,0,0,0,0,0},
>                             nil,nil,
>                             {batch,[],0}}
> ** Reason for termination == 
> ** killed
> [Mon, 01 Dec 2014 13:00:28 GMT] [error] [<0.27042.1>] {error_report,<0.31.0>,
>                        {<0.27042.1>,crash_report,
>                         [[{initial_call,
>                            {couch_replicator_worker,init,['Argument__1']}},
>                           {pid,<0.27042.1>},
>                           {registered_name,[]},
>                           {error_info,
>                            {exit,killed,
>                             [{gen_server,terminate,6,
>                               [{file,"gen_server.erl"},{line,747}]},
>                              {proc_lib,init_p_do_apply,3,
>                               [{file,"proc_lib.erl"},{line,227}]}]}},
>                           {ancestors,
>                            [<0.26965.1>,couch_rep_sup,couch_primary_services,
>                             couch_server_sup,<0.32.0>]},
>                           {messages,[]},
>                           {links,[<0.27043.1>]},
>                           {dictionary,
>                            [{last_stats_report,{1417,438797,704976}}]},
>                           {trap_exit,true},
>                           {status,running},
>                           {heap_size,377},
>                           {stack_size,24},
>                           {reductions,372}],
>                          []]}}
> It looks to me like a timeout, after which the replication task exits. I have already
> played around with the configuration settings, without success. I can provide more
> information if needed.
> /etc/couchdb/local.d/001-user_config.ini
> [couchdb]
> file_compression = snappy
> max_dbs_open = 400
> [httpd]
> bind_address = ::
> server_options = [{backlog, 128}, {acceptor_pool_size, 16}]
> socket_options = [{recbuf, 262144}, {sndbuf, 262144}, {nodelay, true}, {keepalive, true}]
> [couch_httpd_auth]
> secret = 
> [log_level_by_module]
> couch_httpd = warning
> couch_replicator = debug
> couch_query_servers = warning 
> [daemons]
> httpsd = {couch_httpd, start_link, [https]}
> [ssl]
> cert_file = /etc/couchdb/ssl/certs/couchdb-couch1.prime.adns.de.pem
> key_file =  /etc/couchdb/ssl/private/couchdb-couch1.prime.adns.de.pem
> verify_ssl_certificates = false
> [replicator]
> worker_batch_size = 2000
> worker_processes = 40
> http_connections = 40
> socket_options = [{recbuf, 262144}, {sndbuf, 262144}, {nodelay, true}, {keepalive, true}]
> /etc/default/couchdb
> # Sourced by init script for configuration.
> COUCHDB_USER=couchdb
> COUCHDB_STDOUT_FILE=/dev/null
> COUCHDB_STDERR_FILE=/dev/null
> COUCHDB_RESPAWN_TIMEOUT=5
> COUCHDB_OPTIONS=
> # 32 Threads to handle I/O
> export ERL_FLAGS="+A 32"
> # 8192 open files
> export ERL_MAX_PORTS=8192
> ulimit -n 8192
> The current workaround is to restart CouchDB every other hour.
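
The retry message and the state dump in the report above can be related with a small sketch. It assumes that the `10,250` fields in the `httpdb` tuples are the retry count and the initial wait in milliseconds, and that the wait doubles after each failed attempt. Both are assumptions, not confirmed in this thread. The function name `backoff_schedule` and the optional `max_wait_ms` cap are hypothetical.

```python
def backoff_schedule(initial_wait_ms, retries, max_wait_ms=None):
    """Return the list of waits (in ms) before each retry, doubling every time.

    Assumption: the wait doubles after every failed attempt. The optional
    cap is a hypothetical parameter and is not taken from the report.
    """
    waits = []
    wait = initial_wait_ms
    for _ in range(retries):
        waits.append(wait)
        wait *= 2
        if max_wait_ms is not None:
            wait = min(wait, max_wait_ms)
    return waits


if __name__ == "__main__":
    # Values read from the state dump, under the assumptions stated above.
    schedule = backoff_schedule(initial_wait_ms=250, retries=10)
    print("waits in seconds:", [w / 1000 for w in schedule])
    print("total wait in seconds:", sum(schedule) / 1000)
```

Under these assumptions the second wait is 500 ms, which matches the "0.5 seconds" in the logged retry message. The total does not include the time each request spends before it times out.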



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
