couchdb-dev mailing list archives

From Jan Lehnardt <...@apache.org>
Subject Re: [2.0] Replication Issues
Date Sun, 26 Jul 2015 17:03:52 GMT

> On 26 Jul 2015, at 14:47, Jan Lehnardt <jan@apache.org> wrote:
> 
> Hey all,
> 
> I’m trying to upgrade a database from 1.6.1 to 2.0.0/master/0c579b98 and I’m seeing
a number of issues.
> 
> Any help is greatly appreciated. Since this is our official upgrade path for 2.0, this
has to be rock-solid.
> 
> Feel free to break out individual issues into new threads, if that helps keep things organised.
> 
> Scroll down for detailed information about the database, and machine configurations.
> 
> 
> ## The Scenario
> 
> Replication is running on 2.0, pulling from 1.6.1 over the EC2 internal ip address.
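
For readers following along, a pull replication of this shape can be described with a request body like the one below. This is a minimal sketch, not necessarily how the replication in this thread was started: the `_replicate` endpoint and its `source`, `target` and `create_target` fields are standard CouchDB, while the helper name, the target URL and the choice of `create_target` are assumptions for illustration. The source URL and the port 15984 are taken from the log lines in this message.

```python
import json

def build_pull_replication(source_url, target_url, create_target=True):
    """Return the JSON body for a pull replication (hypothetical helper).

    The body is meant to be POSTed to /_replicate on the node that does
    the pulling, i.e. the 2.0 side in this scenario.
    """
    return {
        "source": source_url,            # remote 1.6.1 database
        "target": target_url,            # database on the 2.0 dev cluster
        "create_target": create_target,  # assumption: target may not exist yet
    }

body = build_pull_replication(
    "http://172.31.10.115:5984/generic_db_name",  # source URL as seen in the logs
    "http://127.0.0.1:15984/generic_db_name",     # assumed target URL
)
print(json.dumps(body, indent=2))
```

The sketch only builds and prints the body; sending it (and any authentication) is left out on purpose.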
> 
> ## The Issues
> 
> 1. Repeated log entries for “write quorum for <targetdb> failed”. I’ve seen this in other contexts as well. Why is this happening, and should it?
> 
> 
> 2. I’m getting a lot of “cassim_metadata_cache changes listener died” messages from all nodes, about every 5 seconds. What’s up with these?
> 
> - 2015-07-26 08:30:34.400 [error] Undefined emulator Error in process <0.14633.26> on node 'node3@127.0.0.1' with exit value: {function_clause,[{cassim_metadata_cache,changes_callback,[waiting_for_updates,"0"]},{fabric_view_changes,keep_sending_changes,8},{fabric_view_changes,go,5}]}
> 
> - 2015-07-26 08:30:39.401 [notice] node3@127.0.0.1 <0.314.0> cassim_metadata_cache changes listener died {function_clause,[{cassim_metadata_cache,changes_callback,[waiting_for_updates,"0"]},{fabric_view_changes,keep_sending_changes,8},{fabric_view_changes,go,5}]}

Alexander pointed to https://github.com/apache/couchdb-fabric/commit/b6659c8344c9a028b5ab451be41a991801c2ab3d#diff-2af86e058b4e7a4a99a7c5a12da6debdR96
which is part of Adam’s recent work on COUCHDB-2724.

Adam, any insights? :)

Best
Jan
--



> 
> 
> 3. A number of “Replicator, request PUT to "http://0.0.0.0:15984/<target>/edbef049aae9c8828f336534984e5e4f" failed due to error {error,req_timedout}” errors. This happens for regular docs, local docs, and _bulk_docs. The machine is basically idle (see below for details): the three beam.smp processes are at 200-250% CPU each, and io is 98% idle (it’s mostly logs being written).
> 
> 
> 4. Two issues from couch_replicator_api_wrap.erl:
> 
> - 2015-07-26 08:22:49.849 [error] Undefined <0.3546.0> gen_server <0.3546.0> terminated with reason: no function clause matching couch_replicator_api_wrap:'-update_docs/4-fun-2-'(400, [{"Server","MochiWeb/1.0 (Any of you quaids got a smint?)"},{"Date","Sun, 26 Jul 2015 08:22:49 G..."},...], null, [<<"{\"_id\":\"12345678\",\"_rev\":\"1050-ee6c7d54276b43bc937470e44e0283f2\",...
> 
> - 2015-07-26 08:30:08.514 [notice] node3@127.0.0.1 <0.6360.26> Retrying GET to http://172.31.10.115:5984/generic_db_name/12348765?revs=true&open_revs=%5B%228-b2826209867a286c76e6a2762f10b1e0%22%5D&latest=true in 1.0 seconds due to error {function_clause,[{couch_replicator_api_wrap,run_user_fun,4},{couch_replicator_api_wrap,receive_docs,4},{couch_replicator_api_wrap,receive_docs_loop,6},{couch_replicator_api_wrap,'-open_doc_revs/6-fun-4-',7}]}
> 
> 
> 
> 5. Eventually, replication reliably stops with an “invalid_ejson” error, but I don’t
yet know if that’s because of the api_wrap issue or something else.
> 
> 
> 
> 6. Replication stopped numerous times before I got to this point. I haven’t had time to look into why, but I have all the logs. They are 130MB in total, so it’ll be a while.
> 
> 
> 7. When replication ran, it replicated at a rate of about 1000 docs/s, which felt a little
slow, but I have no experience there, yet.
> 
> 
> ## Source Database Info
> 
> {
>  "db_name": "generic_db_name",
>  "doc_count": 6808004,
>  "doc_del_count": 18856,
>  "update_seq": 8044450,
>  "purge_seq": 0,
>  "compact_running": false,
>  "disk_size": 16293904519,
>  "data_size": 11711402577,
>  "instance_start_time": "1437834202967309",
>  "disk_format_version": 6,
>  "committed_update_seq": 8044450
> }
> 
> Mostly small-ish docs, no big outliers, no attachments.
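
As a rough sanity check on the numbers above, the following sketch derives the average document size and what the reported rate of about 1000 docs/s implies for one full pass over the database. It assumes the rate stays constant and ignores deleted docs and revision history, so treat the results as back-of-the-envelope figures only; the variable names are made up for illustration.

```python
# Figures copied from the source database info above.
doc_count = 6808004
data_size = 11711402577  # bytes

# Rate reported in issue 7; treated here as constant, which is an assumption.
docs_per_second = 1000

avg_doc_bytes = data_size / doc_count
full_pass_hours = doc_count / docs_per_second / 3600
payload_mb_per_second = docs_per_second * avg_doc_bytes / 1e6

print(f"average doc size: {avg_doc_bytes:.0f} bytes")
print(f"one full pass:    {full_pass_hours:.1f} hours")
print(f"payload rate:     {payload_mb_per_second:.2f} MB/s")
```

With these inputs the average works out to roughly 1.7 kB per doc and one full pass to a little under two hours.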
> 
> Source machine info:
> 
> Amazon EC2 m3.xlarge 4 cores, 64bit, 16GB RAM, 100GB SSD, 3000 provisioned iops. FFM
Availability Zone.
> 
> Standard EC2 Ubuntu, Erlang R16B03 (I know, but that’s not the problem here, this couch
behaves fine).
> 
> Target machine info:
> 
> Amazon EC2 m4.10xlarge, 40 cores, 64bit, 160GB RAM, 100GB SSD, 3000 iops (not provisioned),
10GigE networking, FFM AZ.
> 
> The latency between both instances is very small and the network throughput is good (copying a file runs at between 100 and 200MB/s).
> 
> Standard EC2 Amazon Linux (Redhat/Fedora derivative), Erlang R14B04. CouchDB 2.0 running
as dev/run
> 
> 
> Thanks!
> Jan
> -- 
> 

-- 
Professional Support for Apache CouchDB:
http://www.neighbourhood.ie/couchdb-support/

