From: Blair Zajac <blair@orcaware.com>
To: user@couchdb.apache.org
Date: Wed, 26 Aug 2009 13:33:49 -0700
Subject: Re: Replication and new user questions

Hi Adam,

Thanks for the quick reply, I appreciate it.  I neglected to mention
that we're running trunk r807360.  Replies inline below.

Adam Kocoloski wrote:
> Hi Blair, all good questions, I'll try to answer inline:
>
> On Aug 25, 2009, at 5:10 PM, Blair Zajac wrote:
>
>> 1) What's the most robust automatic replication mechanism?  While
>> continuous replication looks nice, I see there are some open tickets
>> for it and that it has issues with four nodes.  Would a more robust
>> solution, if a little slower and heavier, be an update_notification
>> handler that manually POSTs to _replicate?
>
> We're committed to making continuous replication as robust and
> performant as possible.  The entire replication codebase went through
> a significant refactoring after 0.9, and what you're seeing is us
> ironing out a few of the kinks before 0.10 gets out the door.  I'd
> encourage you to give "continuous":true a shot, provided my answer to
> 2) isn't a deal-breaker.

No, 2) isn't a deal-breaker.

>> Will there be a way to manage the list of replicant databases when
>> the persistent continuous replication feature is complete?
>
> Absolutely yes.  It will probably be a special DB called _replication
> where you can PUT and DELETE documents that configure continuous
> replications.

That's great.  Is there a wiki or some other place where CouchDB keeps
its design documents for new features so people can learn about them,
e.g. a ticket or a text file checked into svn?
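Going back to 1) for a moment, here is roughly the update_notification
approach I had in mind (untested sketch only; the handler name, script
path, and remote URL are placeholders, and the exact shape of the
notification JSON is from memory of the wiki, so double-check it):

In local.ini:

  [update_notification]
  replicate_notify = /usr/local/bin/couch-replicate-notify.sh

And the handler, which would read one JSON line per database update
from stdin, something like {"type":"updated","db":"db1"}:

  #!/bin/sh
  # Hypothetical update_notification handler: whenever db1 is updated,
  # trigger a one-shot replication of db1 to a remote copy.
  # The remote URL below is a placeholder.
  while read line; do
    case "$line" in
      *'"db":"db1"'*)
        curl -s -X POST http://localhost:5984/_replicate \
          -d '{"source": "db1", "target": "http://remote.example.com:5984/db1"}' \
          >/dev/null
        ;;
    esac
  done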
>> 3) How does continuous replication deal with network outages, say if
>> a link goes down between the Los Angeles and Bristol data centers?
>> Does CouchDB deal with a hanging TCP connection ok?
>
> CouchDB retries requests using a timeout that doubles with every
> failure.  It does this for about 5 minutes, then gives up.

That sounds like it would still require an external script to restart
the replication.  In fact, our Bristol office had a power outage
earlier today that lasted over an hour, so having to write a script to
kick-start replication again would be inconvenient (see the P.S. below
for the sort of thing I mean).

>> 5) I wrote the following Bourne shell script and after running it
>> for an hour, it consumes 100% of a CPU.  This is even after stopping
>> the shell script and compacting both databases.  What would explain
>> this behavior?
>
> I couldn't quite get that script to work ($HOST2 was undefined, and
> then something else failed), but can you try it again with a fresh
> checkout?  I fixed a bug last night that could very well have caused
> this.  Best,

I've attached the latest version of the script, which I just ran.
After multiple runs, letting it run indefinitely each time, something
eventually fails in CouchDB: the script either waits forever for the
key to appear in the other database, or a PUT fails.  The last failure
was a PUT via curl that returned nothing, together with the error
below in my shell.  It scrolled past the top of my terminal, so I
don't have the top of the stack:

** Reason for termination ==
** changes_loop_died

[error] [<0.144.2060>] {error_report,<0.23.0>,
    {<0.144.2060>,crash_report,
     [[{initial_call,{couch_rep,init,['Argument__1']}},
       {pid,<0.144.2060>},
       {registered_name,[]},
       {error_info,{exit,changes_loop_died,
                    [{gen_server,terminate,6},
                     {proc_lib,init_p_do_apply,3}]}},
       {ancestors,[couch_rep_sup,couch_primary_services,
                   couch_server_sup,<0.1.0>]},
       {messages,[]},
       {links,[<0.162.2060>,<0.164.2060>,<0.103.2060>,<0.160.2060>]},
       {dictionary,[{task_status_update,{{1251,317928,943764},0}}]},
       {trap_exit,true},
       {status,running},
       {heap_size,2584},
       {stack_size,24},
       {reductions,2630900}],
      [{neighbour,[{pid,<0.164.2060>},
                   {registered_name,[]},
                   {initial_call,{erlang,apply,2}},
                   {current_function,{gen,wait_resp_mon,3}},
                   {ancestors,[]},
                   {messages,[]},
                   {links,[<0.144.2060>]},
                   {dictionary,[]},
                   {trap_exit,false},
                   {status,waiting},
                   {heap_size,987},
                   {stack_size,17},
                   {reductions,844815}]}]]}}

[error] [<0.160.2060>] ** Generic server <0.160.2060> terminating
** Last message in was {'EXIT',<0.144.2060>,changes_loop_died}
** When Server state == {state,<0.161.2060>,nil,
    {db,<0.153.2060>,<0.154.2060>,nil,
        <<"1251313203438699">>,<0.151.2060>,<0.155.2060>,
        {db_header,4,44914,0,
            {272424021,{6,12656}},
            {272425998,12662},
            {272398470,[]},
            0,nil,nil,1000},
        44914,
        {btree,<0.151.2060>,
            {272424021,{6,12656}},
            #Fun, #Fun, #Fun, #Fun},
        {btree,<0.151.2060>,
            {272425998,12662},
            #Fun, #Fun, #Fun, #Fun},
        {btree,<0.151.2060>,
            {272398470,[]},
            #Fun, #Fun, #Fun,nil},
        44914,<<"db2">>,
        "/tmp/blair/couchdb.git-3/etc/couchdb/../../tmp/lib/db2.couch",
        [],[],nil,
        {user_ctx,null,[<<"_admin">>]},
        nil,1000,
        [before_header,after_header,on_file_open]},
    <0.144.2060>,false,0,
    {<0.163.2060>,#Ref<0.0.54.110898>},
    {[],[]},
    57374,57374,57374}
** Reason for termination ==
** changes_loop_died

Regards,
Blair
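P.S. For question 3), the sort of external kick-start script I'd
rather not have to maintain would look roughly like the following.
This is an untested sketch: it assumes that re-POSTing an
already-running continuous replication to _replicate is harmless,
which I haven't verified on trunk.

  #!/bin/sh
  # Hypothetical watchdog: periodically re-POST the continuous
  # replication requests so that replication is restarted after
  # CouchDB's ~5 minutes of retries have been exhausted.
  # Hosts and database names match replicate.sh.
  HOST1=http://localhost:5984
  HOST2=http://localhost:5984
  while true; do
    curl -s -X POST $HOST1/_replicate \
      -d '{"source": "db1", "target": "db2", "continuous": true}' >/dev/null
    curl -s -X POST $HOST2/_replicate \
      -d '{"source": "db2", "target": "db1", "continuous": true}' >/dev/null
    sleep 60
  done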
[Attachment: replicate.sh]

#!/bin/sh
# Test script: create db1 and db2 on the same server, set up continuous
# replication in both directions, then repeatedly create a document in
# db1, wait for it to appear in db2, delete it there, and wait for the
# deletion to replicate back to db1.

# Make sure curl talks to CouchDB directly, not through a proxy.
unset PROXY
unset http_proxy
unset HTTP_PROXY
unset https_proxy
unset HTTPS_PROXY

HOST1=http://localhost:5984
HOST2=http://localhost:5984
DB1=$HOST1/db1
DB2=$HOST2/db2

#curl -X DELETE $DB1
#curl -X DELETE $DB2
curl -X PUT $DB1
curl -X PUT $DB2

# Continuous replication in both directions.
curl -X POST $HOST1/_replicate -d '{"source": "db1", "target": "db2", "continuous": true}'
curl -X POST $HOST2/_replicate -d '{"source": "db2", "target": "db1", "continuous": true}'

while true; do
  # Use the current time in microseconds as a unique document ID.
  micros=`python -c 'import time; print int(1000000*time.time())'`
  echo Working on $DB1/$micros

  json=`curl -X PUT $DB1/$micros -d "{\"name\": $micros}" 2>/dev/null`
  if test "x$json" = "x"; then
    echo "!!! I just did a PUT but I got nothing back !!!"
    exit 1
  fi

  # Extract the revision from the PUT response using the js shell.
  rev=`js -e "var a = $json; print(a.rev);"`

  # Wait for the document to replicate to db2.
  while curl $DB2/$micros 2>/dev/null | grep error; do
    echo "  Does not exist yet at $DB2/$micros."
    sleep 1
  done
  echo "  It exists now at $DB2/$micros."

  # Delete the document in db2 and wait for the deletion to replicate
  # back to db1.
  curl -X DELETE "$DB2/$micros?rev=$rev" >/dev/null 2>&1
  while curl $DB1/$micros 2>/dev/null | grep _rev; do
    echo "  It has not been deleted yet at $DB1/$micros"
    sleep 1
  done
  echo "  It has been deleted at $DB1/$micros."
  echo
done