From: Mark Hahn
Date: Tue, 11 Oct 2011 05:20:55 -0700
Subject: Re: CouchDB Replication lacking resilience for many database
To: user@couchdb.apache.org
Cc: dev@couchdb.apache.org

It would be nice to have a control panel that displays things like this
message queue's depth, connection counts, memory consumed, CPU consumed,
reads/writes per second, view rebuilds per second, average response times,
etc. I'm sure someone could come up with many more pertinent variables. For
extra credit the values could be plotted against time. When someone has a
problem they could post the log here.
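For illustration, a few of those numbers can already be sampled from a remsh
session attached to the node running CouchDB, using plain Erlang BIFs. This
is only a rough sketch; nothing in it is CouchDB-specific apart from the
registered couch_server process name:

    %% Sketch only: sample a few VM-level numbers from a remsh shell
    %% attached to the CouchDB node.
    {message_queue_len, QLen} =
        process_info(whereis(couch_server), message_queue_len),
    io:format("couch_server queue: ~p, processes: ~p, memory: ~p bytes, run queue: ~p~n",
              [QLen,
               erlang:system_info(process_count),
               erlang:memory(total),
               erlang:statistics(run_queue)]).

Connection counts, reads/writes per second, and response times would have to
come from CouchDB's own statistics (e.g. the /_stats resource) rather than
from the VM.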
On Mon, Oct 10, 2011 at 10:15 PM, Paul Davis wrote:

> On Mon, Oct 10, 2011 at 11:03 PM, Chris Stockton wrote:
> > Hello,
> >
> > On Mon, Oct 10, 2011 at 5:18 PM, Adam Kocoloski wrote:
> >> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
> >>
> >>> Hello,
> >>>
> >>> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana wrote:
> >>>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton wrote:
> >>>> Chris,
> >>>>
> >>>> That work is in the '1.2.x' branch (and master).
> >>>> CouchDB recently migrated from SVN to Git, see:
> >>>> http://couchdb.apache.org/community/code.html
> >>>>
> >>> Thank you very much for the response, Filipe. Do you have any
> >>> documentation or a more detailed summary of what these changes
> >>> include and their possible benefits? I would love to hear about any
> >>> tuning or replication tips you may have for our growth issues;
> >>> perhaps you could answer a basic question if nothing else: do the
> >>> changes in this branch minimize the performance impact of continuous
> >>> replication on many databases?
> >>>
> >>> Regardless, I plan on getting a build of that branch and doing some
> >>> testing of my own very soon.
> >>>
> >>> Thank you!
> >>>
> >>> -Chris
> >>
> >> I'm pretty sure that even in 1.2.x and master each replication with a
> >> remote source still requires one dedicated TCP connection to consume
> >> the _changes feed. Replications with a local source have always been
> >> able to use a connection pool per host:port combination. That's not
> >> to downplay the significance of the rewrite of the replicator in
> >> 1.2.x; Filipe put quite a lot of time into it.
> >>
> >> The link to "those darn errors" just pointed to the mbox browser for
> >> September 2011. Do you have a more specific link?
> >>
> >> Regards,
> >> Adam
> >
> > Well, I will remain optimistic that the rewrite has solved several of
> > my issues regardless. I guess the idle TCP connections by themselves
> > are not too bad; it's when they all start working simultaneously that
> > it becomes an issue =)
> >
> > Sorry Adam, here is a better link:
> > http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/%3CCALKFbxuugLJJY-NH46U0u584L+XDqM3NGSpeNxsJyrxosPEuCg@mail.gmail.com%3E
> > The actual text was:
> >
> > ---------------
> >
> > It seems that I am randomly getting errors about crashes as our
> > replicator runs; all this replicator does is make sure that all
> > databases on the master server replicate to our failover by checking
> > their status.
> >
> > Details:
> > - I notice the below error in the logs, anywhere from 0 to 30 at a time.
> > - It seems that a database might start replicating okay and then stop.
> > - These errors [1] are on the failover pulling from master.
> > - No errors are displayed on the master server.
> > - The databases inside the URL in the db_not_found portion of the
> >   error are always available via curl from the failover machine, which
> >   makes the error strange; somehow it thinks it can't find the database.
> > - Master seems healthy at all times, all databases are available, and
> >   there are no errors in its log.
> >
> > [1] --
> > [Mon, 12 Sep 2011 18:34:14 GMT] [error] [<0.22466.5305>]
> > {error_report,<0.30.0>,
> >     {<0.22466.5305>,crash_report,
> >         [[{initial_call,{couch_rep,init,['Argument__1']}},
> >           {pid,<0.22466.5305>},
> >           {registered_name,[]},
> >           {error_info,
> >               {exit,
> >                   {db_not_found,
> >                       <<"http://user:pass@server:5984/db_10944/">>},
> >                   [{gen_server,init_it,6},
> >                    {proc_lib,init_p_do_apply,3}]}},
> >           {ancestors,
> >               [couch_rep_sup,couch_primary_services,
> >                couch_server_sup,<0.31.0>]},
> >           {messages,[]},
> >           {links,[<0.81.0>]},
> >           {dictionary,[]},
> >           {trap_exit,true},
> >           {status,running},
> >           {heap_size,2584},
> >           {stack_size,24},
> >           {reductions,794}],
> >          []]}}
> >
>
> One place I've seen this error pop up when it looks like it shouldn't is
> if couch_server gets backed up. If you remsh into one of those nodes you
> could try the following:
>
>     process_info(whereis(couch_server), message_queue_len).
>
> And if that number keeps growing, that could be the issue.
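To see whether that number really does keep growing, one quick way is to
sample it repeatedly from the same remsh shell. A minimal sketch, using only
standard Erlang calls (nothing CouchDB-specific beyond the registered
couch_server name):

    %% Sample the couch_server message queue length once per second for
    %% ten seconds; a steadily climbing value suggests couch_server is
    %% backed up, as described above.
    lists:foreach(
        fun(_) ->
            {message_queue_len, Len} =
                process_info(whereis(couch_server), message_queue_len),
            io:format("couch_server message_queue_len: ~p~n", [Len]),
            timer:sleep(1000)
        end,
        lists:seq(1, 10)).

If the length climbs across the samples while replications are active, the
couch_server backlog Paul describes is a likely culprit.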