From: Paul Davis
Date: Tue, 11 Oct 2011 00:15:51 -0500
Subject: Re: CouchDB Replication lacking resilience for many database
To: user@couchdb.apache.org
Cc: dev@couchdb.apache.org

On Mon, Oct 10, 2011 at 11:03 PM, Chris Stockton wrote:
> Hello,
>
> On Mon, Oct 10, 2011 at 5:18 PM, Adam Kocoloski wrote:
>> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
>>
>>> Hello,
>>>
>>> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana
>>> wrote:
>>>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton
>>>> wrote:
>>>> Chris,
>>>>
>>>> That said work is in the '1.2.x' branch (and master).
>>>> CouchDB recently migrated from SVN to GIT, see:
>>>> http://couchdb.apache.org/community/code.html
>>>>
>>>
>>> Thank you very much for the response Filipe, do you possibly have any
>>> documentation or more detailed summary on what these changes include
>>> and possible benefits of them? I would love to hear about any tweaking
>>> or replication tips you may have for our growth issues, perhaps you
>>> could answer a basic question if nothing else: Do the changes in this
>>> branch minimize the performance impact of continuous replication on
>>> many databases?
>>>
>>> Regardless I plan on getting a build of that branch and doing some
>>> testing of my own very soon.
>>>
>>> Thank you!
>>>
>>> -Chris
>>
>> I'm pretty sure that even in 1.2.x and master each replication with a
>> remote source still requires one dedicated TCP connection to consume
>> the _changes feed.  Replications with a local source have always been
>> able to use a connection pool per host:port combination.
>> That's not to downplay the significance of the rewrite of the
>> replicator in 1.2.x; Filipe put quite a lot of time into it.
>>
>> The link to "those darn errors" just pointed to the mbox browser for
>> September 2011.  Do you have a more specific link?  Regards,
>>
>> Adam
>
> Well I will remain optimistic that the rewrite could hopefully have
> solved several of my issues regardless I hope. I guess the idle TCP
> connections by themselves are not too bad, when they all start to work
> simultaneously I think is what becomes the issue =)
>
> Sorry Adam, here is a better link
> http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/%3CCALKFbxuugLJJY-NH46U0u584L+XDqM3NGSpeNxsJyrxosPEuCg@mail.gmail.com%3E,
> the actual text was:
>
> ---------------
>
> It seems that randomly I am getting errors about crashes as our
> replicator runs, all this replicator does is make sure that all
> databases on the master server replicate to our failover by checking
> status.
>
> Details:
>  - I notice the below error in the logs, anywhere from 0 to 30 at a time.
>  - It seems that a database might start replicating okay then stop.
>  - These errors [1] are on the failover pulling from master
>  - No errors are displayed on the master server
>  - The databases inside the URL in the db_not_found portion of the
> error, are always available from curl from the failover machine, which
> makes the error strange, somehow it thinks it can't find the database
>  - Master seems healthy at all times, all database are available, no
> errors in log
>
> [1] --
>  [Mon, 12 Sep 2011 18:34:14 GMT] [error] [<0.22466.5305>]
> {error_report,<0.30.0>,
>                          {<0.22466.5305>,crash_report,
>                           [[{initial_call,{couch_rep,init,['Argument__1']}},
>                             {pid,<0.22466.5305>},
>                             {registered_name,[]},
>                             {error_info,
>                              {exit,
>                               {db_not_found,
>                                <<"http://user:pass@server:5984/db_10944/">>},
>                               [{gen_server,init_it,6},
>                                {proc_lib,init_p_do_apply,3}]}},
>                             {ancestors,
>                              [couch_rep_sup,couch_primary_services,
>                               couch_server_sup,<0.31.0>]},
>                             {messages,[]},
>                             {links,[<0.81.0>]},
>                             {dictionary,[]},
>                             {trap_exit,true},
>                             {status,running},
>                             {heap_size,2584},
>                             {stack_size,24},
>                             {reductions,794}],
>                            []]}}
>

One place I've seen this error pop up when it looks like it shouldn't
is if couch_server gets backed up. If you remsh into one of those
db's you could try the following:

    > process_info(whereis(couch_server), message_queue_len).

And if that number keeps growing, that could be the issue.
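
To make "keeps growing" a bit more concrete, here is a minimal Python
sketch. It assumes you run the process_info call several times at a
fixed interval and write down each reading by hand. The function name,
the min_samples threshold and the sample numbers are all made up for
illustration, and nothing here talks to CouchDB.

```python
def queue_keeps_growing(samples, min_samples=3):
    """Return True if every reading is larger than the one before it.

    `samples` is a list of message_queue_len readings, oldest first,
    collected by hand from repeated process_info calls (an assumption
    of this sketch; it does not query CouchDB itself).
    """
    if len(samples) < min_samples:
        # Too few readings to call it a trend.
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# Hypothetical readings taken a few seconds apart.
print(queue_keeps_growing([12, 85, 340, 1210]))  # steadily rising backlog
print(queue_keeps_growing([3, 0, 5, 1]))         # readings that go up and down
```

The check is deliberately strict: a queue that rises and then drains
again does not count, since only a sustained climb matches the "keeps
growing" case described above.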