From: Jan Lehnardt
Subject: Re: CouchDB Replication lacking resilience for many database
Date: Tue, 11 Oct 2011 16:03:21 +0200
To: user@couchdb.apache.org
Cc: dev@couchdb.apache.org
References: <2903A6F8-BF28-4DA8-80CC-B7B45ADD4057@apache.org>

On Oct 11, 2011, at 14:20 , Mark Hahn wrote:

> It would be nice to have a control panel that displays things like this
> message queue depth, connection counts, memory consumed, cpu consumed,
> reads/writes per second, view rebuilds/sec, avg response times, etc. I'm
> sure someone could come up with many more pertinent vars.
>
> For extra credit the values could be plotted against time. When someone has
> a problem they could post the log here.

See /_stats :)

It doesn't have all the things you ask for, but adding new stats isn't hard:

  http://wiki.apache.org/couchdb/Adding_Runtime_Statistics

Cheers
Jan
--

>
> On Mon, Oct 10, 2011 at 10:15 PM, Paul Davis wrote:
>
>> On Mon, Oct 10, 2011 at 11:03 PM, Chris Stockton wrote:
>>> Hello,
>>>
>>> On Mon, Oct 10, 2011 at 5:18 PM, Adam Kocoloski wrote:
>>>> On Oct 10, 2011, at 8:02 PM, Chris Stockton wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> On Mon, Oct 10, 2011 at 4:19 PM, Filipe David Manana wrote:
>>>>>> On Tue, Oct 11, 2011 at 12:03 AM, Chris Stockton wrote:
>>>>>> Chris,
>>>>>>
>>>>>> That said work is in the '1.2.x' branch (and master).
>>>>>> CouchDB recently migrated from SVN to GIT, see:
>>>>>> http://couchdb.apache.org/community/code.html
>>>>>>
>>>>>
>>>>> Thank you very much for the response Filipe, do you possibly have any
>>>>> documentation or more detailed summary on what these changes include
>>>>> and possible benefits of them? I would love to hear about any tweaking
>>>>> or replication tips you may have for our growth issues, perhaps you
>>>>> could answer a basic question if nothing else: Do the changes in this
>>>>> branch minimize the performance impact of continuous replication on
>>>>> many databases?
>>>>>
>>>>> Regardless I plan on getting a build of that branch and doing some
>>>>> testing of my own very soon.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> -Chris
>>>>
>>>> I'm pretty sure that even in 1.2.x and master each replication with a
>>>> remote source still requires one dedicated TCP connection to consume the
>>>> _changes feed. Replications with a local source have always been able to
>>>> use a connection pool per host:port combination. That's not to downplay the
>>>> significance of the rewrite of the replicator in 1.2.x; Filipe put quite a
>>>> lot of time into it.
>>>>
>>>> The link to "those darn errors" just pointed to the mbox browser for
>>>> September 2011. Do you have a more specific link? Regards,
>>>>
>>>> Adam
>>>
>>> Well I will remain optimistic that the rewrite could hopefully have
>>> solved several of my issues regardless I hope. I guess the idle TCP
>>> connections by themselves are not too bad, when they all start to work
>>> simultaneously I think is what becomes the issue =)
>>>
>>> Sorry Adam, here is a better link
>>> http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/%3CCALKFbxuugLJJY-NH46U0u584L+XDqM3NGSpeNxsJyrxosPEuCg@mail.gmail.com%3E
>>> ,
>>> the actual text was:
>>>
>>> ---------------
>>>
>>> It seems that randomly I am getting errors about crashes as our
>>> replicator runs, all this replicator does is make sure that all
>>> databases on the master server replicate to our failover by checking
>>> status.
>>>
>>> Details:
>>> - I notice the below error in the logs, anywhere from 0 to 30 at a time.
>>> - It seems that a database might start replicating okay then stop.
>>> - These errors [1] are on the failover pulling from master
>>> - No errors are displayed on the master server
>>> - The databases inside the URL in the db_not_found portion of the
>>> error, are always available from curl from the failover machine, which
>>> makes the error strange, somehow it thinks it can't find the database
>>> - Master seems healthy at all times, all database are available, no
>>> errors in log
>>>
>>> [1] --
>>> [Mon, 12 Sep 2011 18:34:14 GMT] [error] [<0.22466.5305>]
>>> {error_report,<0.30.0>,
>>>   {<0.22466.5305>,crash_report,
>>>    [[{initial_call,{couch_rep,init,['Argument__1']}},
>>>      {pid,<0.22466.5305>},
>>>      {registered_name,[]},
>>>      {error_info,
>>>       {exit,
>>>        {db_not_found,
>>>         <<"http://user:pass@server:5984/db_10944/">>},
>>>        [{gen_server,init_it,6},
>>>         {proc_lib,init_p_do_apply,3}]}},
>>>      {ancestors,
>>>       [couch_rep_sup,couch_primary_services,
>>>        couch_server_sup,<0.31.0>]},
>>>      {messages,[]},
>>>      {links,[<0.81.0>]},
>>>      {dictionary,[]},
>>>      {trap_exit,true},
>>>      {status,running},
>>>      {heap_size,2584},
>>>      {stack_size,24},
>>>      {reductions,794}],
>>>     []]}}
>>>
>>
>> One place I've seen this error pop up when it looks like it shouldn't
>> is if couch_server gets backed up. If you remsh into one of those db's
>> you could try the following:
>>
>>     > process_info(whereis(couch_server), message_queue_len).
>>
>> And if that number keeps growing, that could be the issue.
>>
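
A hedged aside on the check Paul describes: a single reading of
message_queue_len says little, since what matters is whether the number keeps
growing across readings. Below is a minimal Python sketch, assuming you have
noted down a few readings by hand from repeated process_info calls in remsh.
The function name keeps_growing and the min_samples threshold are hypothetical,
chosen only for illustration; nothing here is part of CouchDB.

```python
def keeps_growing(samples, min_samples=3):
    """Return True if every reading is strictly larger than the previous one.

    samples: message_queue_len readings, oldest first, collected by hand
             (assumption: taken a few seconds apart).
    min_samples: hypothetical threshold; with fewer readings we refuse to
                 call it a trend and return False.
    """
    if len(samples) < min_samples:
        return False
    # Compare each reading with the one that follows it.
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


if __name__ == "__main__":
    print(keeps_growing([12, 40, 95, 310]))  # steadily rising queue
    print(keeps_growing([12, 0, 3, 0]))      # queue drains between readings
```

A strictly rising series is a deliberately strict definition of "keeps
growing"; a queue that spikes and then drains is treated as healthy here.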