From user-return-18255-apmail-couchdb-user-archive=couchdb.apache.org@couchdb.apache.org Mon Oct 10 23:04:02 2011 Return-Path: X-Original-To: apmail-couchdb-user-archive@www.apache.org Delivered-To: apmail-couchdb-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2533B7FE2 for ; Mon, 10 Oct 2011 23:04:02 +0000 (UTC) Received: (qmail 74875 invoked by uid 500); 10 Oct 2011 23:03:59 -0000 Delivered-To: apmail-couchdb-user-archive@couchdb.apache.org Received: (qmail 74741 invoked by uid 500); 10 Oct 2011 23:03:59 -0000 Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@couchdb.apache.org Delivered-To: mailing list user@couchdb.apache.org Received: (qmail 74698 invoked by uid 99); 10 Oct 2011 23:03:59 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 10 Oct 2011 23:03:59 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=FREEMAIL_FROM,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of chrisstocktonaz@gmail.com designates 209.85.210.180 as permitted sender) Received: from [209.85.210.180] (HELO mail-iy0-f180.google.com) (209.85.210.180) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 10 Oct 2011 23:03:51 +0000 Received: by iakc1 with SMTP id c1so6596478iak.11 for ; Mon, 10 Oct 2011 16:03:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:date:message-id:subject:from:to:content-type; bh=0nypEgpzVIKYRKS/hQ8AwCIghyNNuM+IPMnXPg4gfQ0=; b=E0YhtJX0e8t3cRerhBhIWqomUgPPiVnIBvkM6pd7LDW60ekXFjrmnqiW0JIrGEhmzp HtUdvjbRyXWcO8H9/KB+/qOXxZ4iXN1xT5XnmMG0KzbMnVDNiiTGEgU9D/GVtupRIazo fNJLlx1Kd+ZIrteyB9amQ6lc1dJg6/0GIJ1SE= MIME-Version: 1.0 Received: by 10.42.147.196 with SMTP id o4mr11423181icv.33.1318287810582; Mon, 10 Oct 2011 16:03:30 -0700 (PDT) Received: by 10.42.170.132 with HTTP; Mon, 10 Oct 2011 16:03:30 -0700 (PDT) Date: Mon, 10 Oct 2011 16:03:30 -0700 Message-ID: Subject: CouchDB Replication lacking resilience for many database From: Chris Stockton To: user@couchdb.apache.org, dev@couchdb.apache.org Content-Type: text/plain; charset=ISO-8859-1 X-Virus-Checked: Checked by ClamAV on apache.org Hello, I will try to keep this short as possible, the condensed version of this is I am looking to the developers and experienced users of couchdb to offer me some guidance so I can plan how to proceed in solving our current growth and capacity issues in regards to replication. I have been using couchdb for about a year now in a production environment and nearing 2 years developing with it total, as our customer base has grown to thousands of users, couchdb has kept up with reads, writes and general performance pretty well. There has been one constant pain for us that has used a lot of system operation and developer time, and that is replication. Our system is broken into pods, each pod has 3 web servers, a couchdb master server (all writes and reads come from here), and a couchdb failover server which pulls from the master, ONLY for redundancy, not reads. We have over 5000 databases, and a application that runs that watches the STATUS on the failover machine to make sure new added databases and existing databases are running a "continuous" replication task. This means, 5000 replication tasks running at all times along with the tax and resources needed for these tasks. I have had trouble with planning out how to scale and tune our systems to allow couchdb to push the large machines it is running on to their full potential. Although I was able to find some general information on erlang VM tuning thankfully which has allowed us to increase our limit and give us "new errors", we still seem to be unable to get everything we want out of these machines yet, we never peg CPU, memory etc, just hit those darn errors in [1]. Since no real world studies, white papers or general resources for tackling enterprise level capacity issues exist for CouchDB I feel a bit lost. This is something I am wanting to change and am very willing to take part in. A couple posts about my struggles thus far are located here [1] and here [2], I have heard from these threads that a individual has been working on major changes to replication, to use a small pool of TCP connections instead of a 1 to 1 ration. I think this would improve our situation drastically, but I was unable to find said work anywhere in SVN or on the web to test or look into it's design concepts. If said work is not actually under development yet, I am willing to put in some development time on the weekends to learn erlang and improve this portion of couchdb, but I do not want to volunteer this time if the work has already been complete. Basically at this point, the 3 am NOC calls and daily restarts of couchdb from the [1] error causing dbs to stop replicating are really impacting me. I need to find a solution for this, start coding a erlang patch of my own or look at much less lucrative options... CouchDB is a great product and I am glad we chose it, I just need a little help here.. The things I have thought of or considered: 1) Create / implement 'server wide' replication within erlang, seems like a place where such a task could be optimized and made much lighter 2) I am doing it wrong!!! I should not have a application ensuring all 5000 dbs have a continuous replication task, but instead keep a small pool of continuous replication for the "active" databases or top / busy users, the changes feed or other things might be of use here. The only problem with this is it isn't entirely fair to the customers if the master dies and we failover, that the busy / active database would take precedence over the little guy. 3) Give up or change our storage architecture because couchdb is unable to provide replication stability for our use patterns. Thank you very much for any suggestions, thoughts or resources you can point me to, -Chris [1] http://mail-archives.apache.org/mod_mbox/couchdb-user/201109.mbox/browser [2] http://mail-archives.apache.org/mod_mbox/couchdb-user/201105.mbox/%3CBANLkTimetq8nvwj4J356cMo0FcWMN=TUhw@mail.gmail.com%3E