couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Stockton <>
Subject CouchDB Replication lacking resilience for many database
Date Mon, 10 Oct 2011 23:03:30 GMT

I will try to keep this short as possible, the condensed version of
this is I am looking to the developers and experienced users of
couchdb to offer me some guidance so I can plan how to proceed in
solving our current growth and capacity issues in regards to

I have been using couchdb for about a year now in a production
environment and nearing 2 years developing with it total, as our
customer base has grown to thousands of users, couchdb has kept up
with reads, writes and general performance pretty well. There has been
one constant pain for us that has used a lot of system operation and
developer time, and that is replication.

Our system is broken into pods, each pod has 3 web servers, a couchdb
master server (all writes and reads come from here), and a couchdb
failover server which pulls from the master, ONLY for redundancy, not
reads. We have over 5000 databases, and a application that runs that
watches the STATUS on the failover machine to make sure new added
databases and existing databases are running a "continuous"
replication task. This means, 5000 replication tasks running at all
times along with the tax and resources needed for these tasks. I have
had trouble with planning out how to scale and tune our systems to
allow couchdb to push the large machines it is running on to their
full potential. Although I was able to find some general information
on erlang VM tuning thankfully which has allowed us to increase our
limit and give us "new errors", we still seem to be unable to get
everything we want out of these machines yet, we never peg CPU, memory
etc, just hit those darn errors in [1]. Since no real world studies,
white papers or general resources for tackling enterprise level
capacity issues exist for CouchDB I feel a bit lost. This is something
I am wanting to change and am very willing to take part in.

A couple posts about my struggles thus far are located here [1] and
here [2], I have heard from these threads that a individual has been
working on major changes to replication, to use a small pool of TCP
connections instead of a 1 to 1 ration. I think this would improve our
situation drastically, but I was unable to find said work anywhere in
SVN or on the web to test or look into it's design concepts. If said
work is not actually under development yet, I am willing to put in
some development time on the weekends to learn erlang and improve this
portion of couchdb, but I do not want to volunteer this time if the
work has already been complete.

Basically at this point, the 3 am NOC calls and daily restarts of
couchdb from the [1] error causing dbs to stop replicating are really
impacting me. I need to find a solution for this, start coding a
erlang patch of my own or look at much less lucrative options...
CouchDB is a great product and I am glad we chose it, I just need a
little help here..

The things I have thought of or considered:
  1) Create / implement 'server wide' replication within erlang, seems
like a place where such a task could be optimized and made much
  2) I am doing it wrong!!! I should not have a application ensuring
all 5000 dbs have a continuous replication task, but instead keep a
small pool of continuous replication for the "active" databases or top
/ busy users, the changes feed or other things might be of use here.
The only problem with this is it isn't entirely fair to the customers
if the master dies and we failover, that the busy / active database
would take precedence over the little guy.
  3) Give up or change our storage architecture because couchdb is
unable to provide replication stability for our use patterns.

Thank you very much for any suggestions, thoughts or resources you can
point me to,



View raw message