couchdb-dev mailing list archives

From "Eli Stevens (JIRA)" <>
Subject [jira] [Created] (COUCHDB-2240) Many continuous replications cause DOS
Date Fri, 16 May 2014 10:25:28 GMT
Eli Stevens created COUCHDB-2240:

             Summary: Many continuous replications cause DOS
                 Key: COUCHDB-2240
             Project: CouchDB
          Issue Type: Bug
      Security Level: public (Regular issues)
            Reporter: Eli Stevens

Currently, I can configure an arbitrary number of replications between localhost DBs (in my
case, they are documents in the _replicator DB with continuous set to true). However, there is a
limit beyond which requests to the DB start to fail. Trying to start another replication fails
with the error:

ServerError: (500, ('checkpoint_commit_failure', "Target database out of sync. Try to increase
max_dbs_open at the target's server."))

Due to COUCHDB-2239, it's not clear what the actual issue is. 
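
For reference, here is a minimal sketch of the kind of _replicator document described above.
The source, target, and continuous fields are the standard replication document fields; the
helper name, document ID, and database names are made up for illustration and are not taken
from the reporter's setup.

```python
import json


def make_replication_doc(doc_id, source, target, continuous=True):
    """Build the body of a _replicator document (hypothetical helper)."""
    return {
        "_id": doc_id,
        "source": source,          # local DB name or full URL
        "target": target,          # local DB name or full URL
        "continuous": continuous,  # keep replicating as new changes arrive
    }


# Assumed naming scheme: one DB and one replication document per batch.
doc = make_replication_doc("batch-0001-rep", "batch-0001", "results")
print(json.dumps(doc, indent=2, sort_keys=True))
```

Saving one such document per batch into _replicator is what leaves one long-lived replication
running per batch.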

I also believe that while the DB was in this state, GET requests to documents were also failing,
but the machine that has the logs of this has already had its drives wiped. If need be, I
can recreate the situation and provide those logs as well.

I think that instead of a single fixed pool of resources that causes errors when exhausted,
the system should have a per-task-type pool of resources that results in performance
degradation when exhausted. For example: N replication workers sharing P DB connections, and if
that's not enough, they start to round-robin. When a user has too much to replicate, it gets
slow instead of failing.
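
The round-robin idea above could look roughly like the following toy scheduler. This is only a
sketch of the proposed behavior, not how CouchDB schedules replications; the class name, the
tick model, and the job names are all assumptions made for illustration.

```python
from collections import deque


class RoundRobinPool:
    """Toy scheduler: at most `workers` jobs are serviced per tick.

    Extra jobs are never rejected; they wait for their turn in the rotation.
    """

    def __init__(self, workers):
        if workers < 1:
            raise ValueError("workers must be >= 1")
        self.workers = workers
        self.jobs = deque()

    def add(self, job):
        # Adding a job never fails; it only makes the rotation longer.
        self.jobs.append(job)

    def tick(self):
        """Return the jobs serviced this tick and move them to the back."""
        count = min(self.workers, len(self.jobs))
        batch = [self.jobs.popleft() for _ in range(count)]
        self.jobs.extend(batch)
        return batch


pool = RoundRobinPool(workers=2)
for name in ["rep-a", "rep-b", "rep-c", "rep-d", "rep-e"]:
    pool.add(name)

# Five jobs on two workers: every job still gets serviced, just less often.
schedule = [pool.tick() for _ in range(3)]
print(schedule)
```

With more jobs than workers, each replication simply waits longer between turns, which is the
"slow instead of failing" behavior asked for here.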

As it stands now, I have a potentially large number of continuous replications that produce
a fixed rate of data to replicate (because there's a fixed application worker pool that writes
the data in the first place). We use a DB+replication per batch of data to process, and if
we receive a burst of batches, then couchdb starts failing. The current setup means that I'm
always going to be playing chicken between burst size and whatever configured limit we're hitting.
That sucks, and isn't acceptable for a production system, so we're going to have to re-architect
how we do replication and basically implement a poor man's continuous replication by doing
one-off replications at various points of our data processing runs.
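
A one-off replication of the kind mentioned above can be triggered by POSTing a source/target
pair to the server's /_replicate endpoint. The sketch below only builds the request so that it
runs without a server; the helper name, server URL, and database names are assumptions, not
the reporter's actual configuration.

```python
import json
import urllib.request


def build_one_off_replication(server, source, target):
    """Build (but do not send) a POST /_replicate request (hypothetical helper)."""
    # Leaving "continuous" out of the body makes the replication run once and stop.
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    return urllib.request.Request(
        url=server.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_one_off_replication("http://localhost:5984", "batch-0001", "results")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

Calling this at chosen points in a processing run trades replication latency for a bounded
number of replications active at any one time.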

This message was sent by Atlassian JIRA
