couchdb-dev mailing list archives

From "Alexander Shorin (JIRA)" <>
Subject [jira] [Commented] (COUCHDB-2240) Many continuous replications cause DOS
Date Sat, 17 May 2014 00:54:15 GMT


Alexander Shorin commented on COUCHDB-2240:

Resource usage limiting is always a good idea, since it protects your whole system from a DoS
caused by some app/service (CouchDB in our case) consuming every available resource. The common
and best strategy is to monitor the open_databases value, treating max_dbs_open as the critical
level (you would also want to set a warning level at 80%-90% of that value), and increase the
limit as actual usage dictates.
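A minimal sketch of that threshold strategy (the classification logic only; feeding it the live open_databases value from CouchDB's /_stats endpoint and max_dbs_open from the server config is left to your monitoring setup, and the function name and ratios here are illustrative assumptions):

```python
def classify_open_dbs(open_databases, max_dbs_open,
                      warn_ratio=0.8, crit_ratio=0.9):
    """Classify current open_databases usage against the max_dbs_open limit.

    Returns "ok", "warning" (>= 80% of the limit by default), or
    "critical" (>= 90% of the limit by default), matching the
    warning/critical levels suggested above.
    """
    usage = open_databases / max_dbs_open
    if usage >= crit_ratio:
        return "critical"
    if usage >= warn_ratio:
        return "warning"
    return "ok"
```

A monitoring check would call this on each poll and raise the limit (or alert) whenever the status leaves "ok".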

A similar problem exists with file descriptors: you can set their ulimit to infinity, but
that's not a very wise idea. Better to monitor their usage and set limits that actually fit
your environment.
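As one way to watch those numbers (a Linux-specific sketch that counts /proc/self/fd entries for the current process; on a real deployment you would point it at the CouchDB beam process instead):

```python
import os
import resource

def fd_usage():
    """Return (open_fds, soft_limit) for the current process.

    Counts entries under /proc/self/fd (Linux-specific) and reads the
    soft RLIMIT_NOFILE limit -- the same pair of numbers you compare
    when deciding whether a ulimit actually fits your environment.
    """
    open_fds = len(os.listdir("/proc/self/fd"))
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit
```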

> Many continuous replications cause DOS
> --------------------------------------
>                 Key: COUCHDB-2240
>                 URL:
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public (Regular issues)
>            Reporter: Eli Stevens
> Currently, I can configure an arbitrary number of replications between localhost DBs
(in my case, they are in the _replicator DB with continuous set to true). However, there is
a limit beyond which requests to the DB start to fail.  Trying to do another replication fails
with the error:
> ServerError: (500, ('checkpoint_commit_failure', "Target database out of sync. Try to
increase max_dbs_open at the target's server."))
> Due to COUCHDB-2239, it's not clear what the actual issue is. 
> I also believe that while the DB was in this state, GET requests to documents were also
failing, but the machine that has the logs of this has already had its drives wiped. If need
be, I can recreate the situation and provide those logs as well.
> I think that instead of there being a single fixed pool of resources that cause errors
when exhausted, the system should have a per-task-type pool of resources that result in performance
degradation when exhausted. N replication workers with P DB connections, and if that's not
enough they start to round-robin; that sort of thing. When a user has too much to replicate,
it gets slow instead of failing.
> As it stands now, I have a potentially large number of continuous replications that produce
a fixed rate of data to replicate (because there's a fixed application worker pool that writes
the data in the first place). We use a DB+replication per batch of data to process, and if
we receive a burst of batches, then couchdb starts failing. The current setup means that I'm
always going to be playing chicken between burst size and whatever setting limit we're hitting.
 That sucks, and isn't acceptable for a production system, so we're going to have to re-architect
how we do replication, and basically implement poor-man's continuous by doing one off replications
at various points of our data processing runs.
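The per-task-type pool with round-robin fallback proposed in the quoted issue could look roughly like this (a minimal illustration of the degradation-instead-of-failure idea; RoundRobinPool and its handles are hypothetical names, not CouchDB internals):

```python
from itertools import cycle

class RoundRobinPool:
    """Fixed pool of handles; when demand exceeds the pool size,
    callers share the existing handles round-robin instead of erroring out."""

    def __init__(self, handles):
        self._handles = list(handles)
        self._cycle = cycle(self._handles)

    def acquire(self):
        # Always succeeds: excess demand degrades throughput
        # rather than availability.
        return next(self._cycle)
```

With N replication workers each holding such a pool of P connections, a burst of replications slows down instead of returning checkpoint_commit_failure errors.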

This message was sent by Atlassian JIRA
