couchdb-dev mailing list archives

From "Eli Stevens (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (COUCHDB-2240) Many continuous replications cause DOS
Date Sat, 17 May 2014 00:05:00 GMT

    [ https://issues.apache.org/jira/browse/COUCHDB-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000535#comment-14000535 ]

Eli Stevens commented on COUCHDB-2240:
--------------------------------------

Depends on your settings. My understanding is that the issue is caused by the default value of max_dbs_open, which is 100. I can raise that value arbitrarily high and hope that I never need to process a burst of activity greater than my arbitrarily high value, but that's not really desirable either, since I would much prefer to have an arbitrarily high number of replications going that only use a fixed pool of resources (and if those resources end up taxed, then performance degrades).

I understand that this is not a small change that I'm suggesting.
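
For reference, a minimal sketch of raising that limit at runtime, assuming CouchDB 1.x's _config HTTP API and the Python requests library; the localhost URL and the value 500 are placeholders, not recommendations:

    import requests

    # Sketch: raise max_dbs_open at runtime via the _config API (CouchDB 1.x).
    # The same setting can also be put in the [couchdb] section of local.ini.
    resp = requests.put(
        "http://localhost:5984/_config/couchdb/max_dbs_open",
        json="500",  # value is sent as a JSON string; the response is the old value
    )
    resp.raise_for_status()
    print("previous value:", resp.json())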

> Many continuous replications cause DOS
> --------------------------------------
>
>                 Key: COUCHDB-2240
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2240
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues) 
>            Reporter: Eli Stevens
>
> Currently, I can configure an arbitrary number of replications between localhost DBs (in my case, they are in the _replicator DB with continuous set to true). However, there is a limit beyond which requests to the DB start to fail. Trying to do another replication fails with the error:
> ServerError: (500, ('checkpoint_commit_failure', "Target database out of sync. Try to increase max_dbs_open at the target's server."))
> Due to COUCHDB-2239, it's not clear what the actual issue is. 
> I also believe that while the DB was in this state, GET requests to documents were also failing, but the machine that has the logs of this has already had its drives wiped. If need be, I can recreate the situation and provide those logs as well.
> I think that instead of there being a single fixed pool of resources that causes errors when exhausted, the system should have a per-task-type pool of resources that results in performance degradation when exhausted. N replication workers with P DB connections, and if that's not enough they start to round-robin; that sort of thing. When a user has too much to replicate, it gets slow instead of failing.
> As it stands now, I have a potentially large number of continuous replications that produce a fixed rate of data to replicate (because there's a fixed application worker pool that writes the data in the first place). We use a DB+replication per batch of data to process, and if we receive a burst of batches, then CouchDB starts failing. The current setup means that I'm always going to be playing chicken between burst size and whatever setting limit we're hitting. That sucks, and isn't acceptable for a production system, so we're going to have to re-architect how we do replication and basically implement a poor man's continuous replication by doing one-off replications at various points of our data processing runs.
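
For context, a rough sketch of the kind of per-batch continuous replication described in the quoted report, set up by writing a document into the _replicator database; the database names, document id, and URL are placeholders:

    import requests

    # Sketch: one continuous replication per batch, created by writing a doc
    # into the _replicator database (admin credentials may be required).
    doc = {
        "source": "batch_0042",
        "target": "batch_0042_processed",
        "continuous": True,
    }
    resp = requests.put("http://localhost:5984/_replicator/repl-batch_0042", json=doc)
    resp.raise_for_status()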
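
And a purely conceptual illustration (not CouchDB internals) of the "slow instead of failing" behaviour the report asks for: a fixed pool of N workers where a burst of jobs queues up rather than erroring out:

    from concurrent.futures import ThreadPoolExecutor

    N_WORKERS = 4  # the fixed "N replication workers"; the value is arbitrary

    def replicate(job_id):
        # Stand-in for a single replication task.
        return "replicated %s" % job_id

    # Submitting far more jobs than workers never fails; the extra jobs wait
    # in the executor's queue, so throughput degrades instead of erroring.
    with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
        futures = [pool.submit(replicate, i) for i in range(100)]
        results = [f.result() for f in futures]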



--
This message was sent by Atlassian JIRA
(v6.2#6252)
