couchdb-dev mailing list archives

From Nathan Vander Wilt <nat...@gmail.com>
Subject Re: Half-baked idea: incremental virtual databases
Date Mon, 04 Feb 2013 22:14:57 GMT
On Jan 29, 2013, at 5:53 PM, Nathan Vander Wilt wrote:
> So I've heard from both hosting providers that it should be fine, but I also managed
> to take both of their shared services "down" with only ~100 users (200 continuous
> filtered replications). I'm only now at the point where I have tooling to build
> arbitrarily large tests on my local machine and see the stats for myself, but as I
> understand it, the issue is that every replication needs at least one couchjs process
> to do its filtering.
> 
> So rather than inactive users mostly just taking up disk space, each one instead
> costs a full-fledged process' worth of memory and system resources, all the time. As
> I understand it, this isn't much better on BigCouch, since the data is scattered ±
> evenly across the machines: while the *computation* is spread out, each node in the
> cluster still needs k*numberOfUsers couchjs processes running. So it's "scalable"
> only in the sense that traditional databases are scalable: vertically, by buying
> machines with more and more memory.
> 
> Again, I am still working on getting a better feel for the costs involved, but the
> basic theory with a master-to-N hub is not a great start: every change needs to be
> processed by all N replications. So if a user writes a document that ends up in the
> master database, every other user's filter function needs to process that change
> coming out of master. Even when N users are generating 0 (instead of M) changes, the
> system isn't doing M*N work, but there are still always 2*N open connections and
> supporting processes, a nasty baseline for large values of N.

Looks like I was wrong about needing enough RAM for one couchjs process per replication.

CouchDB maintains a pool of at most query_server_config/os_process_limit couchjs processes,
and work is divvied out amongst these as necessary. There's a little meta-discussion of this
system at https://issues.apache.org/jira/browse/COUCHDB-1375 and the code that uses it is at https://github.com/apache/couchdb/blob/master/src/couchdb/couch_query_servers.erl#L299
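For reference, that cap lives in the server config and can be raised in local.ini; the value below is just an example, not a recommendation:

```ini
; Upper bound on the shared couchjs pool described above
; (illustrative value; tune against your own measurements).
[query_server_config]
os_process_limit = 25
```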

On my laptop, I was able to spin up 250 users without issue. Beyond that, I start running
into ± hardcoded system resource limits that Erlang has under Mac OS X, but from what I've
seen, the only theoretical scalability issue with going beyond that on Linux/Windows would
be response times, as the worker processes become more and more saturated.

It still seems wise to implement tiered replications for communicating between thousands of
*active* user databases, but that seems reasonable to me.

thanks,
-natevw