Subject: Re: Round-robin replication [was Half-baked idea: incremental virtual databases]
From: Nathan Vander Wilt
Date: Tue, 5 Feb 2013 10:38:45 -0800
To: dev@couchdb.apache.org

+1 on round-robin continuous replication. Ideally it'd allow replications to specify a relative priority, e.g. replication of log alerts or chat messages might desire lower latency (higher priority) than a ddoc deployment or user backup. For now I'm going to just implement my own duct tape version of this, using cron jobs to trigger non-continuous replications.
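For concreteness, the kind of duct-tape approach meant here might look something like the sketch below: a cron job that sweeps the user databases and posts one-shot replications to _replicate. The server URL, credentials, and db-per-user naming are placeholders, not anyone's real setup.

#!/usr/bin/env python
# Rough sketch of cron-triggered, non-continuous replication: instead of
# keeping N continuous filtered replications open, a periodic job posts
# one-shot replications that catch up and then exit.
import requests

COUCH = "http://admin:secret@localhost:5984"          # placeholder URL/credentials
MASTER_DB = "master"                                  # hypothetical hub database
USER_DBS = ["userdb-%03d" % n for n in range(100)]    # hypothetical db-per-user layout

def replicate_once(source, target):
    # POST /_replicate without "continuous" runs a one-shot replication,
    # so nothing stays resident between cron runs.
    resp = requests.post(COUCH + "/_replicate",
                         json={"source": source, "target": target})
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for db in USER_DBS:
        replicate_once(MASTER_DB, db)   # fan master's changes out to each user
        replicate_once(db, MASTER_DB)   # and pull each user's writes back in

A crontab entry running that every few minutes trades replication latency for a bounded number of couchjs processes and open connections.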
FWIW, I'm sharing with my client's permission the script I've been using to load-test continuous filtered replication to/from a central master:

https://gist.github.com/natevw/4711127

The test script sets up N+1 databases, writes documents to one as the master while replicating to the other N, as well as "short-polling" _changes to kinda simulate general load on top of the longpolling the application does. On OS X I can only get to around 250 users due to some FD_SETSIZE stuff with Erlang, but it remains stable if I keep it under that limit; however, it takes the user database replications a *long* time to all get caught up (some don't even start until a few minutes after the changes stop).

hth,
-natevw


On Feb 4, 2013, at 2:50 PM, Robert Newson wrote:

> I had a mind to teach the _replicator db this trick. Since we have a
> record of everything we need to resume a replication, there's no reason
> for a one-to-one correspondence between a _replicator doc and a
> replicator process. We can simply run N of them for a bit (say, a
> batch of 1000 updates) and then switch to others. The internal
> db_updated mechanism is a good way to notice when we might have
> updates worth sending, but it's only half the story. A round-robin over
> all _replicator docs (other than one-shot ones, of course) seems a
> really neat trick to me.
>
> B.
>
> On 4 February 2013 22:39, Jan Lehnardt wrote:
>>
>> On Feb 4, 2013, at 23:14, Nathan Vander Wilt wrote:
>>
>>> On Jan 29, 2013, at 5:53 PM, Nathan Vander Wilt wrote:
>>>
>>>> So I've heard from both hosting providers that it is fine, but also managed to take both of their shared services "down" with only about ~100 users (200 continuous filtered replications). I'm only now at the point where I have tooling to build out arbitrarily large tests on my local machine to see the stats for myself, but as I understand it the issue is that every replication needs at least one couchjs process to do its filtering for it.
>>>>
>>>> So rather than inactive users mostly just taking up disk space, they're instead each costing a full-fledged process worth of memory and system resources, all the time. As I understand it, this isn't much better on BigCouch either, since the data is scattered ± evenly across the machines, so while the *computation* is spread, each node in the cluster still needs k*numberOfUsers couchjs processes running. So it's "scalable" in the sense that traditional databases are scalable: vertically, by buying machines with more and more memory.
>>>>
>>>> Again, I am still working on getting a better feel for the costs involved, but the basic theory with a master-to-N hub is not a great start: every change needs to be processed by every one of the N replications. So if a user writes a document that ends up in the master database, every other user's filter function needs to process that change coming out of master. Even when N users are generating 0 (instead of M) changes, it's not doing M*N work, but there are still always 2*N open connections and supporting processes providing a nasty baseline for large values of N.
>>>
>>> Looks like I was wrong about needing enough RAM for one couchjs process per replication.
>>>
>>> CouchDB maintains a pool of (no more than query_server_config/os_process_limit) couchjs processes, and work is divvied out amongst these as necessary.
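(For anyone following along, here is a minimal illustration of the kind of per-user filtered replication being discussed; the JS filter is what those pooled couchjs processes have to evaluate for every change of every user's replication. The URL, credentials, and names below are made up, not Nate's actual setup.)

# Minimal sketch of a db-per-user filtered replication, for illustration only.
# A JavaScript filter lives in a design doc on the master database; each
# user's replication references it, and the pooled couchjs processes run it
# against every change in the master _changes feed.
import requests

COUCH = "http://admin:secret@localhost:5984"   # placeholder URL/credentials

ddoc = {
    "filters": {
        # The filter itself is JavaScript, evaluated by couchjs per change.
        "by_owner": "function (doc, req) { return doc.owner === req.query.owner; }"
    }
}
requests.put(COUCH + "/master/_design/sync", json=ddoc).raise_for_status()

# One continuous filtered replication per user: each one pushes the whole
# master _changes feed through the filter just to pick out that user's docs.
requests.post(COUCH + "/_replicate", json={
    "source": "master",
    "target": "userdb-alice",
    "continuous": True,
    "filter": "sync/by_owner",
    "query_params": {"owner": "alice"},
}).raise_for_status()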
>>>
>>> I found a little meta-discussion of this system at https://issues.apache.org/jira/browse/COUCHDB-1375 and the code uses it here: https://github.com/apache/couchdb/blob/master/src/couchdb/couch_query_servers.erl#L299
>>>
>>> On my laptop, I was able to spin up 250 users without issue. Beyond that, I start running into ± hardcoded system resource limits that Erlang has under Mac OS X, but from what I've seen the only theoretical scalability issue with going beyond that on Linux/Windows would be response times, as the worker processes become more and more saturated.
>>>
>>> It still seems wise to implement tiered replications for communicating between thousands of *active* user databases, but that seems reasonable to me.
>>
>> CouchDB's design is obviously lacking here.
>>
>> For immediate relief, I'll propose the usual jackhammer of unpopular responses: write your filters in Erlang. (sorry :)
>>
>> For the future: we already see progress in improving the view server situation. Once we get to a more desirable setup (yaynode/v8), we can improve the view server communication; there is no reason you'd need a single JS OS process per active replication, and we should absolutely fix that.
>>
>> --
>>
>> Another angle is the replicator. I know Jason Smith has a prototype of this in Node; it works. Instead of maintaining N active replications, we just keep a maximum number of active connections and cycle out ones that are currently inactive. The DbUpdateNotification mechanism should make this relatively straightforward. There is added overhead for setting up and tearing down replications, but we can make better use of resources and not clog things with inactive replications. Especially in a db-per-user scenario, most replications don't see much of an update most of the time; they should be inactive until data is written to any of the source databases. The mechanics in CouchDB are all there for this, we just need to write it.
>>
>> --
>>
>> Nate, thanks for sharing your findings and for bearing with us, despite your very understandable frustrations. It is people like you that allow us to make CouchDB better!
>>
>> Best
>> Jan
>> --
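For reference, a rough sketch of the round-robin idea Robert and Jan describe, expressed against the public HTTP API rather than inside the replicator manager (which is where the real work would happen). The URL, credentials, database names, and the MAX_ACTIVE bound are all placeholders.

# Round-robin over many replication pairs with a bounded number in flight,
# instead of one always-on continuous replication per user. This is only a
# sketch of the scheduling idea, not CouchDB's replicator internals.
from concurrent.futures import ThreadPoolExecutor
import time

import requests

COUCH = "http://admin:secret@localhost:5984"                 # placeholder
PAIRS = [("master", "userdb-%03d" % n) for n in range(100)]  # hypothetical db-per-user fan-out
MAX_ACTIVE = 10                                              # cap on concurrent replications

def replicate_once(pair):
    source, target = pair
    # One-shot replication: resumes from its checkpoint, catches up, and
    # exits, so an idle pair costs nothing between turns.
    requests.post(COUCH + "/_replicate",
                  json={"source": source, "target": target}).raise_for_status()

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=MAX_ACTIVE) as pool:
        while True:
            # One sweep gives every pair a turn, at most MAX_ACTIVE at a time.
            list(pool.map(replicate_once, PAIRS))
            time.sleep(30)   # crude pacing; a real version would react to
                             # database update notifications instead of polling

The trade-off is the setup/teardown overhead Jan mentions, in exchange for a bounded number of connections and couchjs workers no matter how many users sit idle.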