Subject: Re: Round-robin replication [was Half-baked idea: incremental virtual databases]
From: Nathan Vander Wilt
Date: Tue, 5 Feb 2013 10:38:45 -0800
To: dev@couchdb.apache.org

+1 on round-robin continuous replication. Ideally it'd allow replications to specify a relative priority, e.g. replication of log alerts or chat messages might desire lower latency (higher priority) than a ddoc deployment or user backup. For now I'm going to just implement my own duct tape version of this, using cron jobs to trigger non-continuous replications.
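For concreteness, the kind of duct-tape approach meant here might look something like the sketch below: a cron job that sweeps the user databases and posts one-shot replications to _replicate. The server URL, credentials, and db-per-user naming are placeholders, not anyone's real setup.

#!/usr/bin/env python
# Rough sketch of cron-triggered, non-continuous replication: instead of
# keeping N continuous filtered replications open, a periodic job posts
# one-shot replications that catch up and then exit.
import requests

COUCH = "http://admin:secret@localhost:5984"          # placeholder URL/credentials
MASTER_DB = "master"                                  # hypothetical hub database
USER_DBS = ["userdb-%03d" % n for n in range(100)]    # hypothetical db-per-user layout

def replicate_once(source, target):
    # POST /_replicate without "continuous" runs a one-shot replication,
    # so nothing stays resident between cron runs.
    resp = requests.post(COUCH + "/_replicate",
                         json={"source": source, "target": target})
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for db in USER_DBS:
        replicate_once(MASTER_DB, db)   # fan master's changes out to each user
        replicate_once(db, MASTER_DB)   # and pull each user's writes back in

A crontab entry running that every few minutes trades replication latency for a bounded number of couchjs processes and open connections.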
FWIW, I'm sharing with my client's permission the script I've been using to load-test continuous filtered replication to/from a central master:

https://gist.github.com/natevw/4711127

The test script sets up N+1 databases, writes documents to one as the master while replicating to the other N, as well as "short-polling" _changes to kinda simulate general load on top of the longpolling the application does. On OS X I can only get to around 250 users due to some FD_SETSIZE stuff with Erlang, but it remains stable if I keep it under that limit; however, it takes the user database replications a *long* time to all get caught up (some don't even start until a few minutes after the changes stop).

hth,
-natevw


On Feb 4, 2013, at 2:50 PM, Robert Newson wrote:

> I had a mind to teach the _replicator db this trick. Since we have a
> record of everything we need to resume a replication, there's no reason
> for a one-to-one correspondence between a _replicator doc and a
> replicator process. We can simply run N of them for a bit (say, a
> batch of 1000 updates) and then switch to others. The internal
> db_updated mechanism is a good way to notice when we might have
> updates worth sending, but it's only half the story. A round-robin over
> all _replicator docs (other than one-shot ones, of course) seems a
> really neat trick to me.
>
> B.
>
> On 4 February 2013 22:39, Jan Lehnardt wrote:
>>
>> On Feb 4, 2013, at 23:14, Nathan Vander Wilt wrote:
>>
>>> On Jan 29, 2013, at 5:53 PM, Nathan Vander Wilt wrote:
>>>
>>>> So I've heard from both hosting providers that it is fine, but also managed to take both of their shared services "down" with only about ~100 users (200 continuous filtered replications). I'm only now at the point where I have tooling to build out arbitrarily large tests on my local machine to see the stats for myself, but as I understand it the issue is that every replication needs at least one couchjs process to do its filtering for it.
>>>>
>>>> So rather than inactive users mostly just taking up disk space, they're instead each costing a full-fledged process worth of memory and system resources, all the time. As I understand it, this isn't much better on BigCouch either, since the data is scattered ± evenly across the machines, so while the *computation* is spread, each node in the cluster still needs k*numberOfUsers couchjs processes running. So it's "scalable" in the sense that traditional databases are scalable: vertically, by buying machines with more and more memory.
>>>>
>>>> Again, I am still working on getting a better feel for the costs involved, but the basic theory with a master-to-N hub is not a great start: every change needs to be processed by every one of the N replications. So if a user writes a document that ends up in the master database, every other user's filter function needs to process that change coming out of master. Even when N users are generating 0 (instead of M) changes, it's not doing M*N work, but there are still always 2*N open connections and supporting processes providing a nasty baseline for large values of N.
>>>
>>> Looks like I was wrong about needing enough RAM for one couchjs process per replication.
>>>
>>> CouchDB maintains a pool of (no more than query_server_config/os_process_limit) couchjs processes, and work is divvied out amongst these as necessary.
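(For anyone following along, here is a minimal illustration of the kind of per-user filtered replication being discussed; the JS filter is what those pooled couchjs processes have to evaluate for every change of every user's replication. The URL, credentials, and names below are made up, not Nate's actual setup.)

# Minimal sketch of a db-per-user filtered replication, for illustration only.
# A JavaScript filter lives in a design doc on the master database; each
# user's replication references it, and the pooled couchjs processes run it
# against every change in the master _changes feed.
import requests

COUCH = "http://admin:secret@localhost:5984"   # placeholder URL/credentials

ddoc = {
    "filters": {
        # The filter itself is JavaScript, evaluated by couchjs per change.
        "by_owner": "function (doc, req) { return doc.owner === req.query.owner; }"
    }
}
requests.put(COUCH + "/master/_design/sync", json=ddoc).raise_for_status()

# One continuous filtered replication per user: each one pushes the whole
# master _changes feed through the filter just to pick out that user's docs.
requests.post(COUCH + "/_replicate", json={
    "source": "master",
    "target": "userdb-alice",
    "continuous": True,
    "filter": "sync/by_owner",
    "query_params": {"owner": "alice"},
}).raise_for_status()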
>>>
>>> I found a little meta-discussion of this system at https://issues.apache.org/jira/browse/COUCHDB-1375 and the code uses it here: https://github.com/apache/couchdb/blob/master/src/couchdb/couch_query_servers.erl#L299
>>>
>>> On my laptop, I was able to spin up 250 users without issue. Beyond that, I start running into ± hardcoded system resource limits that Erlang has under Mac OS X, but from what I've seen the only theoretical scalability issue with going beyond that on Linux/Windows would be response times, as the worker processes become more and more saturated.
>>>
>>> It still seems wise to implement tiered replications for communicating between thousands of *active* user databases, but that seems reasonable to me.
>>
>> CouchDB's design is obviously lacking here.
>>
>> For immediate relief, I'll propose the usual jackhammer of unpopular responses: write your filters in Erlang. (sorry :)
>>
>> For the future: we already see progress in improving the view server situation. Once we get to a more desirable setup (yaynode/v8), we can improve the view server communication; there is no reason you'd need a single JS OS process per active replication, and we should absolutely fix that.
>>
>> --
>>
>> Another angle is the replicator. I know Jason Smith has a prototype of this in Node; it works. Instead of maintaining N active replications, we just keep a maximum number of active connections and cycle out ones that are currently inactive. The DbUpdateNotification mechanism should make this relatively straightforward. There is added overhead for setting up and tearing down replications, but we can make better use of resources and not clog things with inactive replications. Especially in a db-per-user scenario, most replications don't see much of an update most of the time; they should be inactive until data is written to any of the source databases. The mechanics in CouchDB are all there for this, we just need to write it.
>>
>> --
>>
>> Nate, thanks for sharing your findings and for bearing with us, despite your very understandable frustrations. It is people like you that allow us to make CouchDB better!
>>
>> Best
>> Jan
>> --
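For reference, a rough sketch of the round-robin idea Robert and Jan describe, expressed against the public HTTP API rather than inside the replicator manager (which is where the real work would happen). The URL, credentials, database names, and the MAX_ACTIVE bound are all placeholders.

# Round-robin over many replication pairs with a bounded number in flight,
# instead of one always-on continuous replication per user. This is only a
# sketch of the scheduling idea, not CouchDB's replicator internals.
from concurrent.futures import ThreadPoolExecutor
import time

import requests

COUCH = "http://admin:secret@localhost:5984"                 # placeholder
PAIRS = [("master", "userdb-%03d" % n) for n in range(100)]  # hypothetical db-per-user fan-out
MAX_ACTIVE = 10                                              # cap on concurrent replications

def replicate_once(pair):
    source, target = pair
    # One-shot replication: resumes from its checkpoint, catches up, and
    # exits, so an idle pair costs nothing between turns.
    requests.post(COUCH + "/_replicate",
                  json={"source": source, "target": target}).raise_for_status()

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=MAX_ACTIVE) as pool:
        while True:
            # One sweep gives every pair a turn, at most MAX_ACTIVE at a time.
            list(pool.map(replicate_once, PAIRS))
            time.sleep(30)   # crude pacing; a real version would react to
                             # database update notifications instead of polling

The trade-off is the setup/teardown overhead Jan mentions, in exchange for a bounded number of connections and couchjs workers no matter how many users sit idle.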