From: Paul Davis
Date: Mon, 26 Oct 2009 11:57:39 -0400
Subject: Re: massive replication?
To: user@couchdb.apache.org

Miles,

On the one hand, it sounds like you could solve this with XMPP and use
CouchDB as a backing store. People have already connected RabbitMQ and
CouchDB; I can't imagine that connecting ejabberd and CouchDB would be much
harder. The pubsub extensions could very well be what you're after.

Though, that doesn't let you write awesome distributed routing protocols.
And really, where's the fun in that? I'm basically thinking something along
the lines of:

1. Initial peer discovery is seeded or auto-discovered from multicast
   messages, or some combination thereof depending on your network topology.

2. Nodes gossip information they've found through replication with nodes
   they know about. Something like: create a database that contains network
   connections and replicate it alongside your data. A view on this database
   would calculate each subscription endpoint's highest known update_seq and
   who has it (a rough sketch of such a view follows this list).

3. Each subscription endpoint is a database that can be replicated.
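A minimal sketch of that view, assuming a hypothetical "connections" database
where each document looks roughly like {"endpoint": "...", "node": "...",
"update_seq": 1234}. The database name, field names, and design doc name are
all made up for illustration; the view functions themselves are ordinary
CouchDB JavaScript, pushed and queried here from Python:

import requests

COUCH = "http://localhost:5984"   # local CouchDB node (placeholder)
DB = "connections"                # hypothetical gossip database

# Design doc with one view: highest known update_seq per subscription
# endpoint, and which node claims to have it. The map/reduce bodies are
# plain CouchDB JavaScript stored as strings.
ddoc = {
    "_id": "_design/gossip",
    "views": {
        "latest_seq": {
            "map": """
                function(doc) {
                  if (doc.endpoint && doc.update_seq) {
                    emit(doc.endpoint, {seq: doc.update_seq, node: doc.node});
                  }
                }""",
            "reduce": """
                function(keys, values, rereduce) {
                  var best = values[0];
                  for (var i = 1; i < values.length; i++) {
                    if (values[i].seq > best.seq) best = values[i];
                  }
                  return best;
                }""",
        }
    },
}

# First write will create the design doc; a rerun would need the current _rev.
requests.put("%s/%s/_design/gossip" % (COUCH, DB), json=ddoc)

# group=true reduces per endpoint: one row per endpoint with the highest
# update_seq seen anywhere in the gossiped data, and the node that has it.
rows = requests.get(
    "%s/%s/_design/gossip/_view/latest_seq" % (COUCH, DB),
    params={"group": "true"},
).json()["rows"]

for row in rows:
    print(row["key"], row["value"]["seq"], row["value"]["node"])

Replicating the connections database between peers then spreads this
per-endpoint knowledge the same way the data itself spreads.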
There are at least two hard parts I can spot right now:

1. Replication currently has no detection endpoint. We don't expose a list
   of _local/docs to anything, so detecting when a node is replicating to us
   would be harder. Adding a custom patch to alert a db_update_notification
   process on _local/doc updates would be fairly trivial (a sketch of such a
   listener appears below).

2. The fun part is the replication strategy: trying to maximize network
   throughput, avoid stampeding a single updater, etc. There's enough
   literature on this to get a general idea of where to go, but it'd still
   probably take some tuning and experimentation. With plenty of monitoring
   and some foresight, though, giving yourself the ability to tweak system
   parameters and measure the effects would be quite a lot of fun.

... Holy crap, I'm contemplating writing the basic routing logic in
Python/Ruby and replicating it out to the network, so that as each node gets
a copy the router replaces its logic dynamically. Now that'd be just plain
awesome.

Note that this sort of protocol doesn't attempt to save bandwidth by
multicasting documents, which may be a non-starter depending on your
topology and traffic characteristics. Instead the idea is to figure out how
to shape your replication topology so it reacts to and matches your physical
link network, yet with enough redundancy that it doesn't break as physical
links come and go.

Granted, that's all off the top of my head, so until I sit down and
implement it on a couple hundred nodes it may well be a bunch of hot air. :)

Paul Davis
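For reference, the db_update_notification hook Paul mentions already exists
for database-level events: you register an external command in the
[update_notification] section of the ini file, CouchDB spawns it, and writes
one JSON object per line to its stdin for each update (roughly
{"type": "updated", "db": "some_db"}). The per-_local/doc detail would only
exist with the custom patch he describes. A minimal listener sketch, with a
hypothetical "connections" database and a stubbed-out reaction:

import json
import sys

# Registered in CouchDB's local.ini, for example:
#
#   [update_notification]
#   gossip=/usr/bin/python /path/to/gossip_notify.py
#
# CouchDB keeps this process running and feeds it one JSON event per line.

def handle(event):
    if event.get("type") == "updated" and event.get("db") == "connections":
        # Hypothetical reaction: re-read the gossip view, pick peers to
        # replicate with, and so on. Left as a stub here.
        sys.stderr.write("connections db changed, would re-plan replication\n")

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        handle(json.loads(line))
    except ValueError:
        # Ignore anything that isn't a JSON event line.
        continue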
On Mon, Oct 26, 2009 at 11:33 AM, Miles Fidelman wrote:
> Adam Kocoloski wrote:
>>
>> On Oct 26, 2009, at 10:45 AM, Miles Fidelman wrote:
>>
>>> The environment we're looking at is more of a mesh where connectivity is
>>> coming up and down - think mobile ad hoc networks.
>>>
>>> I like the idea of a replication bus, perhaps using something like
>>> spread (http://www.spread.org/) or spines (www.spines.org) as a
>>> multi-cast fabric.
>>>
>>> I'm thinking something like continuous replication - but where the
>>> updates are pushed to a multi-cast port rather than to a specific node,
>>> with each node subscribing to update feeds.
>>>
>>> Anybody have any thoughts on how that would play with the current
>>> replication and conflict resolution schemes?
>>>
>>> Miles Fidelman
>>
>> Hi Miles, this sounds like really cool stuff.  Caveat: I have no
>> experience using Spread/Spines and very little experience with IP
>> multicasting, which I guess is what those tools try to reproduce in
>> internet-like environments.  So bear with me if I ask stupid questions.
>>
>> 1) Would the CouchDB servers be responsible for error detection and
>> correction?  I imagine that complicates matters considerably, but it
>> wouldn't be impossible.
>
> Good question.  I hadn't quite thought that far ahead.  I think the basic
> answer is no (assume reliable multicast), but... some kind of healing
> mechanism would probably be required (see below).
>
>> 2) When these CouchDB servers drop off for an extended period and then
>> rejoin, how do they subscribe to the update feed from the replication bus
>> at a particular sequence?  This is really the key element of the setup.
>> When I think of multicasting I think of video feeds and such, where if
>> you drop off and rejoin you don't care about the old stuff you missed.
>> That's not the case here.  Does the bus store all this old feed data?
>
> Think of something like RSS, but with distributed infrastructure.
> A node would publish an update to a specific address (e.g., like
> publishing an RSS feed).
>
> All nodes would subscribe to the feed, and receive new messages in
> sequence.  When picking up updates, you ask for everything after a
> particular sequence number.  The update service maintains the data.
>
>> 3) Which steps of the replication do you envision using the replication
>> bus?  Just the _changes feed (essentially a list of docid:rev pairs) or
>> the actual documents themselves?
>
> Any change to a local copy of the database (i.e., everything).
>
>> The conflict resolution model shouldn't care about whether replication
>> is p2p or uses this bus.  Best,
>
> Thanks,
>
> Miles
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   .... Yogi Berra
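The "ask for everything after a particular sequence number" pattern in the
exchange above is essentially what CouchDB's _changes feed already provides:
every update bumps the database's update_seq, and a client can request only
the changes since a sequence it has already seen. A minimal polling sketch,
where the host, database name, and checkpoint file are made-up placeholders:

import requests

COUCH = "http://localhost:5984"   # placeholder CouchDB node
DB = "some_feed"                  # placeholder subscription database
CHECKPOINT = "last_seq.txt"       # where this client remembers its position

def load_last_seq():
    try:
        with open(CHECKPOINT) as f:
            return f.read().strip() or "0"
    except IOError:
        return "0"

def save_last_seq(seq):
    with open(CHECKPOINT, "w") as f:
        f.write(str(seq))

# Ask only for changes after the sequence we already have; the response is
# a list of {seq, id, changes: [{rev: ...}]} rows plus a last_seq value to
# checkpoint for the next request.
resp = requests.get(
    "%s/%s/_changes" % (COUCH, DB),
    params={"since": load_last_seq()},
).json()

for change in resp.get("results", []):
    # Each row names a doc id and the winning rev(s) changed since our seq.
    print(change["seq"], change["id"], change["changes"][0]["rev"])

save_last_seq(resp.get("last_seq", load_last_seq()))

Whether those rows travel peer to peer or over a multicast bus, the
since/last_seq checkpoint is what lets a node that has been offline rejoin
without replaying everything it already has.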