couchdb-dev mailing list archives

From: Adam Kocoloski <kocol...@apache.org>
Subject: Re: The replicator needs a superuser mode
Date: Wed, 17 Aug 2011 02:49:16 GMT
On Aug 16, 2011, at 10:31 PM, Jason Smith wrote:

> On Tue, Aug 16, 2011 at 9:26 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
>> One of the principal uses of the replicator is to "make this database look like that
>> one".  We're unable to do that in the general case today because of the combination of validation
>> functions and out-of-order document transfers.  It's entirely possible for a document to be
>> saved in the source DB prior to the installation of a ddoc containing a validation function
>> that would have rejected the document, for the replicator to install the ddoc in the target
>> DB before replicating the other document, and for the other document to then be rejected by
>> the target DB.
> 
> Somebody asked about this on Stack Overflow. It was a very simple but
> challenging question, but now I can't find it. Basically, he made your
> point above.
> 
> Aren't you identifying two problems, though?
> 
> 1. Sometimes you need to ignore validation to just make a nice, clean copy.
> 2. Replication batches (an optimization) are disobeying the change
> sequence, which can screw up the replica.

As far as I know the only reason one needs to ignore validation to make a nice clean copy
is because the replicator does not guarantee the updates are applied on the target in the
order they were received on the source.  It's all one issue to me.
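
The failure mode described above can be reproduced with a toy model. The sketch below is not CouchDB code: TargetDB, replicate, and require_type are hypothetical names invented for illustration, the validation rule (every document needs a "type" field) is an assumption, and a Python callable stands in for the JavaScript validate_doc_update function.

```python
def require_type(doc):
    """Hypothetical validation rule: reject documents without a 'type' field."""
    if "type" not in doc:
        raise ValueError("forbidden: document has no type")


class TargetDB:
    """Toy target database that enforces installed validation functions."""

    def __init__(self):
        self.docs = {}
        self.validators = []

    def save(self, doc):
        if doc["_id"].startswith("_design/"):
            # Simplification: design docs skip validation in this toy model.
            if "validate_doc_update" in doc:
                self.validators.append(doc["validate_doc_update"])
        else:
            for validate in self.validators:
                validate(doc)  # raises ValueError if the doc is rejected
        self.docs[doc["_id"]] = doc


def replicate(changes, target):
    """Apply changes in the given order; return the ids the target rejected."""
    rejected = []
    for doc in changes:
        try:
            target.save(doc)
        except ValueError:
            rejected.append(doc["_id"])
    return rejected


# Source history: old-doc was saved before the validation ddoc existed.
old_doc = {"_id": "old-doc"}
ddoc = {"_id": "_design/app", "validate_doc_update": require_type}
new_doc = {"_id": "new-doc", "type": "post"}

in_order = replicate([old_doc, ddoc, new_doc], TargetDB())
reordered = replicate([ddoc, old_doc, new_doc], TargetDB())

print(in_order)   # nothing rejected when source order is preserved
print(reordered)  # old-doc is rejected once the ddoc arrives first
```

In this model the only difference between the two runs is the order of arrival at the target, which is the sense in which the two problems collapse into one.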

> I responded to #1 already.
> 
> But my feeling about #2 is that the optimization goes too far.
> Replication batches should always have boundaries immediately before
> and after design documents. In other words, batch all you want, but
> design documents [1] must always be in a batch size of 1. That will
> retain the semantics.
> 
> [1] Actually, the only ddocs needing their own private batches are
> those with a validate_doc_update field.
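
The batching rule proposed in the quoted text could look roughly like the sketch below. It is an illustration only, not the replicator's actual batching logic: batch_changes, has_validation, and the max_batch parameter are hypothetical names, and the changes feed is modeled as a plain list of documents in source order.

```python
def has_validation(doc):
    """True for a design document that carries a validate_doc_update field."""
    return doc["_id"].startswith("_design/") and "validate_doc_update" in doc


def batch_changes(changes, max_batch=3):
    """Split an ordered list of docs into batches.

    A design doc with a validation function always gets a batch of its own,
    so a batch boundary falls immediately before and after it.
    """
    batches, current = [], []
    for doc in changes:
        if has_validation(doc):
            if current:              # close the batch in progress
                batches.append(current)
                current = []
            batches.append([doc])    # private batch of size 1
        else:
            current.append(doc)
            if len(current) == max_batch:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches


changes = [
    {"_id": "a"},
    {"_id": "b"},
    {"_id": "_design/app", "validate_doc_update": "function (newDoc) { /* ... */ }"},
    {"_id": "c"},
    {"_id": "_design/views"},  # no validation function, so batched normally
    {"_id": "d"},
    {"_id": "e"},
]

batch_ids = [[doc["_id"] for doc in batch] for batch in batch_changes(changes)]
print(batch_ids)
```

Note that this only preserves ordering relative to the validating ddoc if the batches themselves are applied in sequence on the target.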

My standard retort to transaction boundaries is that there is no global ordering of events
in a distributed system.  A clustered CouchDB can try to build a vector clock out of the change
sequences of the individual servers and stick to that merged sequence during replication,
but even then the ddoc entry in the feed could be "concurrent" with several other updates.
I rather like that the replicator aggressively mixes up the ordering of updates because it
prevents us from making choices in the single-server case that aren't sensible in a cluster.
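
To make "concurrent" concrete, here is a minimal sketch of the standard vector clock comparison. It assumes a clock is a mapping from server name to that server's update sequence; the node names, the sequence numbers, and the compare function are all hypothetical and do not correspond to any CouchDB API.

```python
def compare(a, b):
    """Compare two vector clocks given as {node: sequence} dicts.

    Returns "before", "after", "equal", or "concurrent".
    A missing node counts as sequence 0.
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Hypothetical clocks for a ddoc update and an ordinary document update.
ddoc_clock = {"node1": 5, "node2": 2}
doc_clock = {"node1": 4, "node2": 3}

# Neither clock dominates the other, so no ordering between the two exists.
print(compare(ddoc_clock, doc_clock))

# Here the first clock is less than or equal everywhere, so it happened first.
print(compare({"node1": 1}, {"node1": 2, "node2": 1}))
```

When the result is "concurrent", there is no fact of the matter about whether the ddoc came before or after the other update, so a batch boundary around the ddoc has nothing well-defined to preserve.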

By the way, I don't consider this line of discussion presumptuous in the least.  Cheers,

Adam

