From: Randall Leeds
To: dev@couchdb.apache.org
Date: Fri, 22 May 2009 12:27:23 -0400
Subject: Re: reiterating transactions vs. replication

On Fri, May 22, 2009 at 00:34, Paul Davis wrote:
> >>
> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
> >> 2) Update foo -> foo/2
> >> Compact the DB (foo/1 is deleted)
> >> Start replicating to a mirror
> >> Replication crashes before it reaches foo/2
> >
> > By crash you mean an error due to a conflict between foo/2 and foo/1'
> > (the mirror's version of foo), right?
> >
>
> Pretty sure he means the network link fails (or code fails, etc).

What about "a node has been compromised" or "someone is spoofing messages
from one of the nodes"?

These questions lead me to think about a plug-in component model for this
work. I can imagine very different requirements with very different
overhead even to just maintain the "normal" couchdb guarantees (putting
aside any transactional _bulk_docs).
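
Just to make the plug-in idea a bit more concrete, here is a very rough
sketch of what a pluggable guarantee/trust layer could look like as an
Erlang behaviour. Every name below is invented for illustration; nothing
like this exists in the tree today.

-module(couch_guarantee_policy).
-export([behaviour_info/1]).

%% A deployment-specific callback module decides how an update gets
%% ordered/committed across nodes and whether a peer's message should be
%% trusted at all (signature checks, etc., for the compromised-node and
%% spoofing cases above).
behaviour_info(callbacks) ->
    [{commit, 2},           %% commit(Update, Peers) -> {ok, Seq} | {error, Reason}
     {verify_message, 2}];  %% verify_message(Peer, Msg) -> ok | {error, untrusted}
behaviour_info(_Other) ->
    undefined.

A paranoid deployment could plug in a module that does Byzantine agreement
and signature checks, while a single-node install plugs in a trivial module
that just writes locally.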
> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
> >> I think this means we'd need to modify the compaction algorithm to keep
> >> revisions of documents if a) the revision was part of an atomic
> >> _bulk_docs, and b) any of the documents in that transaction are still
> >> at the revision generated by the transaction. Same thing goes for
> >> revision stemming -- we can never drop revisions if they were part of
> >> an atomic upload and at least one of the document revs in the upload is
> >> still current.
>
> In general, there were two main ideas that I saw when reading
> literature on this subject:
>
> 1. Keep some sort of edit history
> 2. If a replication event occurs, and the target node is too out of
> date, then trigger a full database copy.
>
> At one point I suggested something like:
>
> 1. Keep some sort of edit history.
> We already do this. And with revision stemming we already have a
> configurable "How much history is kept" option. The fancy twist is
> that we don't remove old revisions from the update sequence btree
> until the revision stemming removes the revision info data. These
> revisions are then replicated as part of normal replication.
>
> 2. In the case of replicating from a node that's too far out of whack,
> instead of a full database copy, we just fall back to our current
> replication scheme in that we lose all transaction guarantees (or the
> guarantees that we can no longer guarantee, this is all quite hand
> wavy).
>
> For point 1, I can see one of a few methods to deal with transactions.
> Either we don't commit any of the docs in the transaction until they
> all make it across the wire, or we just mark them as a conflict (with
> maybe a 'in_transaction' modifier or some such). Keeping track of
> revisions is pretty cake because all the documents would be sequential
> in the update sequence btree. And it should also be easy to tell when
> a transaction is so old that we no longer have all the data necessary
> to make it work.

To be clear, not all documents are sequential in the update sequence btree
unless we run some consensus protocol to decide the ordering or come up
with some other vector-clock-type solution. I don't know much about the
latter, but they've come up in discussions before on this list.

I've thought about this a lot lately and I really like the techniques of
Mencius [1], which runs a simplified Paxos that commits after only two
communication rounds. There's a huge win if we want to allow re-ordering
of commits, which is probably fine unless the application assumes some
dependency between documents (frequently noted as a Bad Idea). Many
servers can then commit writes in one communication round. For example, a
server can accept some other server's proposal for sequence number i and
commit i+1 (assuming it is the leader for round i+1) even before it learns
the result of the consensus for i, as long as i+1 and i touch different
documents.
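
A minimal sketch of that last condition, purely illustrative (module and
function names made up): committing the proposal for slot i+1 ahead of
slot i is only safe when the two slots touch disjoint sets of documents.

-module(seq_commit).
-export([can_commit_early/2]).

%% PendingDocIds: doc ids touched by earlier sequence slots whose consensus
%% outcome we have not yet learned.
%% NextDocIds: doc ids touched by the proposal we want to commit now.
%% Committing early is only safe if the sets are disjoint, so re-ordering
%% cannot change what any individual document ends up as.
can_commit_early(PendingDocIds, NextDocIds) ->
    sets:is_disjoint(sets:from_list(PendingDocIds),
                     sets:from_list(NextDocIds)).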
For point 2, I think we should make a distinction between inter-node
replication and replication into and out of the clustered deployment in
order to discuss it well. Inter-node migration of shards might rely on
replication, but if this is the case it should probably be triggered and
managed "from above". In other words, it might involve passing around a
"view id" that increments on any shard migration, and having nodes propose
"view changes" to the update sequence consensus when shards migrate. When
a view change proposal is accepted, replication starts. Only when
replication ends does the rest of the group "learn" the new mapping. If
the nodes cooperate on changing to a new view with a new shard-to-node
mapping, I don't think there should ever be conflicts caused by
replication.

Some thought needs to go into the other scenario (replicating in/out with
the cluster viewed as a single CouchDB instance), but something tells me
that if we get globally ordered seq numbers it's trivial.

> >
> > However, in another post Damien said:
> >
> >> Which is why in general you want to avoid inter-document dependencies,
> >> or be relaxed in how you deal with them.
>

See my point above.

> Though, the more times I end up writing out long responses to how we
> might do replication and the requirements and this and that the more
> likely I'll be to just tag any and all replication emails with "will
> only discuss working code". Judging from date stamps in that thread,
> its been four months and not one person has offered even a
> broken-almost-but-not-quite-working patch. In the words of Damien's
> blog tagline, "Everybody keeps on talking about it. Nobody's getting
> it done".

Since I do like the component model, I'm planning to set up a github
project to play with some consensus protocols and overlay networks in
Erlang. Hopefully once I start doing that I'll start to see the places
where CouchDB can hook into it and get a nice, clean, flexible API.

I see the problem broken into several tiers:

- Transactional Bulk Docs (this is the wishlist and the challenge, but it
  has to rest on the tiers below)
- Sharding/Replication (_seq consensus / possibly consistent hashing or
  some other distributed, deterministic data structure mapping BTree nodes
  to servers [2])
- Communication (either Erlang or TCP with a pluggable overlay network for
  routing)

In the long term I'd love to see CouchDB be able to handle all kinds of
deployments no one is doing right now. For example, with one component
stack you might be able to run CouchDB in a peer-to-peer overlay with
replication and automatic migration, Byzantine fault-tolerant consensus,
and an application which signs documents in such a way that they can be
verified on reads. Why? Who knows. You dream it.

However, this argument leads me to think that an "atomic" flag might be
ok. If your deployment doesn't support it, just send back an error code.

I think it'd be appropriate to amend that to:

> If anyone wants this feature, then start sending code. We're all happy
> to help introduce people to the code base if guidance is required, but
> enough time has gone by that its hard to seriously consider proposals
> with no concrete realization.

I have graduation ceremonies this weekend. Afterward, I hope to follow
through on this advice and invite anyone to join me. So, keep the
discussion flowing and I think we'll get some code flowing sooner rather
than later.

[1] Mencius: Building Efficient Replicated State Machines for WANs
http://www.sysnet.ucsd.edu/~yamao/pub/mencius-osdi.pdf
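
P.S. To give the "atomic" flag idea above a little more shape, here is a
hypothetical sketch only; every name in it is made up and none of this is
a patch against the actual _bulk_docs handler:

-module(bulk_docs_sketch).
-export([handle_bulk_docs/3]).

%% If the client asked for transactional semantics and this deployment's
%% backend can't provide them, fail loudly with an error instead of
%% silently downgrading the guarantee.
handle_bulk_docs(Docs, Options, Backend) ->
    case proplists:get_bool(atomic, Options)
         andalso not Backend:supports_atomic_bulk_docs() of
        true  -> {error, atomic_bulk_docs_not_supported};
        false -> Backend:update_docs(Docs, Options)
    end.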