From: Randall Leeds
To: dev@couchdb.apache.org
Date: Fri, 22 May 2009 12:27:23 -0400
Subject: Re: reiterating transactions vs. replication

On Fri, May 22, 2009 at 00:34, Paul Davis wrote:
> >>
> >> 1) Generate foo/1 and bar/1 in an atomic _bulk_docs operation
> >> 2) Update foo -> foo/2
> >> Compact the DB (foo/1 is deleted)
> >> Start replicating to a mirror
> >> Replication crashes before it reaches foo/2
> >
> > By crash you mean an error due to a conflict between foo/2 and foo/1'
> > (the mirror's version of foo), right?
> >
>
> Pretty sure he means the network link fails (or code fails, etc).

What about "a node has been compromised" or "someone is spoofing messages
from one of the nodes"?

These questions lead me to think about a plug-in component model for this
work. I can imagine very different requirements with very different
overhead even to just maintain the "normal" couchdb guarantees (putting
aside any transactional _bulk_docs).
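
Just to make the plug-in idea a bit more concrete, here is a very rough
sketch of what a pluggable guarantee/trust layer could look like as an
Erlang behaviour. Every name below is invented for illustration; nothing
like this exists in the tree today.

-module(couch_guarantee_policy).
-export([behaviour_info/1]).

%% A deployment-specific callback module decides how an update gets
%% ordered/committed across nodes and whether a peer's message should be
%% trusted at all (signature checks, etc., for the compromised-node and
%% spoofing cases above).
behaviour_info(callbacks) ->
    [{commit, 2},           %% commit(Update, Peers) -> {ok, Seq} | {error, Reason}
     {verify_message, 2}];  %% verify_message(Peer, Msg) -> ok | {error, untrusted}
behaviour_info(_Other) ->
    undefined.

A paranoid deployment could plug in a module that does Byzantine agreement
and signature checks, while a single-node install plugs in a trivial module
that just writes locally.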
> >> In your proposal, we should expect foo/1 to exist on the mirror, right?
> >> I think this means we'd need to modify the compaction algorithm to keep
> >> revisions of documents if a) the revision was part of an atomic
> >> _bulk_docs, and b) any of the documents in that transaction are still
> >> at the revision generated by the transaction. Same thing goes for
> >> revision stemming -- we can never drop revisions if they were part of
> >> an atomic upload and at least one of the document revs in the upload is
> >> still current.
>
> In general, there were two main ideas that I saw when reading
> literature on this subject:
>
> 1. Keep some sort of edit history
> 2. If a replication event occurs, and the target node is too out of
> date, then trigger a full database copy.
>
> At one point I suggested something like:
>
> 1. Keep some sort of edit history.
> We already do this. And with revision stemming we already have a
> configurable "How much history is kept" option. The fancy twist is
> that we don't remove old revisions from the update sequence btree
> until the revision stemming removes the revision info data. These
> revisions are then replicated as part of normal replication.
>
> 2. In the case of replicating from a node that's too far out of whack,
> instead of a full database copy, we just fall back to our current
> replication scheme in that we lose all transaction guarantees (or the
> guarantees that we can no longer guarantee, this is all quite hand
> wavy).
>
> For point 1, I can see one of a few methods to deal with transactions.
> Either we don't commit any of the docs in the transaction until they
> all make it across the wire, or we just mark them as a conflict (with
> maybe a 'in_transaction' modifier or some such). Keeping track of
> revisions is pretty cake because all the documents would be sequential
> in the update sequence btree. And it should also be easy to tell when
> a transaction is so old that we no longer have all the data necessary
> to make it work.

To be clear, not all documents are sequential in the update sequence btree
unless we run some consensus protocol to decide the ordering or come up
with some other vector-clock-type solution. I don't know much about the
latter, but they've come up in discussions before on this list.

I've thought about this a lot lately and I really like the techniques of
Mencius [1], which runs a simplified Paxos that commits after only two
communication rounds. There's a huge win if we want to allow re-ordering
of commits, which is probably fine unless the application assumes some
dependency between documents (frequently noted as a Bad Idea). Many
servers can then commit writes in one communication round. For example, a
server can accept some other server's proposal for sequence number i and
commit i+1 (assuming it is the leader for round i+1) even before it learns
the result of the consensus for i, as long as i+1 and i touch different
documents.
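
A minimal sketch of that last condition, purely illustrative (module and
function names made up): committing the proposal for slot i+1 ahead of
slot i is only safe when the two slots touch disjoint sets of documents.

-module(seq_commit).
-export([can_commit_early/2]).

%% PendingDocIds: doc ids touched by earlier sequence slots whose consensus
%% outcome we have not yet learned.
%% NextDocIds: doc ids touched by the proposal we want to commit now.
%% Committing early is only safe if the sets are disjoint, so re-ordering
%% cannot change what any individual document ends up as.
can_commit_early(PendingDocIds, NextDocIds) ->
    sets:is_disjoint(sets:from_list(PendingDocIds),
                     sets:from_list(NextDocIds)).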
For point 2, I think we should make a distinction between inter-node
replication and replication into and out of the clustered deployment in
order to discuss it well. Inter-node migration of shards might rely on
replication, but if this is the case it should probably be triggered and
managed "from above". In other words, it might involve passing around a
"view id" that increments on any shard migration, and having nodes propose
"view changes" to the update sequence consensus when shards migrate. When
a view change proposal is accepted, replication starts. Only when
replication ends does the rest of the group "learn" the new mapping. If
the nodes cooperate on changing to a new view with a new shard-to-node
mapping, I don't think there should ever be conflicts caused by
replication.

Some thought needs to go into the other scenario (replicating in/out with
the cluster viewed as a single CouchDB instance), but something tells me
that if we get globally ordered seq numbers it's trivial.

> >
> > However, in another post Damien said:
> >
> >> Which is why in general you want to avoid inter-document dependencies,
> >> or be relaxed in how you deal with them.
>

See my point above.

> Though, the more times I end up writing out long responses to how we
> might do replication and the requirements and this and that the more
> likely I'll be to just tag any and all replication emails with "will
> only discuss working code". Judging from date stamps in that thread,
> its been four months and not one person has offered even a
> broken-almost-but-not-quite-working patch. In the words of Damien's
> blog tagline, "Everybody keeps on talking about it. Nobody's getting
> it done".

Since I do like the component model, I'm planning to set up a github
project to play with some consensus protocols and overlay networks in
Erlang. Hopefully once I start doing that I'll start to see the places
where CouchDB can hook into it and get a nice, clean, flexible API.

I see the problem broken into several tiers:

- Transactional Bulk Docs (this is the wishlist and the challenge, but it
  has to rest on the tiers below)
- Sharding/Replication (_seq consensus / possibly consistent hashing or
  some other distributed, deterministic data structure mapping BTree nodes
  to servers [2])
- Communication (either Erlang or TCP with a pluggable overlay network for
  routing)

In the long term I'd love to see CouchDB be able to handle all kinds of
deployments no one is doing right now. For example, with one component
stack you might be able to run CouchDB in a peer-to-peer overlay with
replication and automatic migration, Byzantine fault-tolerant consensus,
and an application which signs documents in such a way that they can be
verified on reads. Why? Who knows. You dream it.

However, this argument leads me to think that an "atomic" flag might be
ok. If your deployment doesn't support it, just send back an error code.

I think it'd be appropriate to amend that to:

> If anyone wants this feature, then start sending code. We're all happy
> to help introduce people to the code base if guidance is required, but
> enough time has gone by that its hard to seriously consider proposals
> with no concrete realization.

I have graduation ceremonies this weekend. Afterward, I hope to follow
through on this advice and invite anyone to join me. So, keep the
discussion flowing and I think we'll get some code flowing sooner rather
than later.

[1] Mencius: Building Efficient Replicated State Machines for WANs
http://www.sysnet.ucsd.edu/~yamao/pub/mencius-osdi.pdf
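
P.S. To give the "atomic" flag idea above a little more shape, here is a
hypothetical sketch only; every name in it is made up and none of this is
a patch against the actual _bulk_docs handler:

-module(bulk_docs_sketch).
-export([handle_bulk_docs/3]).

%% If the client asked for transactional semantics and this deployment's
%% backend can't provide them, fail loudly with an error instead of
%% silently downgrading the guarantee.
handle_bulk_docs(Docs, Options, Backend) ->
    case proplists:get_bool(atomic, Options)
         andalso not Backend:supports_atomic_bulk_docs() of
        true  -> {error, atomic_bulk_docs_not_supported};
        false -> Backend:update_docs(Docs, Options)
    end.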