From: Paul Davis <paul.joseph.davis@gmail.com>
Date: Fri, 18 Feb 2011 01:23:14 -0500
Subject: Re: idea for transitive replication checkpoints
To: dev@couchdb.apache.org

On Fri, Feb 18, 2011 at 1:15 AM, Paul Davis wrote:
> On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski wrote:
>> Hi all, Paul and I were chatting at today's CouchDB Hack Night about
>> a way to fast-forward replications (thanks Max for the prodding!).
>> It's non-trivial, but I think the benefit for big networks of CouchDB
>> servers can be substantial.
>>
>> The basic idea is that if A replicates with B, and B with C, then a
>> new replication between A and C should not need to start from
>> scratch. I think we can accomplish this as follows:
>>
>> 1) Store the target update sequence along with the source sequence in
>> the checkpoint document, at least in the checkpoint document on the
>> target. The following tuple is important: {Source, _local ID, Session
>> ID, Source Seq, TargetSeq}. Using that syntax let's say we have the
>> following replication records:
>>
>> On A
>> {A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target sequence on the source
>>
>> On B
>> {A, _local/Foo, Bar, 5, 10} % 5 on A corresponds to 10 on B
>> {B, _local/Baz, Bif, 15, _TargetSeq}
>>
>> On C
>> {B, _local/Baz, Bif, 15, 7} % 15 on B corresponds to 7 on C
>>
>> We know that A -> B happened before B -> C.
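>>
>> Sketching that tuple as an Erlang record, with illustrative names
>> rather than actual CouchDB code:
>>
>>     %% One checkpoint entry; the two seq fields tie a source
>>     %% update_seq to the target update_seq it produced.
>>     -record(checkpoint, {
>>         source,      %% source db identity
>>         local_id,    %% checkpoint doc id, e.g. <<"_local/Foo">>
>>         session_id,  %% random session id, e.g. <<"Bar">>
>>         source_seq,  %% e.g. 5
>>         target_seq   %% e.g. 10; may be undefined on the source
>>     }).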
>>
>> 2) During the B -> C replication, when we reach source sequence
>> number 10, the _changes feed from B will deliver some extra
>> information like
>>
>> {A, _local/Foo, Bar, 5}
>>
>> which will be stored at C. This may require a new disk-resident btree
>> keyed on update sequence, or at least an in-memory index constructed
>> by walking the _local docs btree.
>>
>> 3) When we trigger the A -> C replication, C will walk the full
>> checkpoint records in its _local tree and find no mention of A, but
>> then it will also consult the "transitive" checkpoints and find the
>> {A, _local/Foo, Bar, 5} record. It'll consult _local/Foo on A, find
>> that the session ID Bar is still present, and conclude that it can
>> fast-forward the replication and start from update sequence 5. It
>> will then remove that transitive checkpoint and replace it with a
>> full regular checkpoint.
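>>
>> A sketch of that decision in Erlang; fetch_local_doc/2 and the doc
>> shape (a plain list of session ids) are assumptions for illustration,
>> not the real API:
>>
>>     %% Where may the A -> C replication start, given the transitive
>>     %% checkpoint {LocalId, SessionId, SourceSeq} that C holds for A?
>>     start_seq(SourceDb, {LocalId, SessionId, SourceSeq}) ->
>>         case fetch_local_doc(SourceDb, LocalId) of
>>             {ok, SessionIds} ->
>>                 case lists:member(SessionId, SessionIds) of
>>                     true  -> SourceSeq; %% still valid: fast-forward
>>                     false -> 0          %% session gone (e.g. restore)
>>                 end;
>>             {error, not_found} ->
>>                 0 %% no replication history at all: start from scratch
>>         end.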
>>
>> If server A crashes after the A -> B replication and restores from a
>> backup that was recorded before the replication, the session ID Bar
>> will be missing from _local/Foo, so when we try to do the A -> C
>> replication we won't fast-forward. This is the correct behavior.
>>
>> Hopefully this is comprehensible to someone other than me. We spent
>> some time trying to poke holes in it, but it's entirely possible
>> there are other things we didn't consider that will prevent it from
>> working. Cheers,
>>
>> Adam
>
> What Adam said. Also, I was just doing a brain dump and I think I
> might've punched a gaping hole into the whole scenario. I'm not
> entirely certain yet, but it seems ungood. There's a section "Ruh Roh"
> towards the end where my brain dump froze up. It's late so maybe I'm
> just not seeing the easy way around it.
>
> There's also a picture of the end of our white board session at
> http://plixi.com/p/78268064 which probably means little to nothing
> without the context of having seen it drawn and the ensuing argument
> and wild gesticulations. But it's there for posterity.
>
>
> Transitive Replication - The Idea
> =================================
>
> Consider the following scenario:
>
> 1. Replicate A -> B
> 2. Replicate B -> C
> 3. Replicate A -> C
>
> For simplicity's sake, assume no writes occur during this scenario.
> The question is: why can't we short circuit step 3 to effectively be
> a no-op?
>
> Current Situation
> =================
>
> Replication state is based on a pair-wise state reflecting source and
> target information (and filter functions etc). For the above scenario
> to be anywhere near plausible, a couple things need to happen. First,
> we'll obviously need to transfer data from B -> C during replication
> so that C has knowledge about A. This information will have to be
> complete enough to short circuit (or skip part of) a replication
> from A.
>
> The information that B sends to C will need to enable a replication
> from A to C to occur without error in any sort of pathological state
> of A, regardless of what state C thinks A is in. Changes in state may
> include A "forgetting" some edits and resetting to a point in time
> behind the state that C has (for instance, A crashed and was
> recovered to a previous point in time).
>
> C will also need to be able to uniquely identify A regardless of host
> or other transitory characteristics.
>
> An Old Proposition
> ==================
>
> There's been a proposal floated a few times to give each database a
> UUID so that it is uniquely identifiable for various reasons (ETags
> come to mind). Such a UUID, were it to exist, would allow us to
> uniquely identify a database in the above scenario.
>
> The first issue with db UUIDs that always pops up is that when
> someone copies a database (perhaps to short circuit an initial
> replication, or to restore a db when a machine fails), the UUID may
> no longer be globally unique.
>
> This would need to be fixed for transitive replication to have any
> chance of working. One solution that was mentioned was to have each
> CouchDB node remember all UUIDs that it knows about and, if a db is
> opened with an unknown UUID, assign that db a new UUID.
>
> This could be accomplished efficiently by storing _local docs in the
> replicator database that reference known UUID/dbname pairs. Then we
> just look up the UUID on db open and if it doesn't match the db name
> we reset it.
>
> For upgrade compatibility and the ability to change UUIDs often we
> could just store the UUID in the db header (as opposed to the first
> sixteen bytes of the file).
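>
> A sketch of that open-time check; maybe_reset_uuid/3 and keeping the
> known UUID/dbname pairs in a map are illustrative assumptions, not
> real CouchDB code:
>
>     %% On db open: keep the UUID only if this node last saw it under
>     %% the same db name; otherwise the file was copied (or is new
>     %% here), so assign a fresh UUID and remember the new pair.
>     maybe_reset_uuid(DbName, Uuid, Known) ->
>         case maps:find(Uuid, Known) of
>             {ok, DbName} ->
>                 {Uuid, Known};
>             _UnknownOrOtherName ->
>                 NewUuid = crypto:strong_rand_bytes(16),
>                 {NewUuid, maps:put(NewUuid, DbName, Known)}
>         end.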
>
> Information Propagation Requirements
> ====================================
>
> When replication occurs we need to inform the target database of a
> few pieces of information so that it knows about transitive
> replications that it contains. We also need to make sure that the
> target db doesn't learn about this information before it contains the
> entire replica set, and it needs to be processed in such a way that
> it doesn't require complete replications.
>
> These requirements pretty much force the replica state to be beamed
> across as the target receives information from the source update
> sequence. I.e., when we iterate the _changes feed we get extra info
> when we've arrived at an update_seq that wholly contains some prior
> replication from an arbitrary node to the *source*.
>
> Information to Propagate
> ========================
>
> Now we need to consider what information needs to exist on a db in
> order to figure out if we *can* short circuit a replication, as well
> as where we fast forward *to*.
>
> One obvious piece of information is the UUID of the database stream.
> A second piece would be the update_seq for that UUID. After some
> thought we also realize we need to store some more information to
> check if that UUID-update_seq pair is still valid when we go to
> fast-forward.
>
> The case that could invalidate a pair is if a database crashes and
> needs to be restored. Consider if A replicates to B, and B replicates
> to C. C has a state {A-UUID, A-update_seq}. Say A-update_seq is 10
> for this thought experiment. Now at some point after C learns of A,
> A crashes and is restored from backup. Now A is at update_seq 5. Now
> we go on with our business and write 5 docs to A. But we also write 5
> *different* docs than we wrote before the restore. This divergence in
> history would not be detectable without extra information.
>
> After much hand waving about rolling hashes, Adam decided to remember
> that we store a replication history between two dbs. This can be
> represented as a _local doc id that includes information on the pair
> of dbs as well as a random session id. If we include this data with
> the UUID-update_seq pair, when we check whether a short circuit is
> possible we can check that this record still exists.
>
> In the case of the crash/restore, even if we go and make the same
> edits and even have a similar replication history, the randomness of
> the session id will inform us that something has gone awry and we
> need to run a full replication to make sure we have all history.
>
> Information Required to Trigger Propagation
> ===========================================
>
> Along with the four pieces of information mentioned above, we also
> need to store what update_seq in the target database was the *result*
> of a replication. I.e., when we replicate A -> B, B needs to know the
> final update_seq of that replication transaction. This is so that
> when B replicates to C, it knows when to tell C about A. We can't do
> this at the very beginning because the replication might fail before
> all of the info from A is replicated. We also can't wait until the
> end because then C may never learn of A because of failure.
>
> This means that we need to know, for a given update_seq, whether
> after it has been replicated C can suddenly fast-forward a
> replication with someone other than B. To do this B will need to be
> able to stream its update sequence and efficiently check if that
> completes some replication record that C should know about.
>
> We might quickly assume that storing this in the existing update seq
> b+tree would be kosher, but it isn't. Consider the case where
> update_seq 6 on B is the end of the replication A -> B. Now consider
> that B starts replicating to C while someone starts updating the doc
> at update_seq 6 on B. It's possible that a series of events could
> lead to C never learning of A, because the update_seq for the doc id
> from 6 keeps jumping to the latest update_seq.
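>
> A toy model of why (plain maps standing in for the by_seq index;
> illustrative only):
>
>     %% Updating a doc removes its old entry from the by_seq index and
>     %% re-inserts it at the new highest seq, so any "replication of A
>     %% ends here" marker riding on that doc's entry jumps forward
>     %% with every edit.
>     update_doc(DocId, {BySeq, NextSeq}) ->
>         Pruned = maps:filter(fun(_Seq, Id) -> Id =/= DocId end, BySeq),
>         {maps:put(NextSeq, DocId, Pruned), NextSeq + 1}.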
>
> The proper way to fix this would be to insert code that says "when an
> update_seq entry is updated, move its replication info to the next
> update seq", which sounds like it could get really quite wonky.
>
> So the solution would be to have some sort of indexed structure of
> replication records that can be scanned to know when to announce that
> some replication has finished....
>
> Ruh Roh
> =======
>
> I just realized something wonky with this whole plan. We *don't*
> necessarily know when a replication ends because of update sequences.
> For instance, if we replicate A -> B, and then edit a doc from A on
> B, and then replicate B -> C, can we ever know when to short circuit
> a replication?
>
> This could be a huge gaping hole. Someone prove me wrong.
>
> Storing Replication State
> =========================
>
> With this new piece of information we'll also require some way to
> store replication state. This should hopefully be hand-wavy trivial:
> just store replication records in _local docs very similarly to how
> they're currently stored.

The important point of my ruh roh that I failed to articulate: the
reason this is bad is that when we edit the doc on B before
replicating to C, C *can't* know what's on A until it gets to the new
version of the doc on B. This, coupled with the fact that we can edit
anything on B and that all such edits jump to the end of the _changes
feed, makes me think we'd have to do some more extensive bookkeeping
to make sure that C doesn't learn about A until after all of A's docs
get pushed. Blargghhh....
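To make that concrete, a toy trace in the same illustrative style as
the by_seq sketch above:

    %% B's by_seq after A -> B: doc x came from A and sits at seq 6,
    %% so B's marker says "A is complete as of my seq 6".
    demo() ->
        BySeq0 = #{6 => x, 7 => y, 8 => z},
        %% x is edited on B before B -> C runs: its entry moves from
        %% seq 6 to the new highest seq, 9.
        BySeq1 = maps:put(9, x, maps:remove(6, BySeq0)),
        %% When C's replication passes B's seq 6 it has *not* actually
        %% received x yet, so announcing "A as of seq 6" to C at that
        %% point would be wrong.
        {maps:is_key(6, BySeq1), maps:get(9, BySeq1)}. %% -> {false, x}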