couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Kocoloski <>
Subject Re: idea for transitive replication checkpoints
Date Fri, 18 Feb 2011 19:33:47 GMT
On Feb 18, 2011, at 11:16 AM, Paul Davis wrote:

> On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <> wrote:
>> On Fri, Feb 18, 2011 at 1:23 AM, Paul Davis <> wrote:
>>> On Fri, Feb 18, 2011 at 1:15 AM, Paul Davis <>
>>>> On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <>
>>>>> Hi all, Paul and I were chatting at today's CouchDB Hack Night about
a way to fast-forward replications (thanks Max for the prodding!).  It's non-trivial, but
I think the benefit for big networks of CouchDB servers can be substantial.
>>>>> The basic idea is that if A replicates with B, and B with C, then a new
replication between A and C should not need to start from scratch.  I think we can accomplish
this as follows:
>>>>> 1) Store the target update sequence along with the source sequence in
the checkpoint document, at least in the checkpoint document on the target.  The following
tuple is important: {Source, _local ID, Session ID, SourceSeq, TargetSeq}.  Using that syntax
let's say we have the following replication records:
>>>>> On A
>>>>> {A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target sequence
on the source
>>>>> On B
>>>>> {A, _local/Foo, Bar, 5, 10} % 5 on A corresponds to 10 on B
>>>>> {B, _local/Baz, Bif, 15, _TargetSeq}
>>>>> On C
>>>>> {B, _local/Baz, Bif, 15, 7} % 15 on B corresponds to 7 on C
>>>>> We know that A -> B happened before B -> C.
>>>>> 2) During the B -> C replication, when we reach source sequence number
10, the _changes feed from B will deliver some extra information like
>>>>> {A, _local/Foo, Bar, 5}
>>>>> which will be stored at C. This may require a new disk-resident btree
keyed on update sequence, or at least an in-memory index constructed by walking the _local
docs btree.
>>>>> 3) When we trigger the A -> C replication, C will walk the full checkpoint
records in its _local tree and find no mention of A, but then it will also consult the "transitive"
checkpoints and find the {A, _local/Foo, Bar, 5} record.  It'll consult _local/Foo on A, find
that the session ID Bar is still present, and conclude that it can fast-forward the replication
and start from update sequence 5.  It will then remove that transitive checkpoint and replace
it with a full regular checkpoint.
>>>>> If server A crashes after the A -> B replication and restores from
a backup that was recorded before the replication, the session ID Bar will be missing from
_local/Foo, so when we try to do the A -> replication we won't fast forward.  This is the
correct behavior.
>>>>> Hopefully this is comprehensible to someone other than me.  We spent
some time trying to poke holes in it, but it's entirely possible there are other things we
didn't consider that will prevent it from working.  Cheers,
>>>>> Adam
>>>> What Adam said. Also, I was just doing a brain dump and I think I
>>>> might've punched a gaping whole into the whole scenario. I'm not
>>>> entirely certain yet, but it seems ungood. There's a section "Ruh Roh"
>>>> towards the end where my brain dump froze up. Its late so maybe I'm
>>>> just not seeing the easy way around it.
>>>> There's also a picture of the end of our white board session at
>>>> which probably means little to nothing
>>>> without the context of having seen it drawn and the ensuing argument
>>>> and wild gesticulations. But its there for posterity.
>>>> <brain_dump>
>>>> Transitive Replication - The Idea
>>>> =================================
>>>> Consider the following scenario:
>>>> 1. Replicate A -> B
>>>> 2. Replicate B -> C
>>>> 3. Replicate A -> C
>>>> For simplicity's sake, assume no writes occur during this scenario. The
>>>> question is why can't we short circuit step 3 to effectively be a no-op?
>>>> Current Situation
>>>> =================
>>>> Replication state is based on a pair-wise state reflecting source and
>>>> target information (and filter functions etc). For the above scenario to
>>>> be anywhere near plausible a couple things need to happen. First, we'll
>>>> obviously need to transfer data from B -> C during replication so it
>>>> has knowledge about A. This information will have to be complete enough
>>>> to short circuit (or skip part of) a replication from A.
>>>> The information that B sends to C will need to enable a replication from
>>>> A to C to occur without error in any sort of pathological state of A
>>>> irregardless of what state C thinks A is in. Changes in state may include
>>>> A "forgetting" some edits and resetting to a point in time the state
>>>> that C has (for instance, A crashed and was recovered to a previous
>>>> point in time).
>>>> C will also need to be able to uniquely identify A regardless of host or
>>>> other transitory characteristics.
>>>> An Old Proposition
>>>> ==================
>>>> There's been a proposal floated a few times for a few different reasons
>>>> to give each database a UUID so that it is uniquely identifiable for
>>>> various reasons (ETags come to mind). Such a UUID were it to exist would
>>>> allow us to uniquely identify a database in the above scenario.
>>>> The first issue with db UUID's that always pops up is that we have to
>>>> address the case of what happens when someone copies a database (perhaps
>>>> to short circuit an initial replication, or restoring a db when a
>>>> machine fails) is that the UUID may no longer be globally unique.
>>>> This would need to be fixed for transitive replication to have any
>>>> chance of working. One solution that was mentioned was to have each
>>>> CouchDB node remember all UUID's that it knows about and if a db is
>>>> opened with an unknown UUID, that db gets a new UUID assigned.
>>>> This could be accomplished efficiently by storing _local docs in the
>>>> replicator database that reference known UUID/dbname pairs. Then we
>>>> just lookup the UUID on db open and if it doesn't match the db name
>>>> we reset it.
>>>> For upgrade compatibility and the ability to change UUID's often we
>>>> could just store the UUID in the db header (as opposed to the first
>>>> sixteen bytes of the file).
>>>> Information Propagation Requirements
>>>> ====================================
>>>> When replication occurs we need to inform the target database of a
>>>> few pieces of information so that it knows about transitive replications
>>>> that it contains. We also need to make sure that the target db doesn't
>>>> learn about this information before it contains the entire replica set
>>>> and it needs to be processed in such a way that it doesn't require
>>>> complete replications.
>>>> These requirements pretty much lead us to the fact that the replica
>>>> state will need to be beamed across as the target receives information
>>>> from the source update sequence. Ie, when we iterate the _changes feed
>>>> we get extra info when we've arrived an update_seq that wholly contains
>>>> some prior replication from an arbitrary node to the *source*.
>>>> Information to Propagate
>>>> ========================
>>>> Now we need to consider what information needs to exist on a db in
>>>> order to figure out if we *can* short circuit a replication as well as
>>>> where we fast forward *to*.
>>>> One obvious piece of information is the UUID of the database stream. A
>>>> second piece would be the update_seq for that UUID. After some thought
>>>> we also realize we need to store some more information to check if that
>>>> UUID-update_seq pair is still valid when we go to fast-forward.
>>>> The case that could invalidate a pair is if a database crashes and it
>>>> needs to be restored. Consider if A replicates to B replicates to C. C
>>>> has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this
>>>> thought experiment. Now at some point after C learns of A, A crashes and
>>>> is restored from backup. Now A is at update_seq 5. Now we go on with
>>>> our business and write 5 docs to A. But we also write 5 *different* docs
>>>> than we wrote before the restore. This divergence in history would not
>>>> be detectable without extra information.
>>>> After much hand waving about rolling hashes, Adam decided to remember
>>>> that we store a replication history between two db's. This can be
>>>> represented as a _local doc id that includes information on the pair
>>>> of db's as well as a random session id. If we include this data with
>>>> the UUID-update_seq pair, when we check if a short circuit is possible
>>>> we can check that this record still exists.
>>>> In the case of the crash/restore even if we go and make the same edits
>>>> and even have a similar replication history, the randomness to the
>>>> session id will inform us that something has gone awry and we need to
>>>> run a full replication to make sure we have all history.
>>>> Information Required to Trigger Propagation
>>>> ===========================================
>>>> Along with the four pieces of information mentioned above, we also need
>>>> to store what update_seq in the target database was the *result* of a
>>>> replication. Ie, when we replicate A -> B, B needs to know the final
>>>> update_seq of that replication transaction. This is so that when B
>>>> replicates to C, it knows when to tell C about A. We can't do this at the
>>>> very beginning because the replication might fail before all of the
>>>> info from A is replicated. We also can't wait until the end because then
>>>> C may never learn of A because of failure.
>>>> This means that we need to know for a given update_seq if after it has
>>>> been replicated, C can suddenly fast-forward a replication with someone
>>>> other than B. To do this B will need to be able to stream its update
>>>> sequence and efficiently check if that completes some replication record
>>>> that C should know about.
>>>> We might quickly assume that storing this in the existing update seq
>>>> b+tree would be kosher, but it isn't. Consider the case where update_seq
>>>> 6 on B is the end of the replication A -> B. Now consider that B starts
>>>> replicating to C while someone starts updating the doc for update_seq
>>>> 6 on B. Its possible that a series of events could lead to C never
>>>> learning of A because the update_seq for the doc id from 6 keeps jumping
>>>> to the latest update_seq.
>>>> The proper way to fix this would be to insert code that says "when an
>>>> update_seq entry is updated, move its replication info to the next update
>>>> seq" which sounds like it could get really quite wonky.
>>>> So the solution would be to have some sort of indexed structure of
>>>> replication records that can be scanned to know when to send out some
>>>> replication finished....
>>>> Ruh Roh
>>>> =======
>>>> I just realized something wonky with this whole plan. We *don't*
>>>> necessarily know when a replication ends because of update sequences. For
>>>> instance, if we replicate A -> B, and then edit a doc from A on B, and
>>>> replicate B -> C, can we ever know when to short circuit a replication?
>>>> This could be a huge gaping whole. Someone prove me wrong.
>>>> Storing Replication State
>>>> =========================
>>>> With this new piece of information we'll also require some way to store
>>>> replication state. This should hopefully be hand-wavy trivial by just
>>>> storing replication records in _local docs very similarly to how they're
>>>> currently stored.
>>>> </brain_dump>
>>> The important point of my ruh roh to realize that I failed to
>>> articulate, the reason that this is bad is that if when we edit the
>>> doc on B before replication to C, C *can't* know what's on A until it
>>> gets to the new version of the doc in B. This coupled with the fact
>>> that we can edit anything on B, and that they all jump to the end
>>> makes me think that we'd have to do some more extensive bookkeeping to
>>> make sure that C doesn't know about B until after all of A's docs get
>>> pushed.
>>> Blargghhh....
>> Doesn't know about A until all of A's docs get pushed. Its late. I'm out.
> After sleeping on it, I think that this doesn't shoot the whole idea
> out of the sky, but requires us to only send the info when a
> replication manages to reach the end of the update_seq btree in a
> single db snapshot. I'm not sure if that means that it'd be out of the
> question for continuous replication or not.

Hi Paul, thanks for this articulate writeup.  I think you're correct in this last email, we
can only send these extra bits of information about other replications whenever we've reached
the end of an MVCC snapshot from the current source.  That shouldn't be a problem for continuous
replication, since under the hood it's implemented as a loop of "open / walk seq_tree / wait
for new updates" calls.  We can just send any new transitive checkpoints that we encountered
during the current walk just before going into the "wait for new updates" step.

View raw message