couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Davis <>
Subject Re: idea for transitive replication checkpoints
Date Fri, 18 Feb 2011 06:15:33 GMT
On Thu, Feb 17, 2011 at 9:45 PM, Adam Kocoloski <> wrote:
> Hi all, Paul and I were chatting at today's CouchDB Hack Night about a way to fast-forward
replications (thanks Max for the prodding!).  It's non-trivial, but I think the benefit for
big networks of CouchDB servers can be substantial.
> The basic idea is that if A replicates with B, and B with C, then a new replication between
A and C should not need to start from scratch.  I think we can accomplish this as follows:
> 1) Store the target update sequence along with the source sequence in the checkpoint
document, at least in the checkpoint document on the target.  The following tuple is important:
{Source, _local ID, Session ID, SourceSeq, TargetSeq}.  Using that syntax let's say we have
the following replication records:
> On A
> {A, _local/Foo, Bar, 5, _TargetSeq} % we could omit the target sequence on the source
> On B
> {A, _local/Foo, Bar, 5, 10} % 5 on A corresponds to 10 on B
> {B, _local/Baz, Bif, 15, _TargetSeq}
> On C
> {B, _local/Baz, Bif, 15, 7} % 15 on B corresponds to 7 on C
> We know that A -> B happened before B -> C.
> 2) During the B -> C replication, when we reach source sequence number 10, the _changes
feed from B will deliver some extra information like
> {A, _local/Foo, Bar, 5}
> which will be stored at C. This may require a new disk-resident btree keyed on update
sequence, or at least an in-memory index constructed by walking the _local docs btree.
> 3) When we trigger the A -> C replication, C will walk the full checkpoint records
in its _local tree and find no mention of A, but then it will also consult the "transitive"
checkpoints and find the {A, _local/Foo, Bar, 5} record.  It'll consult _local/Foo on A,
find that the session ID Bar is still present, and conclude that it can fast-forward the replication
and start from update sequence 5.  It will then remove that transitive checkpoint and replace
it with a full regular checkpoint.
> If server A crashes after the A -> B replication and restores from a backup that was
recorded before the replication, the session ID Bar will be missing from _local/Foo, so when
we try to do the A -> replication we won't fast forward.  This is the correct behavior.
> Hopefully this is comprehensible to someone other than me.  We spent some time trying
to poke holes in it, but it's entirely possible there are other things we didn't consider
that will prevent it from working.  Cheers,
> Adam

What Adam said. Also, I was just doing a brain dump and I think I
might've punched a gaping whole into the whole scenario. I'm not
entirely certain yet, but it seems ungood. There's a section "Ruh Roh"
towards the end where my brain dump froze up. Its late so maybe I'm
just not seeing the easy way around it.

There's also a picture of the end of our white board session at which probably means little to nothing
without the context of having seen it drawn and the ensuing argument
and wild gesticulations. But its there for posterity.


Transitive Replication - The Idea

Consider the following scenario:

1. Replicate A -> B
2. Replicate B -> C
3. Replicate A -> C

For simplicity's sake, assume no writes occur during this scenario. The
question is why can't we short circuit step 3 to effectively be a no-op?

Current Situation

Replication state is based on a pair-wise state reflecting source and
target information (and filter functions etc). For the above scenario to
be anywhere near plausible a couple things need to happen. First, we'll
obviously need to transfer data from B -> C during replication so it
has knowledge about A. This information will have to be complete enough
to short circuit (or skip part of) a replication from A.

The information that B sends to C will need to enable a replication from
A to C to occur without error in any sort of pathological state of A
irregardless of what state C thinks A is in. Changes in state may include
A "forgetting" some edits and resetting to a point in time the state
that C has (for instance, A crashed and was recovered to a previous
point in time).

C will also need to be able to uniquely identify A regardless of host or
other transitory characteristics.

An Old Proposition

There's been a proposal floated a few times for a few different reasons
to give each database a UUID so that it is uniquely identifiable for
various reasons (ETags come to mind). Such a UUID were it to exist would
allow us to uniquely identify a database in the above scenario.

The first issue with db UUID's that always pops up is that we have to
address the case of what happens when someone copies a database (perhaps
to short circuit an initial replication, or restoring a db when a
machine fails) is that the UUID may no longer be globally unique.

This would need to be fixed for transitive replication to have any
chance of working. One solution that was mentioned was to have each
CouchDB node remember all UUID's that it knows about and if a db is
opened with an unknown UUID, that db gets a new UUID assigned.

This could be accomplished efficiently by storing _local docs in the
replicator database that reference known UUID/dbname pairs. Then we
just lookup the UUID on db open and if it doesn't match the db name
we reset it.

For upgrade compatibility and the ability to change UUID's often we
could just store the UUID in the db header (as opposed to the first
sixteen bytes of the file).

Information Propagation Requirements

When replication occurs we need to inform the target database of a
few pieces of information so that it knows about transitive replications
that it contains. We also need to make sure that the target db doesn't
learn about this information before it contains the entire replica set
and it needs to be processed in such a way that it doesn't require
complete replications.

These requirements pretty much lead us to the fact that the replica
state will need to be beamed across as the target receives information
from the source update sequence. Ie, when we iterate the _changes feed
we get extra info when we've arrived an update_seq that wholly contains
some prior replication from an arbitrary node to the *source*.

Information to Propagate

Now we need to consider what information needs to exist on a db in
order to figure out if we *can* short circuit a replication as well as
where we fast forward *to*.

One obvious piece of information is the UUID of the database stream. A
second piece would be the update_seq for that UUID. After some thought
we also realize we need to store some more information to check if that
UUID-update_seq pair is still valid when we go to fast-forward.

The case that could invalidate a pair is if a database crashes and it
needs to be restored. Consider if A replicates to B replicates to C. C
has a state {A-UUID, A-update_seq}. Say A-update_seq is 10 for this
thought experiment. Now at some point after C learns of A, A crashes and
is restored from backup. Now A is at update_seq 5. Now we go on with
our business and write 5 docs to A. But we also write 5 *different* docs
than we wrote before the restore. This divergence in history would not
be detectable without extra information.

After much hand waving about rolling hashes, Adam decided to remember
that we store a replication history between two db's. This can be
represented as a _local doc id that includes information on the pair
of db's as well as a random session id. If we include this data with
the UUID-update_seq pair, when we check if a short circuit is possible
we can check that this record still exists.

In the case of the crash/restore even if we go and make the same edits
and even have a similar replication history, the randomness to the
session id will inform us that something has gone awry and we need to
run a full replication to make sure we have all history.

Information Required to Trigger Propagation

Along with the four pieces of information mentioned above, we also need
to store what update_seq in the target database was the *result* of a
replication. Ie, when we replicate A -> B, B needs to know the final
update_seq of that replication transaction. This is so that when B
replicates to C, it knows when to tell C about A. We can't do this at the
very beginning because the replication might fail before all of the
info from A is replicated. We also can't wait until the end because then
C may never learn of A because of failure.

This means that we need to know for a given update_seq if after it has
been replicated, C can suddenly fast-forward a replication with someone
other than B. To do this B will need to be able to stream its update
sequence and efficiently check if that completes some replication record
that C should know about.

We might quickly assume that storing this in the existing update seq
b+tree would be kosher, but it isn't. Consider the case where update_seq
6 on B is the end of the replication A -> B. Now consider that B starts
replicating to C while someone starts updating the doc for update_seq
6 on B. Its possible that a series of events could lead to C never
learning of A because the update_seq for the doc id from 6 keeps jumping
to the latest update_seq.

The proper way to fix this would be to insert code that says "when an
update_seq entry is updated, move its replication info to the next update
seq" which sounds like it could get really quite wonky.

So the solution would be to have some sort of indexed structure of
replication records that can be scanned to know when to send out some
replication finished....

Ruh Roh

I just realized something wonky with this whole plan. We *don't*
necessarily know when a replication ends because of update sequences. For
instance, if we replicate A -> B, and then edit a doc from A on B, and then
replicate B -> C, can we ever know when to short circuit a replication?

This could be a huge gaping whole. Someone prove me wrong.

Storing Replication State

With this new piece of information we'll also require some way to store
replication state. This should hopefully be hand-wavy trivial by just
storing replication records in _local docs very similarly to how they're
currently stored.


View raw message