couchdb-dev mailing list archives

From Antony Blakey <>
Subject Replication and forms of weak consistency
Date Mon, 16 Feb 2009 02:30:30 GMT
I've recently been considering replication models, and looking at  
relevant prior art. I'd like to start a discussion about the best  
replication model for CouchDB, in the hope of garnering both support  
and help in implementing a replication model that provides a stronger  
form of weak consistency under replication than CouchDB currently  
provides. This can be done without sacrificing any of the  
pre-determined goals of CouchDB.

There are two research streams that I've been following. The first is  
Bayou, for which this is a good entry point:

Bayou is somewhat more powerful than CouchDB because  
it provides consistency guarantees while reading from groups of  
servers.
The second is PRACTI, which starts here:

. The interesting thing about PRACTI from my point of view is how it  
extends weak-consistency to partial replication.

There's also an interesting set of papers here:

, although most of them aren't directly applicable.

Firstly though, it's worth considering the characteristics of  
CouchDB's current replication system.

The background to this issue is the CAP dilemma, described and  
analysed here:

The PRACTI paper summarizes this as "the CAP dilemma states that a  
replication system that provides sequential Consistency cannot  
simultaneously provide 100% Availability in an environment that can be  
partitioned."
CouchDB is a virtually-always-partitioned system that provides 100%  
availability (at any given node). Nodes themselves are not partition  
tolerant, and hence can provide arbitrary consistency guarantees,  
including sequential Consistency as represented by serializable  
transactions. It is intended however that CouchDB provide a cluster  
architecture. Although the only extant suggestion for this presumes  
partition tolerant clustering ( 
), this is but one model of a cluster architecture. I would argue that  
this is little more than a load-balancing proxy, and that there are  
alternative cluster architectures that provide significant benefits,  
although this may be begging the question.

For the purposes of initial discussion, the cluster issue isn't  
relevant, although it is an issue when considering isolated write  
sequences, which are roughly analogous to Bayou's sessions, and are a  
very useful replacement for traditional ACID transactions.

The key issue is that there are forms of consistency that, while less  
than 'sequential consistency' i.e. distributed transactions, are still  
useful. Specifically, Bayou provides the following:

1. Read Your Writes - read operations reflect previous writes.
2. Monotonic Reads - successive reads reflect a non-decreasing set of  
writes.
3. Writes Follow Reads - writes are propagated after reads on which  
they depend.
4. Monotonic Writes - writes are propagated after writes that  
logically precede them.
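Monotonic Writes can be made concrete with a small sketch (hypothetical machinery, not CouchDB's actual code): a replica holds back any incoming write until every write that logically precedes it has been applied.

```python
# Sketch (hypothetical): a replica enforcing Monotonic Writes by
# refusing to apply a write until all writes that logically precede
# it have been applied.

class Replica:
    def __init__(self):
        self.applied = set()   # ids of writes already applied
        self.pending = []      # writes whose predecessors are missing

    def receive(self, write_id, preceding_ids):
        """Buffer the write, then apply everything whose deps are met."""
        self.pending.append((write_id, set(preceding_ids)))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for item in list(self.pending):
                write_id, deps = item
                if deps <= self.applied:       # all predecessors applied
                    self.applied.add(write_id)
                    self.pending.remove(item)
                    progress = True

r = Replica()
r.receive("A1", ["A0"])   # A0 not yet seen: A1 is held back
assert "A1" not in r.applied
r.receive("A0", [])       # predecessor arrives; both now apply
assert {"A0", "A1"} <= r.applied
```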

Monotonic Writes, sometimes called write-ordering, is the specific  
form of weak-consistency that interests me in the context of CouchDB.

Consider two documents, A and B, with write versions indicated by  
numeric suffixes e.g. A0, A1, A2 etc. A local application makes a  
series of writes:

   [ A0, B0, A1 ]

Couch currently replicates this as

   [ A0-, B0, A1 ]

where A0- indicates that the document is replicated without its  
data. The replicator chooses not to provide the data for A0-, only  
noting that the revision exists. If the database is compacted, however,  
then the replicator no longer has any choice - the data for A0 no  
longer exists.

It might seem that this doesn't matter, but because replication isn't  
atomic, the replication target can, at any time and for any length of  
time (possibly permanently) see an arbitrary prefix of the replication  
stream, such as this:

   [ B0 ]

As far as I can tell, it won't see A0- until it sees A1, although this  
doesn't affect this discussion. The point is that the target doesn't  
see writes in the order that they occur in the source, and state- 
consistency is only reached when the replication reaches the source  
write-point, which, ignoring the read/write ratio, is by no means  
guaranteed in an unreliable environment.

To make this more concrete - imagine that A is a blog post and B is a  
comment. It's possible for running code to see a comment without a  
blog post. This isn't the end of the world in this example, but it  
does complicate applications which use this data, and unnecessarily,  
as Bayou and PRACTI show. In the face of failure, either temporary  
(comms) or permanent node failure, the target will see a view of the  
source that possibly cannot be made write-order consistent.  
Write-order consistency is a very useful and greatly simplifying  
feature for applications.
This situation is exacerbated by per-document revision stemming.


On the surface, the simplest solution to this is to retain and  
replicate each revision of a document, in MVCC commit order. The  
result of this is that every intermediate state that the target sees  
during replication is consistent with the write ordering in the  
source. Incremental replication thus maintains write-order  
consistency, even in the face of failure.

An obvious optimisation is that this: [ ... A0, A1, A2 ... ] can be  
replicated as this: [ ... A2 ... ], because the intermediate writes  
aren't relevant, although see my caveat below.
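That optimisation can be sketched as a hypothetical helper: consecutive revisions of the *same* document collapse to the last one, but only where no write to another document intervenes, since otherwise the write ordering would be violated.

```python
def collapse(stream):
    """Collapse consecutive revisions of the same document to the
    last one. Safe because no write to a *different* document falls
    between the collapsed revisions, so every intermediate state the
    target sees is still write-order consistent with the source."""
    out = []
    for doc, rev in stream:
        if out and out[-1][0] == doc:   # same document as previous item
            out[-1] = (doc, rev)        # later revision wins
        else:
            out.append((doc, rev))
    return out

# [ A0, A1, A2, B0 ] collapses to [ A2, B0 ] ...
assert collapse([("A", 0), ("A", 1), ("A", 2), ("B", 0)]) == \
       [("A", 2), ("B", 0)]
# ... but [ A0, B0, A1 ] must not collapse: B0 intervenes.
assert collapse([("A", 0), ("B", 0), ("A", 1)]) == \
       [("A", 0), ("B", 0), ("A", 1)]
```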

If you allow for *local* multi-document ACID commits then you can  
significantly optimise replication, with the added advantage of being  
able to provide a weak-consistency equivalent to transactions. The  
idea is that you can group writes into an isolation group e.g.

   [ ... A1, B3, C2 .... ]

Concurrent access on the local node cannot see any intermediate state  
e.g. the three writes are ACID. Note that the 'C' in ACID doesn't mean  
that the write will fail if there are conflicts - you can choose for  
that to be the case on a per-group-write basis on the local node, but  
when it's replicated you don't have that choice - it will commit  
regardless. The key property here is really Isolation, rather than  
Consistency.
It's not difficult to replicate such isolation groups - you simply  
wrap the writes in a start/end group in the replication stream, and  
replication uses local ACID with commit-on-conflict semantics. If the  
replication stream doesn't see the end group marker because of comms  
failure, then the group isn't written.
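A sketch of that target-side behaviour (markers and names hypothetical): writes between group markers are buffered and committed as one local ACID write only when the end marker arrives; a group truncated by comms failure is simply dropped.

```python
def apply_stream(stream, db):
    """Apply a replication stream containing isolation groups.
    Writes between BEGIN and END are buffered and committed together;
    a group whose END marker never arrives is discarded."""
    buffer, in_group = [], False
    for item in stream:
        if item == "BEGIN":
            buffer, in_group = [], True
        elif item == "END":
            for doc, rev in buffer:   # local ACID commit of the group
                db[doc] = rev
            in_group = False
        elif in_group:
            buffer.append(item)
    # anything left buffered without an END marker is never written

db = {}
apply_stream(["BEGIN", ("A", 1), ("B", 3), "END",
              "BEGIN", ("C", 2)], db)   # second group truncated
assert db == {"A": 1, "B": 3}           # C2 was never committed
```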

This allows the replication process itself to be adaptively optimised  
even if such isolation groups aren't exposed to the user. Consider a  
replication stream:

   [ ... A1, B3, C2, A2, B4, A3 ... ]

This can be replicated as:

   [ ... { C2, B3-, B4, A1-, A2-, A3 } ... ]

or even as:

   [ ... { C2, B4, A3 } ... ]

where { } delimit isolation groups. Once again though, see the caveat.
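The fully collapsed form can be sketched with a hypothetical helper: inside an isolation group the internal order is irrelevant, because no intermediate state is ever observable, so only the last revision of each document need be kept.

```python
def group_collapse(stream):
    """Reduce a span of the stream to one isolation group holding
    only the last revision of each document. Safe because the group
    commits atomically, so no intermediate state is ever visible."""
    last = {}
    for doc, rev in stream:
        last[doc] = rev              # later revisions win
    return set(last.items())         # order inside a group is irrelevant

# [ A1, B3, C2, A2, B4, A3 ] collapses to the group { C2, B4, A3 }.
assert group_collapse([("A", 1), ("B", 3), ("C", 2),
                       ("A", 2), ("B", 4), ("A", 3)]) == \
       {("C", 2), ("B", 4), ("A", 3)}
```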

Finally, the existing compaction and the proposed revision stemming  
are incompatible with write-ordering consistency. Bayou uses a simple  
point-in-time truncation of the history (which is linear in the db),  
and when it gets a replication request that requires more history than  
it has, it synchronizes the entire database. This is an issue for  
availability, because the target needs to be locked while the missing  
history prefix is synchronised, to ensure that the target doesn't see  
an inconsistent write-ordering.



The reason the above is tentative, is that it only considers two  
peers. Multiple peers can have write dependencies caused by multiple  
replications between arbitrary peers. I haven't thought through that  
yet. This paper has some information on this issue in a slightly more  
challenging context:


And that's as far as my thinking has progressed. Write-order  
consistency in the face of partial replication introduces some new  
issues.
Antony Blakey
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

The fact that an opinion has been widely held is no evidence whatever  
that it is not utterly absurd.
   -- Bertrand Russell
