couchdb-user mailing list archives

From: svilen
Subject: Re: Selective Replication
Date: Wed, 26 Sep 2012 09:50:09 GMT
Wild guess: do you have loops in the "graph", or is it a pure tree?
I.e., can the "named replication" repeat some documents if they are
referenced many times?

Apart from that, I guess one stream/replication of N docs would be
faster than N replications of 1 doc each, but I wonder why it
decelerates as N grows. BTW, the all-in-one replication would be
starting against an empty target, while the single-doc replications
would be going to an ever-growing target each time.
Can you try a full all-in-one replication of 100000 docs over to a
target that already holds another 100000 docs, and see how that
measures? Maybe that gives a hint...
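For reference, the filtered-replication variant discussed in the quoted message below is typically set up as a filter function in a design document plus a matching `_replicate` request. This is only a sketch; the design document name, the filter name, and the "targets" field are assumptions, not part of Frank's actual model:

```json
{
  "_id": "_design/sync",
  "filters": {
    "by_target": "function (doc, req) { return doc.targets && doc.targets.indexOf(req.query.target) !== -1; }"
  }
}
```

A replication using this filter would then be started by POSTing a body like the following to `_replicate`, with the slave name passed through `query_params` so the filter can see it in `req.query`:

```json
{
  "source": "master",
  "target": "http://slave1.example.com:5984/assets",
  "continuous": true,
  "filter": "sync/by_target",
  "query_params": { "target": "slave1" }
}
```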


On Wed, 26 Sep 2012 17:34:50 +0200
Frank Wunderlich <> wrote:

> Hi *,
> I am currently trying to figure out how one could realize something
> like "selective replication" in CouchDB.
> In our scenario, we have around 10 physically distributed CouchDB
> instances running. There will probably be more than 1 million
> documents in our "master" instance. Only a subset of those documents
> shall be replicated to each of the "slave" instances. Users shall be
> able to explicitly control which documents get synchronized to
> which destination.
> So far I have stumbled over the following two concepts:
> 1. Filtered Replication
> 2. Named Document Replication
> At first glance, replication filters seemed to be the way to go.
> But unfortunately we have a quite "relational" document model.
> One logical "asset" consists of several CouchDB documents
> referencing each other.
> The filter functions can only access data that is part of the
> document passed in as a parameter. Because of this limitation,
> each partial document must contain all the information necessary
> to determine whether it shall be replicated or not.
> This leads to redundancy, and to potential inconsistencies if a
> "transaction" fails: inconsistent asset aggregates might get
> "partially" transferred to other CouchDB instances. And in my eyes,
> it will be hard to recognize and track down the cause of such
> inconsistencies.
> Furthermore, our content documents get "polluted" by purely
> technical attributes.
> That's why we took a look at the second option: Named Document
> Replication.
> It seemed to be a good idea to separate the two concerns of
> persistence and synchronization. First, we would persist any
> "logical asset" in our local CouchDB. Once we know that this step
> has succeeded and all partial documents are stored in the database,
> we would "register" the "logical asset" for synchronization. This
> step would happen in the application layer that is built on top of
> our CouchDB.
> The registration process would look up all partial documents that
> make up the "logical asset". Then any running replication job would
> get canceled (assuming we are using continuous replication). Finally,
> we would restart those replication jobs, adding the identified
> document IDs to the JSON that gets posted to the _replicate URL.
> The first attempts seemed promising.
> But when experimenting with larger sets of documents, we noticed a
> significant performance degradation during replication. With 100,000
> documents to be replicated, the "Named Document Replication" was 4
> times slower than the complete and unconditional replication of the
> whole database. With 200,000 documents, the selective approach was
> even 7 times slower. With 1,000,000 documents, the factor was > 20.
> So this approach does not scale well...
> What are your thoughts about this?
> Is there anyone who has faced similar architectural questions? 
> Any hint will be appreciated.
> Best regards,
> Frank
> --
> kreuzwerker GmbH - we touch running systems
> fon  +49 177 8780280  | fax +49 30  6098388-99 
> Ritterstraße 12-14, 10969 Berlin |
> HR B 129427 | Amtsgericht Charlottenburg  |  Geschäftsführer: Tilmann
> Eing  
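For reference, the "named document replication" registration step described above boils down to POSTing a body with an explicit `doc_ids` list to `_replicate`. A sketch with hypothetical database names and document IDs:

```json
{
  "source": "master",
  "target": "http://slave1.example.com:5984/assets",
  "continuous": true,
  "doc_ids": ["asset-4711", "asset-4711-part-1", "asset-4711-part-2"]
}
```

Cancelling the running job before restarting it with an extended ID list is done by posting the same body again with `"cancel": true` added. One plausible contributor to the observed slowdown, though this is speculation: with `doc_ids`, the replicator fetches each listed document individually instead of streaming the changes feed, so per-document overhead grows with the size of the list.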
