incubator-couchdb-user mailing list archives

From Frank Wunderlich <frank.wunderl...@kreuzwerker.de>
Subject Selective Replication
Date Wed, 26 Sep 2012 15:34:50 GMT
Hi *,

I am currently trying to figure out how one could realize something like "selective replication"
in CouchDB.

In our scenario we have around 10 physically distributed CouchDB instances running.
There will probably be more than 1 million documents in our "master" instance.
Only a subset of those documents shall be replicated to each of the "slave" instances.
Users shall be able to explicitly control which documents are synchronized to which
destination.

So far I have come across the following two concepts:
1. Filtered Replication
2. Named Document Replication

At first glance, replication filters seemed to be the way to go.
But unfortunately we have a quite "relational" document model:
one logical "asset" consists of several CouchDB documents that reference each other.

A filter function can only access data that is part of the one document passed in
as its parameter.
Because of this limitation, each partial document must contain all the information
necessary to determine whether it should be replicated or not.
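To illustrate the limitation: such a filter would have to rely on a routing field present in every single partial document. Here is a rough sketch; the field name "replication_target", the design-doc name, and the host names are made up for illustration, not part of our real model:

```python
import json

# Sketch of a replication filter. The JavaScript function only ever sees
# the single document it is called with, so every partial document must
# carry its own routing information ("replication_target" is illustrative).
filter_ddoc = {
    "_id": "_design/sync",
    "filters": {
        "by_target": """function(doc, req) {
            if (!doc.replication_target) { return false; }
            return doc.replication_target === req.query.target;
        }"""
    },
}

# The filter is then referenced in the body posted to /_replicate;
# "slave-3" is a placeholder target instance.
replicate_body = {
    "source": "master",
    "target": "http://slave-3:5984/assets",
    "filter": "sync/by_target",
    "query_params": {"target": "slave-3"},
    "continuous": True,
}
print(json.dumps(replicate_body, indent=2))
```

It is exactly that per-document routing field which causes the redundancy described below.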

This leads to redundancy and to potential inconsistencies if a "transaction" fails:
inconsistent asset aggregates might get only partially transferred to other CouchDB instances.
And in my eyes it will be hard to recognize and track down the cause of such inconsistencies.

Furthermore, our content documents get "polluted" by purely technical attributes.


That's why we took a look at the second option: Named Document Replication.

It seemed to be a good idea to separate the two concerns of persistence and synchronization.
First we would persist any "logical asset" in our local CouchDB.
Once we know that this step succeeded and all partial documents were stored in the database,
we would "register" the "logical asset" for synchronization.
This step would happen in the application layer that is built on top of our CouchDB.

The registration process would look up all partial documents that make up the "logical asset".
Then any running replication job would be canceled (assuming we are using continuous replication).
Finally, we would restart those replication jobs, adding the identified document ids to
the JSON that gets posted to the _replicate URL.
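To make that concrete, the registration step could look roughly like this. Again just a sketch: the view, the field name "asset_id", the document ids, and the host names are invented for illustration:

```python
import json

# Sketch of the lookup: a view keyed by a logical asset id, so that one
# query returns all partial documents of an asset ("asset_id" is an
# assumed field name, not our real model).
lookup_ddoc = {
    "_id": "_design/assets",
    "views": {
        "by_asset": {
            "map": """function(doc) {
                if (doc.asset_id) { emit(doc.asset_id, null); }
            }"""
        }
    },
}

# Suppose querying that view for one asset yielded these ids:
doc_ids = ["asset-42", "asset-42-meta", "asset-42-rights"]

# Cancel the running continuous job by posting the same body plus
# "cancel": true to /_replicate ...
cancel_body = {
    "source": "master",
    "target": "http://slave-3:5984/assets",
    "continuous": True,
    "cancel": True,
}

# ... and restart it with the identified document ids added:
restart_body = {
    "source": "master",
    "target": "http://slave-3:5984/assets",
    "continuous": True,
    "doc_ids": doc_ids,
}
print(json.dumps(restart_body))
```

So persistence stays plain CouchDB, and only this registration step knows anything about synchronization targets.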

The first attempts seemed promising.
But when experimenting with larger sets of documents, we noticed a significant performance
degradation during replication.
With 100,000 documents to be replicated, the "Named Document Replication" was 4 times slower
than the complete and unconditional replication of the whole database.
With 200,000 documents, the selective approach was even 7 times slower.
With 1,000,000 documents, the factor was greater than 20.

So this approach does not scale well...

What are your thoughts about this?
Is there anyone who has faced similar architectural questions? 

Any hint will be appreciated.
Best regards,
Frank



--
kreuzwerker GmbH - we touch running systems
fon  +49 177 8780280  | fax +49 30  6098388-99 
Ritterstraße 12-14, 10969 Berlin | frank.wunderlich@kreuzwerker.de
HR B 129427 | Amtsgericht Charlottenburg  |  Geschäftsführer: Tilmann Eing  

