couchdb-user mailing list archives

From Stefan du Fresne <ste...@medicmobile.org>
Subject Re: The state of filtered replication
Date Wed, 25 May 2016 10:39:49 GMT
Hi Pedro,

Thanks for your advice.

This is definitely something that is in the back of our minds, along with looking into CouchDB
clustering. Another similar option we’re considering is having filtered replication between
those replicas and having them represent regions (our data permission structure is basically
report <- person <- family <- region <- larger region <- still larger region).
This would still involve filtered replication, but would cut down on the irrelevant documents
that each user’s replication has to filter through. We’re still at the stage of trying to get
the most out of one server, however.
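
To make the regional idea concrete, here is a rough sketch of how splitting by region might reduce the number of filter evaluations. Everything in it is hypothetical: the `PARENT` map, the document and user names, and the `replicate` function are illustrations of the hierarchy described above, not our actual filter or any CouchDB API.

```python
# Hypothetical hierarchy: child id -> parent id
# (report <- person <- family <- region <- larger region).
PARENT = {
    "report-1": "person-a",
    "person-a": "family-x",
    "family-x": "region-north",
    "report-2": "person-b",
    "person-b": "family-y",
    "family-y": "region-south",
    "region-north": "country",
    "region-south": "country",
}
REGIONS = {"region-north", "region-south"}

# Hypothetical users, keyed by name, valued by the place they belong to.
USERS = {"alice": "family-x", "bob": "family-y"}
DOCS = ["report-1", "report-2"]


def ancestors(doc_id):
    """Yield doc_id followed by each of its ancestors up to the root."""
    while doc_id is not None:
        yield doc_id
        doc_id = PARENT.get(doc_id)


def region_of(doc_id):
    """Return the first ancestor that is a region, or None if there is none."""
    return next((a for a in ancestors(doc_id) if a in REGIONS), None)


def is_relevant(doc_id, user_place):
    """A document is relevant if the user's place is in its ancestor chain."""
    return user_place in ancestors(doc_id)


def replicate(docs, users, by_region):
    """Count filter evaluations and collect what each user would receive.

    With by_region=False every user tests every document (one server).
    With by_region=True a user only tests documents held by their own
    regional replica.
    """
    calls = 0
    delivered = {user: [] for user in users}
    for user, place in users.items():
        for doc in docs:
            if by_region and region_of(doc) != region_of(place):
                continue  # this document never reaches the user's replica
            calls += 1
            if is_relevant(doc, place):
                delivered[user].append(doc)
    return calls, delivered
```

In this toy data both layouts deliver the same documents, but the regional layout runs the filter half as often. The sketch ignores users who sit above the region level (they would need the master or several replicas), which is one of the edge cases we would have to think through.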

On your example though, to be clear, assigning users to replicas is something that I have
to manage myself, correct? Do you know if a particular user needs to stay on the same replica
or if I could just dumbly direct them to any existing node? Naively I’d think that I could
do the latter, but I’ve noticed one-way replication seems to involve passing some metadata
back to the server (Pouch does this, though I’ve never really looked into what it’s sending
or what Couch does with it), so it’s not clear how stateful this kind of thing is.
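
To illustrate why statefulness would matter, here is a toy model. It assumes, without my having verified it, that the metadata is a checkpoint recording how far a given replication has progressed on that server. The `Replica` class, the `sticky_replica` helper, and the node names are all hypothetical and are not CouchDB or PouchDB APIs.

```python
import hashlib

REPLICAS = ["replica-1", "replica-2"]  # hypothetical node names


def sticky_replica(user_id, replicas=REPLICAS):
    """Map a user to the same replica every time by hashing the user id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]


class Replica:
    """Toy server: an ordered change log plus per-replication checkpoints."""

    def __init__(self, changes):
        self.changes = list(changes)  # ordered document ids
        self.checkpoints = {}         # replication id -> number of changes sent

    def pull(self, replication_id):
        """Return changes since the stored checkpoint, then record a new one."""
        since = self.checkpoints.get(replication_id, 0)
        batch = self.changes[since:]
        self.checkpoints[replication_id] = len(self.changes)
        return batch
```

Under that assumption, a user who pulls twice from the same node gets an empty second batch, while a user bounced to a different node with the same data starts again from the beginning, which would mean re-running the filter over the whole change log. If that is how it really works, something like `sticky_replica` would be preferable to dumbly picking any node.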

Cheers,
Stefan

> On 25 May 2016, at 09:51, Pedro Narciso García Revington <p.revington@gmail.com> wrote:
> 
> Because couchdb supports master master replication you can alter your
> schema to:
> 
> master couchdb → couchdb replica 1 → some clients
>                → couchdb replica 2 → some other clients
> 
> So you can distribute the load between the replicas.
> 
> 2016-05-25 10:34 GMT+02:00 Stefan du Fresne <stefan@medicmobile.org>:
> 
>> Hello all,
>> 
>> I work on an app that involves a large amount of CouchDB filtered
>> replication (every user has a filtered subset of the DB locally via
>> PouchDB). Currently filtered replication is our number 1 performance
>> bottleneck for rolling out to more users, and I'm trying to work out where
>> we can go from here.
>> 
>> Our current setup is one CouchDB database and N PouchDB installations,
>> which all two-way replicate, with the CouchDB->PouchDB replication being
>> filtered based on user permissions / relevance [1].
>> 
>> Our issue is that as we add users a) total document creation velocity
>> increases, and b) the proportion of documents that are relevant to any
>> particular user decreases. These two points cause replication, both
>> initial onboarding and continual, to take longer and longer.
>> 
>> At this stage we are being forced to manually limit the number of users we
>> onboard at any particular time to half a dozen or so, or risk CouchDB being
>> unresponsive [2]. As we'd want to be onboarding 50-100 at any particular
>> time due to how we're rolling out, you can imagine that this is pretty
>> painful.
>> 
>> I have already re-written the filter in Erlang, which halved its execution
>> time, which is awesome!
>> 
>> I also attempted to simplify the filter to increase performance. However,
>> filter speed seems more dependent on the physical size of your filter as
>> opposed to what code executes, which makes writing a simple filter that can
>> fall-back to a complicated filter not terribly useful (see:
>> https://issues.apache.org/jira/browse/COUCHDB-3021).
>> 
>> If the above linked ticket is fixed (if it can be) this would make our
>> filter 3-4x faster again. However, this still wouldn't address the
>> fundamental issue that filtered replication is very CPU-intensive, and so
>> as noted above doesn't seem to scale terribly well.
>> 
>> Ideally then, I would like to remove filtered replication completely, but
>> there does not seem to be a good alternative right now.
>> 
>> Looking through the archives there was talk of adding view replication,
>> see:
>> https://mail-archives.apache.org/mod_mbox/couchdb-user/201307.mbox/%3CCAJNb-9pK4CVRHNwr83_DXCn%2B2_CZXgwDzbK3m_G2pdfWjSsFMA%40mail.gmail.com%3E
>> , but it doesn't look like this ever got resolved.
>> 
>> There is also often talk of databases per user being a good scaling
>> strategy, but we're basically doing that already (with PouchDB), and for
>> us documents aren't owned / viewed by just one person so this does not get
>> us away from filtered replication (e.g. a supervisor replicates her documents
>> as well as N sub-users' documents). There are potentially wild and crazy
>> schemes that involve many different databases where the equivalent of
>> filtering is expressed in replication relationships, but this would add a
>> massive amount of complexity to our app, and I’m not even convinced it
>> would work as there are lots of edge cases to consider.
>> 
>> Does anyone know of anything else I can try to increase replication
>> performance? Or to safeguard against many replicators unacceptably
>> degrading CouchDB performance? Does Couch 2.0 address any of these concerns?
>> 
>> Thanks in advance,
>> - Stefan du Fresne
>> 
>> [1] security is handled by not exposing couch and going through a wrapper
>> service that validates couch requests, relevance is hierarchy based (i.e.
>> documents you or your subordinates are authors of are replicated to you)
>> [2] there are also administrators / configurers that access couchdb
>> directly

