couchdb-user mailing list archives

From Stefan du Fresne <ste...@medicmobile.org>
Subject The state of filtered replication
Date Wed, 25 May 2016 08:34:52 GMT
Hello all,

I work on an app that involves a large amount of CouchDB filtered replication (every user
has a filtered subset of the DB locally via PouchDB). Currently filtered replication is our
number 1 performance bottleneck for rolling out to more users, and I'm trying to work out
where we can go from here.

Our current setup is one CouchDB database and N PouchDB installations, which all two-way replicate,
with the CouchDB->PouchDB replication being filtered based on user permissions / relevance
[1].
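
For readers unfamiliar with this kind of setup: a filtered pull replication is driven by the source database's `_changes` feed, called with a `filter` parameter plus whatever extra parameters the filter function reads. The sketch below only builds such a URL, as a rough illustration. The host, the database name `appdb`, the filter name `app/by_user` and the `user_id` parameter are hypothetical placeholders, not names from the app described here.

```python
from urllib.parse import urlencode


def filtered_changes_url(base_url, db, filter_name, since, **filter_params):
    """Build the _changes URL a pulling replicator would request.

    `filter_name` is "<design doc>/<filter function>"; any extra keyword
    arguments are passed through as query parameters for the filter to read.
    """
    query = {"filter": filter_name, "since": since, "style": "all_docs"}
    query.update(filter_params)
    return f"{base_url}/{db}/_changes?{urlencode(query)}"


# Hypothetical values, for illustration only.
url = filtered_changes_url(
    "http://localhost:5984", "appdb", "app/by_user", 0, user_id="alice"
)
print(url)
```

CouchDB has to run the filter function against every change it considers for such a request, which is where the CPU cost discussed below comes from.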

Our issue is that as we add users, a) total document creation velocity increases, and b) the
proportion of documents that are relevant to any particular user decreases. These two points
cause replication, both initial onboarding and continuous, to take longer and longer.
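
A back-of-the-envelope model of why these two effects compound, as a sketch only. It assumes each user contributes the same number of documents and that the filter runs once per document for each replicating user; the figure of 500 documents per user is made up for illustration.

```python
def filter_evaluations(num_users, docs_per_user):
    """Filter calls needed for every user to do one full initial replication.

    Simplifying assumptions: every user contributes `docs_per_user` documents,
    and the filter runs once per document per replicating user.
    """
    total_docs = num_users * docs_per_user
    return num_users * total_docs


def relevant_fraction(num_users, relevant_users=1):
    """Share of all documents that one user actually needs, assuming they
    only need documents from `relevant_users` of the `num_users` users."""
    return relevant_users / num_users


# Illustrative numbers only: 500 documents per user is an assumption.
for n in (10, 100, 1000):
    print(n, filter_evaluations(n, 500), relevant_fraction(n))
```

Under these assumptions the filtering work grows with the square of the user count, while the share of that work that yields a document the user needs shrinks in proportion to it.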

At this stage we are being forced to manually limit the number of users we onboard at any
particular time to half a dozen or so, or risk CouchDB being unresponsive [2]. As we'd want
to be onboarding 50-100 at any particular time due to how we're rolling out, you can imagine
that this is pretty painful.

I have already re-written the filter in Erlang, which halved its execution time, which is
awesome!

I also attempted to simplify the filter to increase performance. However, filter speed seems
more dependent on the physical size of your filter as opposed to what code executes, which
makes writing a simple filter that can fall back to a complicated filter not terribly useful
(see: https://issues.apache.org/jira/browse/COUCHDB-3021).

If the above linked ticket is fixed (if it can be) this would make our filter 3-4x faster
again. However, this still wouldn't address the fundamental issue that filtered replication
is very CPU-intensive, and so as noted above doesn't seem to scale terribly well.

Ideally then, I would like to remove filtered replication completely, but there does not seem
to be a good alternative right now.

Looking through the archives there was talk of adding view replication, see:
https://mail-archives.apache.org/mod_mbox/couchdb-user/201307.mbox/%3CCAJNb-9pK4CVRHNwr83_DXCn%2B2_CZXgwDzbK3m_G2pdfWjSsFMA%40mail.gmail.com%3E
but it doesn't look like this ever got resolved.

There is also often talk of databases per user being a good scaling strategy, but we're basically
doing that already (with PouchDB), and for us documents aren't owned / viewed by just one
person, so this does not get us away from filtered replication (e.g. a supervisor replicates
her documents as well as N sub-users' documents). There are potentially wild and crazy schemes
that involve many different databases where the equivalent of filtering is expressed in replication
relationships, but this would add a massive amount of complexity to our app, and I’m not
even convinced it would work as there are lots of edge cases to consider.

Does anyone know of anything else I can try to increase replication performance? Or to safeguard
against many replicators unacceptably degrading CouchDB performance? Does CouchDB 2.0 address
any of these concerns?

Thanks in advance,
- Stefan du Fresne

[1] security is handled by not exposing couch and going through a wrapper service that validates
couch requests, relevance is hierarchy based (i.e. documents you or your subordinates are
authors of are replicated to you)
[2] there are also administrators / configurers that access couchdb directly
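
The relevance rule in [1] can be sketched roughly as follows. This is an illustration of the idea only, not the actual filter: the `author` field and the `reports_to` mapping (user to direct supervisor) are hypothetical names.

```python
def subordinates(user, reports_to):
    """Return `user` plus everyone below them in the hierarchy.

    `reports_to` maps each user to their direct supervisor (None at the top).
    """
    result = {user}
    changed = True
    while changed:
        changed = False
        for person, boss in reports_to.items():
            if boss in result and person not in result:
                result.add(person)
                changed = True
    return result


def is_relevant(doc, user, reports_to):
    """True if the document was authored by `user` or one of their subordinates."""
    return doc.get("author") in subordinates(user, reports_to)


# Hypothetical hierarchy: "sup" supervises "a" and "b"; "a" supervises "c".
reports_to = {"sup": None, "a": "sup", "b": "sup", "c": "a"}
print(is_relevant({"author": "c"}, "sup", reports_to))
print(is_relevant({"author": "b"}, "a", reports_to))
```

A real filter would most likely look up a precomputed set of relevant authors rather than walk the hierarchy for every document.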