couchdb-user mailing list archives

From Stefan du Fresne <ste...@medicmobile.org>
Subject Re: The state of filtered replication
Date Wed, 25 May 2016 10:45:00 GMT
Hi Sinan,

It’s good to hear we are not the only people to have this problem, and that we’re following
in the footsteps of others :-)

More databases is definitely an option, but it’s one we’re trying to avoid, partly
to keep the budget under control and partly because we’d be constantly scaling up and
down: most of our performance concerns arise only while we’re onboarding, so it would
add a lot of extra complexity.

Unfortunately (2) isn’t an option for us either: our PouchDB clients are on slow phones
with slow, flaky, and expensive network connections (think health workers in remote parts
of Uganda), so reducing what we send them to the bare minimum is very important. We also
shouldn’t really send them other people’s data, even to have Pouch then filter it out,
for privacy reasons.

Stefan
> On 25 May 2016, at 09:55, Sinan Gabel <sinan.gabel@gmail.com> wrote:
> 
> Hi Stefan,
> 
> I recognise your description and problem: I also gave up on server-side
> filter performance. With CouchDB 1.6.1 I saw only two immediate options:
> 
> (1) Use more databases on the server side, to reduce the number of docs per
> database.
> (2) Do the filtering on the client side in PouchDB instead; this is actually
> quite fast and robust. Experiment to find the best settings for the
> *batch_size* and *timeout* options (see the sketch below).
> 
> For (2), possibly combine it with https://github.com/nolanlawson/worker-pouch
> if there are a lot of documents.
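> 
> Something like this rough sketch (all names are hypothetical; a
> function-valued filter is applied by PouchDB on the client, after the
> documents arrive):
> 
>     // Optional: worker-pouch backs the local database with a web worker,
>     // which helps keep the UI responsive with a lot of documents.
>     PouchDB.plugin(require('worker-pouch'));
> 
>     var local = new PouchDB('app', { adapter: 'worker' });
> 
>     local.replicate.from('https://couch.example.com/app', {
>       live: true,
>       retry: true,
>       batch_size: 100,  // experiment with this
>       timeout: 60000,   // and this (ms)
>       filter: function (doc) {
>         return isRelevant(doc); // placeholder for the app's own check
>       }
>     });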
> 
> 
> ... however, it would be best to have a much faster, production-grade
> server-side filtering option in CouchDB 2.x.
> 
> 
> Br,
> Sinan
> 
> On 25 May 2016 at 10:34, Stefan du Fresne <stefan@medicmobile.org> wrote:
> 
>> Hello all,
>> 
>> I work on an app that involves a large amount of CouchDB filtered
>> replication (every user has a filtered subset of the DB locally via
>> PouchDB). Currently filtered replication is our number 1 performance
>> bottleneck for rolling out to more users, and I'm trying to work out where
>> we can go from here.
>> 
>> Our current setup is one CouchDB database and N PouchDB installations,
>> which all two-way replicate, with the CouchDB->PouchDB replication being
>> filtered based on user permissions / relevance [1].
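>> 
>> Concretely, the setup looks roughly like the sketch below (the database,
>> filter, and parameter names are placeholders, not our real ones):
>> 
>>     var local = new PouchDB('app');
>>     var remote = 'https://couch.example.com/app';
>> 
>>     // PouchDB -> CouchDB: unfiltered; push everything the user creates.
>>     local.replicate.to(remote, { live: true, retry: true });
>> 
>>     // CouchDB -> PouchDB: filtered server-side by a design-doc filter
>>     // that encodes permissions / relevance.
>>     local.replicate.from(remote, {
>>       live: true,
>>       retry: true,
>>       filter: 'app/docs_by_user',               // placeholder ddoc/filter
>>       query_params: { user_id: currentUserId }  // placeholder parameter
>>     });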
>> 
>> Our issue is that as we add users a) total document creation velocity
>> increases, and b) the proportion of documents that are relevant to any
>> particular user decreases. These two points cause replication -- both
>> initial onboarding and ongoing sync -- to take longer and longer.
>> 
>> At this stage we are forced to manually limit the number of users we
>> onboard at any one time to half a dozen or so, or risk CouchDB becoming
>> unresponsive [2]. As we'd want to be onboarding 50-100 at a time due to
>> how we're rolling out, you can imagine that this is pretty painful.
>> 
>> I have already rewritten the filter in Erlang, halving its execution
>> time, which is awesome!
>> 
>> I also attempted to simplify the filter to increase performance. However,
>> filter speed seems to depend more on the physical size of the filter than on
>> what code actually executes, which makes writing a simple filter that can
>> fall back to a complicated filter not terribly useful (see
>> https://issues.apache.org/jira/browse/COUCHDB-3021).
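>> 
>> To make that concrete, the idea was roughly the sketch below (the owner
>> and contact_chain fields and the user_id parameter are placeholders, not
>> our real schema); per the ticket, the cheap branch does not actually run
>> faster, because filter cost tracks source size:
>> 
>>     function (doc, req) {
>>       // Cheap check: the requesting user authored the document.
>>       if (doc.owner === req.query.user_id) {
>>         return true;
>>       }
>>       // Expensive fall-back: walk the supervision hierarchy on the doc.
>>       var chain = doc.contact_chain || [];
>>       for (var i = 0; i < chain.length; i++) {
>>         if (chain[i] === req.query.user_id) {
>>           return true;
>>         }
>>       }
>>       return false;
>>     }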
>> 
>> If the above linked ticket is fixed (if it can be) this would make our
>> filter 3-4x faster again. However, this still wouldn't address the
>> fundamental issue that filtered replication is very CPU-intensive, and so
>> as noted above doesn't seem to scale terribly well.
>> 
>> Ideally, then, I would like to remove filtered replication completely, but
>> there does not seem to be a good alternative right now.
>> 
>> Looking through the archives there was talk of adding view replication, see
>> https://mail-archives.apache.org/mod_mbox/couchdb-user/201307.mbox/%3CCAJNb-9pK4CVRHNwr83_DXCn%2B2_CZXgwDzbK3m_G2pdfWjSsFMA%40mail.gmail.com%3E
>> but it doesn't look like this ever got resolved.
>> 
>> There is also often talk of databases-per-user being a good scaling
>> strategy, but we're basically doing that already (with PouchDB), and for
>> us documents aren't owned / viewed by just one person, so this does not get
>> us away from filtered replication (e.g. a supervisor replicates her own
>> documents as well as N sub-users' documents). There are potentially wild
>> and crazy schemes involving many different databases, where the equivalent
>> of filtering is expressed in replication relationships, but this would add
>> a massive amount of complexity to our app, and I'm not even convinced it
>> would work, as there are lots of edge cases to consider.
>> 
>> Does anyone know of anything else I can try to increase replication
>> performance? Or to safeguard against many replicators unacceptably
>> degrading CouchDB performance? Does CouchDB 2.0 address any of these
>> concerns?
>> 
>> Thanks in advance,
>> - Stefan du Fresne
>> 
>> [1] Security is handled by not exposing CouchDB directly: requests go
>> through a wrapper service that validates them. Relevance is hierarchy-based
>> (i.e. documents that you or your subordinates authored are replicated to
>> you).
>> [2] There are also administrators / configurers that access CouchDB
>> directly.

