incubator-couchdb-user mailing list archives

From Dave Cottlehuber <...@jsonified.com>
Subject Re: Refactoring a CouchDB.
Date Thu, 17 Jan 2013 13:32:23 GMT
On 17 January 2013 13:08, Tim Hankins <timchankins@gmail.com> wrote:
> Hi,
>
> I'm a student programmer at the IT University of Copenhagen, and have
> inherited a CouchDB application, which I'm having trouble scaling. I
> believe it may need to be refactored.
>
> Specifically, the problem seems to be coming from the use of Filtered
> Replication. (All user documents are stored in the same database, and
> replicating from server to client requires filtered replication.)

Yes, this seems very likely.

The main constraint is that replication filters need to be run per
document, per replication. So N replications require N full passes
through all the documents, i.e. roughly N^2 work as the document count
grows with the user count. And in your case most of those documents
will never be replicated to any given user.
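
For reference, such a filter typically looks something like this
(assuming, hypothetically, that each doc carries a user_id field and
the replication passes the user in via query_params):

    function(doc, req) {
      // CouchDB calls this once per document, per replication --
      // with one shared DB, this is the N^2 cost described above
      return doc.user_id === req.query.user_id;
    }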

There are 2 main approaches, and a possible 3rd one:

1. Keep using the same replication approach, but create an additional
server-side per-user DB. Move each endpoint to access its private DB
only; unfiltered replication is then possible. Replicate all docs from
every private DB into a master DB, which gives you views across all
users, and set up one more replication into each server-side private
DB so it receives that user's generated data.
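
As a minimal sketch (DB names hypothetical), the private-to-master hop
is then just a plain _replicator doc with no filter at all:

    {
      "_id": "userdb-alice-to-master",
      "source": "userdb-alice",
      "target": "master",
      "continuous": true
    }

The reverse hop, bringing e.g. the Impact Factor docs back into each
private DB, is one more doc like it with source and target swapped --
that one still has to select per-user docs, but it runs once on the
server rather than once per phone.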

2. Use named document replication to transfer only the required
documents from the master DB to the endpoint DB. To identify which
documents need to be transferred per user, create a view keyed by the
user's name. This avoids the N^2 filter pass above, because the
per-document work is done once, inside the view; see the sketch below.

http://wiki.apache.org/couchdb/Replication#Named_Document_Replication
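
A rough sketch of that view (again assuming a user_id field):

    // map function in a design doc, e.g. _design/per-user
    function(doc) {
      if (doc.user_id) {
        emit(doc.user_id, null); // the doc _id comes back in each row
      }
    }

Query it with ?key="alice" to collect the ids, then POST a one-shot
replication that names just those documents (ids hypothetical):

    {
      "source": "master",
      "target": "http://192.168.0.2:5984/patientdata",
      "doc_ids": ["subjective_alice_17_01_2013",
                  "impactfactor_alice_17_01_2013"]
    }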

You can probably use the update_seq (available in the JSON properties
of http://couch:5984/dbname/) as a checkpoint of where you were up to,
but I can think of a few corner cases where this might come back and
haunt you later. Anybody else want to comment?
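
For reference, the DB info that holds it looks roughly like this
(values invented; in 1.x update_seq is an integer you can store and
compare between runs):

    {
      "db_name": "master",
      "doc_count": 123456,
      "update_seq": 789012
    }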

3. Do something (handwavey) with the _changes feed to ensure each
document only needs to be processed/filtered once, and use this to
avoid an on-disk view. You will still need to keep, somewhere, a list
of the documents that have already been sent to a specific client, so
I'm not sure this beats 1 or 2. But maybe your specific use case can
accommodate it more easily.
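
For what it's worth, the feed itself is easy to consume incrementally
(sequence values invented):

    GET /master/_changes?since=789012

    {
      "results": [
        {"seq": 789013,
         "id": "subjective_alice_17_01_2013",
         "changes": [{"rev": "1-abc123"}]}
      ],
      "last_seq": 789013
    }

The hard part, as above, is recording which of those ids went to which
client.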

> Each day, the android client application collects two kinds of data.
> Subjective and Objective. Subjective data are manually entered by patients.
> Objective data are gathered from the phone's sensors.
>
> Subjective and Objective data are stored in their own couch documents, and
> have IDs that include the user's ID, the document type, and the date in a
> "DD_MM_YYYY" format. They are replicated once a day by placing replication
> docs in the "_replicator" database.

That sounds like overkill for a single document id. Ideally you keep
your doc ids short (as they're used everywhere as, well, ids) and put
the extra info into separate fields within the document. You can
easily create a view to reconstruct that same composite format from
the document's fields if it's absolutely required.
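
A sketch of what that could look like (field names hypothetical; let
CouchDB assign its own short uuid as the _id):

    {
      "type": "subjective",
      "user_id": "alice",
      "date": "2013-01-17"
    }

and a view to rebuild the old composite key wherever something still
depends on it:

    function(doc) {
      // key rows by the same user/type/date triple the old ids encoded
      emit([doc.user_id, doc.type, doc.date], null);
    }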

> Once replicated to the server, these documents are...
>     1). Used as input to a data mining algorithm.
>     2). Displayed on a web page. (Users can see their own data, and
> clinicians can see the data for all users.)
>
> The data mining algorithm produces a new CouchDB document for each user
> every day, which we call an "Impact Factor" document. (It looks at each
> user's historical objective and subjective data, and looks for
> correlations.)

Cool! This sounds impressive.

> Replication: Replication takes place from client to server, and from server
> to client.
>     1). Client to server: This seems to be working fine.
>     2). Server to client: This is what's broken.
>
> Two things have to be replicated from server to client.
>     1). Each user's subjective data for the past 14 days.
>     2). Each user's Impact Factor document for the current day.
>
> Since all user documents are stored in the same database, we use filtered
> replication to send the right docs to the right users.

From a privacy perspective, I would always use a per-user server-side
DB as the replication endpoint.

A Wise Man once said "always plan that all your data gets replicated,
everywhere". It only takes one slipup to share confidential
information across patients/customers.

> The problem is that this filter function takes too long. (>10 minutes)
>     1). To test whether the filter function is crashing, I replicated the
> entire DB to another un-loaded machine, and it seems to run just fine.
> (Well it takes about 2.5 minutes, but it doesn't crash.)
>     2). I've tried re-writing the filter function in Erlang, but haven't
> managed to get it working.
>
> And besides, I suspect that the way the DB is structured is just not suited
> to the job.
>
> So, to summarize...
>     - Android client phones produce new CouchDB docs and replicate them to
> the server.
>     - One central CouchDB holds all users.
>     - Both individual and group data are served to web pages.
>     - A data mining algorithm processes this data on a per-user basis.
>     - Subjective data and Impact Factor data documents are replicated from
> the server to each client phone.
>
> Is there a way to structure the DB so that users can replicate without the
> need for filters, but which preserves the ability of clinicians to see an
> overview of all users? (It's my understanding that views can't be run *
> across* databases.)

In summary: either turn the N^2 filter problem into an O(N)
pre-calculated view, or use a per-user DB.

And ideally do both, if disk space and other constraints allow.

> Well, as before, any suggestions or pointers would be much appreciated.
>
> Cheers,
> Tim.

A+
Dave
