incubator-couchdb-user mailing list archives

From Tim Hankins <timchank...@gmail.com>
Subject Re: Refactoring a CouchDB.
Date Fri, 18 Jan 2013 11:42:44 GMT
Thanks all! Your suggestions are extremely valuable. I really appreciate
the help!

If I find that I need a little more clarification, I'll be sure to ask, but
for now I'm going to roll up my sleeves, and get to work.

Cheers,
Tim.


On Thu, Jan 17, 2013 at 2:32 PM, Dave Cottlehuber <dch@jsonified.com> wrote:

> On 17 January 2013 13:08, Tim Hankins <timchankins@gmail.com> wrote:
> > Hi,
> >
> > I'm a student programmer at the IT University of Copenhagen, and have
> > inherited a CouchDB application, which I'm having trouble scaling. I
> > believe it may need to be refactored.
> >
> > Specifically, the problem seems to be coming from the use of Filtered
> > Replication. (All user documents are stored in the same database, and
> > replicating from server to client requires filtered replication.)
>
> Yes, this seems very likely.
>
> The main constraint is that replication filters need to be run per
> document, per replication. So N replications require N passes through
> all the documents, i.e. roughly N^2 filter calls when the number of
> documents grows with the number of users. And in your case most of the
> documents will not be replicated to a given user.
>
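
A rough way to see the growth described above is to count filter calls under a simplified model. This is only a sketch: it assumes every user holds the same number of documents and that each replication scans the whole shared DB once; the function names are made up for illustration.

```python
def filtered_replication_calls(num_users: int, docs_per_user: int) -> int:
    """Filter calls when each user's replication scans the whole shared DB."""
    total_docs = num_users * docs_per_user
    # One filter call per document, repeated for every user's replication.
    return num_users * total_docs


def indexed_once_calls(num_users: int, docs_per_user: int) -> int:
    """Documents processed when each one is indexed a single time by owner."""
    return num_users * docs_per_user


# Assumed workload: 30 documents per user.
for users in (10, 100, 1000):
    print(users, filtered_replication_calls(users, 30), indexed_once_calls(users, 30))
```

The first count grows with the square of the number of users, the second only linearly, which is the gap the approaches below try to close.
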
> There are 2 main approaches, and a possible 3rd one:
>
> 1. keep using the same replication approach, but create an additional
> server-side per-user DB. Move all endpoints to access their private DB
> only, so that non-filtered replication becomes possible. Replicate all
> docs from the private DBs into a master DB, for views across all
> users, and implement an additional replication on the server-side
> private DBs to retrieve their data.
>
> 2. use named document replication to transfer only the required
> documents from the master DB to the endpoint DB. To identify which
> documents need to be transferred per user, create a view keyed by the
> user's name. You can therefore avoid the N^2 filter pass above, as
> the work will be done once within the view.
>
> http://wiki.apache.org/couchdb/Replication#Named_Document_Replication
>
> You can probably use the update_seq (available in the json properties
> of http://couch:5984/dbname/ ) as a checkpoint of where you are / were
> up to, but I can think of a few corner cases where this might come
> back and haunt you later. Anybody else want to comment?
>
> 3. Do something (handwavey) with the _changes feed to ensure each
> document only needs to be processed/filtered once, and use this to
> avoid an on-disk view. You'll then need to keep, somewhere, a list of
> the documents that have been sent to a specific client, so I'm not
> sure that 1 or 2 aren't already better. But maybe your specific use
> case can work with this more easily.
>
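
Approach 2 above can be sketched as follows. This is a minimal illustration, not a tested setup: the view rows are hard-coded in the shape CouchDB returns them (`id`, `key`, `value`), the database names are hypothetical, and in practice the view would be queried with the user's name as the key instead of being filtered in memory. The `source`, `target` and `doc_ids` fields are the ones a replication document accepts for named document replication.

```python
import json


def doc_ids_for_user(view_rows, user):
    """Collect the ids of all view rows whose key is the given user name."""
    return sorted({row["id"] for row in view_rows if row["key"] == user})


def named_replication_doc(source, target, doc_ids):
    """Build a replication document that transfers only the listed ids."""
    return {"source": source, "target": target, "doc_ids": list(doc_ids)}


# Hard-coded stand-in for the rows of a view keyed by user name.
rows = [
    {"id": "a1", "key": "alice", "value": None},
    {"id": "b1", "key": "bob", "value": None},
    {"id": "a2", "key": "alice", "value": None},
]

# "master" and "alice_phone_db" are assumed names, not from the thread.
doc = named_replication_doc(
    "http://couch:5984/master",
    "alice_phone_db",
    doc_ids_for_user(rows, "alice"),
)
print(json.dumps(doc, indent=2))
```

The resulting document could be placed in the "_replicator" database in the same way as the existing filtered replication docs.
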
> > Each day, the android client application collects two kinds of data.
> > Subjective and Objective. Subjective data are manually entered by
> > patients. Objective data are gathered from the phone's sensors.
> >
> > Subjective and Objective data are stored in their own couch documents,
> > and have IDs that include the user's ID, the document type, and the
> > date in a "DD_MM_YYYY" format. They are replicated once a day by
> > placing replication docs in the "_replicator" database.
>
> That sounds like overkill for a single document id. Ideally you keep
> your doc ids short (as they're used everywhere as, well, ids) and put
> the extra info into separate fields within the document. You can
> easily create a view to reconstruct that same data format from the
> document if absolutely required.
>
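
The suggestion above might look like this in practice. It is a sketch under assumptions: the field names `user`, `type` and `date` are invented here, and the exact order and separator of the original composite id are not given in the thread, so the `user_type_DD_MM_YYYY` layout is a guess.

```python
import uuid
from datetime import date


def make_doc(user_id: str, doc_type: str, day: date) -> dict:
    """Short opaque id; the descriptive parts live in ordinary fields."""
    return {
        "_id": uuid.uuid4().hex,
        "user": user_id,          # assumed field name
        "type": doc_type,         # assumed field name
        "date": day.isoformat(),  # ISO dates sort chronologically as strings
    }


def legacy_id(doc: dict) -> str:
    """Rebuild a composite id in the assumed user_type_DD_MM_YYYY layout."""
    day = date.fromisoformat(doc["date"])
    return f'{doc["user"]}_{doc["type"]}_{day.strftime("%d_%m_%Y")}'


doc = make_doc("u42", "subjective", date(2013, 1, 17))
print(doc["_id"], legacy_id(doc))
```

Note that a random id no longer guarantees one document per user, type and day, which a composite id gives for free; whether that matters depends on the application.
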
> > Once replicated to the server, these documents are...
> >     1). Used as input to a data mining algorithm.
> >     2). Displayed on a web page. (Users can see their own data, and
> > clinicians can see the data for all users.)
> >
> > The data mining algorithm produces a new CouchDB document for each user
> > every day, which we call an "Impact Factor" document. (It looks at each
> > user's historical objective and subjective data, and looks for
> > correlations.)
>
> Cool! This sounds impressive.
>
> > Replication: Replication takes place from client to server, and from
> > server to client.
> >     1). Client to server: This seems to be working fine.
> >     2). Server to client: This is what's broken.
> >
> > Two things have to be replicated from server to client.
> >     1). Each user's subjective data for the past 14 days.
> >     2). Each user's Impact Factor document for the current day.
> >
> > Since all user documents are stored in the same database, we use filtered
> > replication to send the right docs to the right users.
>
> From a privacy perspective, I would always use a per-user server-side
> DB as the replication endpoint.
>
> A Wise Man once said "always plan that all your data gets replicated,
> everywhere". It only takes one slipup to share confidential
> information across patients/customers.
>
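
A per-user endpoint as described above means the server-side replications need no filter at all. The sketch below builds one unfiltered replication document per user, from that user's private DB into a shared DB used for cross-user views. The `userdb_` prefix, the `master` name and the `_id` scheme are assumptions for illustration; `source`, `target` and `continuous` are standard replication document fields.

```python
def private_db_name(user: str) -> str:
    # Hypothetical naming scheme; CouchDB database names must be lowercase.
    return f"userdb_{user.lower()}"


def to_master_replication(user: str, master: str = "master") -> dict:
    """Unfiltered replication doc: one user's private DB into the master DB."""
    private = private_db_name(user)
    return {
        "_id": f"{private}_to_{master}",
        "source": private,
        "target": master,
        "continuous": True,
    }


replication_docs = [to_master_replication(u) for u in ("alice", "bob")]
for doc in replication_docs:
    print(doc["_id"], ":", doc["source"], "->", doc["target"])
```

Because each phone only ever talks to its own private DB, a mistake in a filter can no longer leak one patient's documents to another.
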
> > The problem is that this filter function takes too long (> 10 minutes).
> >     1). To test whether the filter function is crashing, I replicated the
> > entire DB to another un-loaded machine, and it seems to run just fine.
> > (Well it takes about 2.5 minutes, but it doesn't crash.)
> >     2). I've tried re-writing the filter function in Erlang, but haven't
> > managed to get it working.
> >
> > And besides, I suspect that the way the DB is structured is just not
> > suited to the job.
> >
> > So, to summarize...
> >     - Android client phones produce new CouchDB docs and replicate them
> > to the server.
> >     - One central CouchDB holds all users.
> >     - Both individual and group data are served to web pages.
> >     - A data mining algorithm processes this data on a per-user basis.
> >     - Subjective data and Impact Factor data documents are replicated
> > from the server to each client phone.
> >
> > Is there a way to structure the DB so that users can replicate without
> > the need for filters, but which preserves the ability of clinicians to
> > see an overview of all users? (It's my understanding that views can't
> > be run *across* databases.)
>
> In summary, either turn the N^2 filter problem into an O(N)
> pre-calculated view, or use a per-user DB.
>
> And ideally do both, if disk & other constraints are feasible.
>
> > Well, as before, any suggestions or pointers would be much appreciated.
> >
> > Cheers,
> > Tim.
>
> A+
> Dave
>
