Date: Thu, 17 Jan 2013 08:41:11 -0500
Subject: Re: Refactoring a CouchDB.
From: Robert Newson <rnewson@apache.org>
To: "user@couchdb.apache.org" <user@couchdb.apache.org>

Just a reminder that filtered replication still benefits from
checkpointing (i.e., it's incremental as usual). The performance
difference between unfiltered and filtered replication is the cost of
evaluating the filter function through couchjs.


On 17 January 2013 08:32, Dave Cottlehuber <dch@jsonified.com> wrote:
> On 17 January 2013 13:08, Tim Hankins <timchankins@gmail.com> wrote:
>> Hi,
>>
>> I'm a student programmer at the IT University of Copenhagen, and have
>> inherited a CouchDB application, which I'm having trouble scaling. I
>> believe it may need to be refactored.
>>
>> Specifically, the problem seems to be coming from the use of Filtered
>> Replication. (All user documents are stored in the same database, and
>> replicating from server to client requires filtered replication.)
>
> Yes, this seems very likely.
>
> The main constraint is that replication filters need to be run per
> document, per replication. So N replications require N passes through
> all the documents, i.e. roughly N^2 filter calls when the number of
> documents grows with the number of users. And in your case most of
> the documents will not be replicated to a given user.
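
To put rough numbers on that, here is a minimal sketch of the cost model. It assumes every user has about the same number of documents and that each replication is on its first, non-checkpointed run; the function names and the figure of 30 documents per user are made up for illustration.

```python
def filter_evaluations(n_users: int, docs_per_user: int) -> int:
    """Filter calls when every user's replication scans the whole shared DB."""
    total_docs = n_users * docs_per_user
    # One replication per user, and each one passes every document
    # through the filter function.
    return n_users * total_docs


def view_evaluations(n_users: int, docs_per_user: int) -> int:
    """Map calls when a view indexes each document exactly once."""
    return n_users * docs_per_user


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, filter_evaluations(n, 30), view_evaluations(n, 30))
```

The filter count grows with the square of the user count, while the view count grows linearly. Later, checkpointed runs only see new changes, so the gap is widest on the initial replication.
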
>
> There are 2 main approaches, and a possible 3rd one:
>
> 1. Keep using the same replication approach, but create an additional
> server-side per-user DB. Point each endpoint at its private DB only,
> so that non-filtered replication becomes possible. Replicate all docs
> from each private DB into a master DB, for views across all users,
> and add a further server-side replication so that each private DB
> receives that user's data.
>
> 2. Use named document replication to transfer only the required
> documents from the master DB to the endpoint DB. To identify which
> documents need to be transferred per user, create a view that emits
> the user's name as the key for each document. You can therefore avoid
> the N^2 filter pass above, as the work is done once within the view.
>
> http://wiki.apache.org/couchdb/Replication#Named_Document_Replication
>
> You can probably use the update_seq (available in the json properties
> of http://couch:5984/dbname/ ) as a checkpoint of where you are / were
> up to, but I can think of a few corner cases where this might come
> back and haunt you later. Anybody else want to comment?
>
> 3. Do something (handwavey) with the _changes feed to ensure each
> document only needs to be processed/filtered once, and use this to
> avoid an on-disk view. You'll then need to keep a list somewhere of
> the documents that have been sent to a specific client, so I'm not
> sure that 1 or 2 aren't already better. But maybe your specific use
> case can work with this more easily.
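
Approach 2 above could be sketched roughly as follows. This only builds the request body; it assumes a view whose rows carry the user name as "key" and the document id as "id" (the standard view row shape), and the database names "master" and "alice_phone" are placeholders, not anything from the original setup.

```python
from collections import defaultdict


def doc_ids_by_user(view_rows):
    """Group document ids by the user name emitted as the view key."""
    grouped = defaultdict(list)
    for row in view_rows:
        grouped[row["key"]].append(row["id"])
    return dict(grouped)


def named_replication_body(source, target, doc_ids):
    """Body for POST /_replicate that copies only the listed documents."""
    return {"source": source, "target": target, "doc_ids": sorted(doc_ids)}


if __name__ == "__main__":
    # Hypothetical rows, shaped like the "rows" array of a view response.
    rows = [
        {"id": "a", "key": "alice", "value": None},
        {"id": "b", "key": "bob", "value": None},
        {"id": "c", "key": "alice", "value": None},
    ]
    ids = doc_ids_by_user(rows)
    print(named_replication_body("master", "alice_phone", ids["alice"]))
```

The resulting dict can be sent as JSON to /_replicate with any HTTP client, or stored as a document in the "_replicator" database.
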
>
>> Each day, the Android client application collects two kinds of data:
>> subjective and objective. Subjective data are manually entered by
>> patients. Objective data are gathered from the phone's sensors.
>>
>> Subjective and Objective data are stored in their own couch documents, and
>> have IDs that include the user's ID, the document type, and the date in a
>> "DD_MM_YYYY" format. They are replicated once a day by placing replication
>> docs in the "_replicator" database.
>
> That sounds like overkill for a single document id. Ideally you keep
> your doc ids short (as they're used everywhere as, well, ids) and put
> the extra info into separate fields within the document. You can
> easily create a view to reconstruct that same data format from the
> document if absolutely required.
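
As an illustration of that suggestion, here is a small sketch in Python. The field names ("user", "type", "date") and the exact layout of the old composite id are assumptions, since the thread only says the id contains the user's ID, the document type and a "DD_MM_YYYY" date. In CouchDB itself the reconstruction would live in a view's map function; this just shows the shape of the data.

```python
from datetime import date


def make_doc(user_id: str, doc_type: str, day: date) -> dict:
    """Document body with the metadata in fields instead of in the id.

    No "_id" is set here, so CouchDB would assign a UUID when the
    document is POSTed to the database.
    """
    return {"user": user_id, "type": doc_type, "date": day.isoformat()}


def legacy_key(doc: dict) -> str:
    """Rebuild the old composite identifier from the fields.

    The user_type_DD_MM_YYYY ordering is a guess at the original format.
    """
    day = date.fromisoformat(doc["date"])
    return "_".join([doc["user"], doc["type"], day.strftime("%d_%m_%Y")])


if __name__ == "__main__":
    doc = make_doc("u42", "subjective", date(2013, 1, 17))
    print(doc, legacy_key(doc))
```

Storing the date in ISO form also makes it sort correctly as a view key, which "DD_MM_YYYY" does not.
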
>
>> Once replicated to the server, these documents are...
>>     1). Used as input to a data mining algorithm.
>>     2). Displayed on a web page. (Users can see their own data, and
>> clinicians can see the data for all users.)
>>
>> The data mining algorithm produces a new CouchDB document for each user
>> every day, which we call an "Impact Factor" document. (It looks at each
>> user's historical objective and subjective data, and looks for
>> correlations.)
>
> Cool! This sounds impressive.
>
>> Replication: Replication takes place from client to server, and from server
>> to client.
>>     1). Client to server: This seems to be working fine.
>>     2). Server to client: This is what's broken.
>>
>> Two things have to be replicated from server to client.
>>     1). Each user's subjective data for the past 14 days.
>>     2). Each user's Impact Factor document for the current day.
>>
>> Since all user documents are stored in the same database, we use filtered
>> replication to send the right docs to the right users.
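
Because the ids are built from the user, type and date, the set of documents a client needs on a given day can in principle be computed without running any filter. A minimal sketch, assuming a hypothetical user_type_DD_MM_YYYY id layout, hypothetical type names, and that "the past 14 days" means today plus the 13 days before it:

```python
from datetime import date, timedelta


def legacy_id(user_id: str, doc_type: str, day: date) -> str:
    """Hypothetical id layout; the real separator and ordering may differ."""
    return f"{user_id}_{doc_type}_{day.strftime('%d_%m_%Y')}"


def ids_for_client(user_id: str, today: date, days: int = 14) -> list:
    """Ids of the subjective docs for the last `days` days, newest first,
    followed by today's Impact Factor doc."""
    subjective = [
        legacy_id(user_id, "subjective", today - timedelta(days=n))
        for n in range(days)
    ]
    impact = legacy_id(user_id, "impact_factor", today)
    return subjective + [impact]


if __name__ == "__main__":
    for doc_id in ids_for_client("u42", date(2013, 1, 17)):
        print(doc_id)
```

A list like this could be passed as "doc_ids" in a replication request, so only those documents are considered.
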
>
> From a privacy perspective, I would always use a per-user server-side
> DB as the replication endpoint.
>
> A Wise Man once said "always plan that all your data gets replicated,
> everywhere". It only takes one slipup to share confidential
> information across patients/customers.
>
>> The problem is that this filter function takes too long (>10 minutes).
>>     1). To test whether the filter function is crashing, I replicated the
>> entire DB to another un-loaded machine, and it seems to run just fine.
>> (Well it takes about 2.5 minutes, but it doesn't crash.)
>>     2). I've tried re-writing the filter function in Erlang, but haven't
>> managed to get it working.
>>
>> And besides, I suspect that the way the DB is structured is just not suited
>> to the job.
>>
>> So, to summarize...
>>     - Android client phones produce new CouchDB docs and replicate them to
>> the server.
>>     - One central CouchDB holds all users.
>>     - Both individual and group data are served to web pages.
>>     - A data mining algorithm processes this data on a per-user basis.
>>     - Subjective data and Impact Factor data documents are replicated from
>> the server to each client phone.
>>
>> Is there a way to structure the DB so that users can replicate without the
>> need for filters, but which preserves the ability of clinicians to see an
>> overview of all users? (It's my understanding that views can't be run
>> *across* databases.)
>
> In summary, either turn the N^2 filter problem into an O(N)
> pre-calculated view, or use a per-user DB.
>
> And ideally do both, if disk & other constraints are feasible.
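
For the per-user DB option, the server-side plumbing might look something like the sketch below. It only builds "_replicator" documents; the "userdb_" prefix, the "master" database name and the "_id" scheme are invented for the example, and it assumes user ids contain only letters and digits (CouchDB database names must be lowercase and start with a letter).

```python
def user_db_name(user_id: str) -> str:
    """Hypothetical naming scheme for a user's private server-side DB."""
    return "userdb_" + user_id.lower()


def to_master_replication(user_id: str, master: str = "master",
                          continuous: bool = False) -> dict:
    """A _replicator document copying one user's private DB into the
    master DB, unfiltered, so clinicians can query views across users."""
    db = user_db_name(user_id)
    doc = {"_id": f"{db}_to_{master}", "source": db, "target": master}
    if continuous:
        # Keep the replication running instead of a one-off pass.
        doc["continuous"] = True
    return doc


if __name__ == "__main__":
    for user in ("Alice", "Bob"):
        print(to_master_replication(user))
```

Each phone would then replicate, without a filter, against its own private DB only, while the master DB carries the cross-user views.
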
>
>> Well, as before, any suggestions or pointers would be much appreciated.
>>
>> Cheers,
>> Tim.
>
> A+
> Dave