From: Dave Cottlehuber
To: user@couchdb.apache.org
Date: Thu, 17 Jan 2013 14:32:23 +0100
Subject: Re: Refactoring a CouchDB.

On 17 January 2013 13:08, Tim Hankins wrote:
> Hi,
>
> I'm a student programmer at the IT University of Copenhagen, and have
> inherited a CouchDB application, which I'm having trouble scaling. I
> believe it may need to be refactored.
>
> Specifically, the problem seems to be coming from the use of Filtered
> Replication. (All user documents are stored in the same database, and
> replicating from server to client requires filtered replication.)

Yes, this seems very likely. The main constraint is that replication
filters need to be run per document, per replication. So N replications
require N passes through all the documents, i.e. N^2. And in your case
most of the documents will not be replicated to a given user.

There are 2 main approaches, and a possible 3rd one:

1. Keep using the same replication approach, but create an additional
server-side per-user DB. Move all endpoints to access their private DB
only, and non-filtered replication is now possible. Use a master DB that
replicates all docs from the private DBs to the master DB, for views
across all users, and implement an additional replication on the
server-side private DBs to retrieve their data.

2.
Use named document replication to transfer only the required documents
from the master DB to the endpoint DB. To identify what documents need
to be transferred per user, create a view that exposes only the user's
name. You can therefore avoid the N^2 filter pass above, as it will be
done once within the view.

http://wiki.apache.org/couchdb/Replication#Named_Document_Replication

You can probably use the update_seq (available in the JSON properties of
http://couch:5984/dbname/ ) as a checkpoint of where you are / were up
to, but I can think of a few corner cases where this might come back and
haunt you later. Anybody else want to comment?

3. Do something (handwavey) with the _changes feed to ensure each
document only needs to be processed/filtered once, and use this to avoid
an on-disk view. Somewhere you'll then need a list of the documents that
have been sent to a specific client, so I'm not sure that 1 or 2 aren't
already better. But maybe your specific use case can work with this more
easily.

> Each day, the android client application collects two kinds of data.
> Subjective and Objective. Subjective data are manually entered by patients.
> Objective data are gathered from the phone's sensors.
>
> Subjective and Objective data are stored in their own couch documents, and
> have IDs that include the user's ID, the document type, and the date in a
> "DD_MM_YYYY" format. They are replicated once a day by placing replication
> docs in the "_replicator" database.

That sounds like overkill for a single document id. Ideally you keep
your doc ids short (as they're used everywhere as, well, ids) and put
the extra info into separate fields within the document. You can easily
create a view to reconstruct that same data format from the document if
absolutely required.

> Once replicated to the server, these documents are...
> 1). Used as input to a data mining algorithm.
> 2). Displayed on a web page.
>     (Users can see their own data, and
> clinicians can see the data for all users.)
>
> The data mining algorithm produces a new CouchDB document for each user
> every day, which we call an "Impact Factor" document. (It looks at each
> user's historical objective and subjective data, and looks for
> correlations.)

Cool! This sounds impressive.

> Replication: Replication takes place from client to server, and from server
> to client.
> 1). Client to server: This seems to be working fine.
> 2). Server to client: This is what's broken.
>
> Two things have to be replicated from server to client.
> 1). Each user's subjective data for the past 14 days.
> 2). Each user's Impact Factor document for the current day.
>
> Since all user documents are stored in the same database, we use filtered
> replication to send the right docs to the right users.

From a privacy perspective, I would always use a per-user server-side DB
as the replication endpoint. A Wise Man once said "always plan that all
your data gets replicated, everywhere". It only takes one slipup to
share confidential information across patients/customers.

> The problem is that this filter function takes too long. ( >10minutes)
> 1). To test whether the filter function is crashing, I replicated the
> entire DB to another un-loaded machine, and it seems to run just fine.
> (Well it takes about 2.5 minutes, but it doesn't crash.)
> 2). I've tried re-writing the filter function in ERLANG, but haven't
> managed to get it working.
>
> And besides, I suspect that the way the DB is structured is just not suited
> to the job.
>
> So, to summarize...
> - Android client phones produce new CouchDB docs and replicate them to
> the server.
> - One central CouchDB holds all users.
> - Both individual and group data are served to web pages.
> - A data mining algorithm processes this data on a per-user basis.
> - Subjective data and Impact Factor data documents are replicated from
> the server to each client phone.
>
> Is there a way to structure the DB so that users can replicate without the
> need for filters, but which preserves the ability of clinicians to see an
> overview of all users? (It's my understanding that views can't be run
> *across* databases.)

In summary, either turn the N^2 filter problem into an O(N)
pre-calculated view, or use a per-user DB. And ideally do both, if disk
& other constraints allow.

> Well, as before, any suggestions or pointers would be much appreciated.
>
> Cheers,
> Tim.

A+
Dave
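
PS. A minimal sketch of approach 2 (named document replication), in
Python. It only builds the JSON body; the result would be POSTed to
/_replicate or stored as a document in the "_replicator" database. The
view rows, database URLs and document ids below are made up for
illustration, and it assumes a view that emits the user's name as the
key for each of that user's documents.

```python
import json


def doc_ids_for_user(view_rows, user):
    """Collect the ids of the documents that belong to one user.

    view_rows is assumed to look like the "rows" array of a CouchDB view
    response, from a (hypothetical) view emitting the user's name as key.
    """
    return sorted({row["id"] for row in view_rows if row["key"] == user})


def named_replication_body(source, target, doc_ids):
    """Build the body of a named-document replication request."""
    return {"source": source, "target": target, "doc_ids": list(doc_ids)}


# Made-up rows standing in for a view response.
rows = [
    {"id": "doc-a", "key": "alice", "value": None},
    {"id": "doc-b", "key": "bob", "value": None},
    {"id": "doc-c", "key": "alice", "value": None},
]

body = named_replication_body(
    "http://couch:5984/master",  # hypothetical master DB
    "http://phone:5984/alice",   # hypothetical endpoint DB
    doc_ids_for_user(rows, "alice"),
)
print(json.dumps(body, indent=2))
```

The view is read once per user, so no filter function has to run over
every document for every replication.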