Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Received-SPF: pass (nike.apache.org: domain of timchankins@gmail.com
 designates 74.125.83.49 as permitted sender)
MIME-Version: 1.0
Date: Thu, 17 Jan 2013 13:08:01 +0100
Message-ID: 
 <CAHfPQSntOq-9YXszvectbbNrx5Nzgd_9_7Ajs=UHsQaqkdDX3w@mail.gmail.com>
Subject: Refactoring a CouchDB.
From: Tim Hankins <timchankins@gmail.com>
To: user@couchdb.apache.org
Content-Type: multipart/alternative; boundary=047d7b34413ca6752804d37ad8d1

--047d7b34413ca6752804d37ad8d1
Content-Type: text/plain; charset=ISO-8859-1

Hi,

I'm a student programmer at the IT University of Copenhagen, and have
inherited a CouchDB application, which I'm having trouble scaling. I
believe it may need to be refactored.

Specifically, the problem seems to come from the use of filtered
replication. (All user documents are stored in the same database, so
replicating from server to client requires filtered replication.)

I'm in the process of reading Chapter 23 of "O'Reilly: CouchDB - The
Definitive Guide" which deals with High Performance, and "O'Reilly: Scaling
CouchDB". Any other suggestions about the following would be greatly
appreciated!

Some background...

The system is part of a clinical trial undertaken by the ITU and the Danish
State Hospital. It aims to help patients with bipolar disorder manage their
disease. It is composed of:
    1). 100+ Android phones running a client application and Couchbase
Mobile.
    2). A web server backed by CouchDB.

Each day, the Android client application collects two kinds of data:
subjective and objective. Subjective data are manually entered by patients.
Objective data are gathered from the phone's sensors.

Subjective and objective data are stored in their own CouchDB documents, and
have IDs that include the user's ID, the document type, and the date in a
"DD_MM_YYYY" format. They are replicated once a day by placing replication
docs in the "_replicator" database.
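
To make the ID scheme and the daily push concrete, here is a minimal
sketch in Python. The exact field order and separator in the IDs, the
names "local_db" and "push_<user>", and the server URL are all
assumptions for illustration; only "source" and "target" are standard
"_replicator" fields.

```python
from datetime import date


def make_doc_id(user_id: str, doc_type: str, day: date) -> str:
    # Assumed layout: <user>_<type>_<DD_MM_YYYY>. The real application
    # may order or separate these parts differently.
    return f"{user_id}_{doc_type}_{day.strftime('%d_%m_%Y')}"


def make_push_replication_doc(user_id: str, server_url: str) -> dict:
    # A document like this would be saved into the phone's "_replicator"
    # database to push local data to the server.
    # "local_db" and the "_id" naming are hypothetical placeholders.
    return {
        "_id": f"push_{user_id}",
        "source": "local_db",
        "target": server_url,
    }
```

One thing this makes visible: with a "DD_MM_YYYY" suffix, IDs for the same
user and type do not sort chronologically, so date ranges cannot be
selected by ID range alone.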

Once replicated to the server, these documents are...
    1). Used as input to a data mining algorithm.
    2). Displayed on a web page. (Users can see their own data, and
clinicians can see the data for all users.)

The data mining algorithm produces a new CouchDB document for each user
every day, which we call an "Impact Factor" document. (It looks at each
user's historical objective and subjective data, and looks for
correlations.)

Replication: Replication takes place from client to server, and from server
to client.
    1). Client to server: This seems to be working fine.
    2). Server to client: This is what's broken.

Two things have to be replicated from server to client.
    1). Each user's subjective data for the past 14 days.
    2). Each user's Impact Factor document for the current day.

Since all user documents are stored in the same database, we use filtered
replication to send the right docs to the right users.
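
The decision such a filter has to make for every document can be sketched
as follows. This is written in Python only to show the logic (real CouchDB
filter functions are written in JavaScript or Erlang), and the document
fields "user", "type" and "date" are assumed names, not necessarily the
ones the application uses. The 14-day window is assumed to include the
current day.

```python
from datetime import date, datetime, timedelta


def should_replicate(doc: dict, user_id: str, today: date) -> bool:
    # Only documents belonging to the requesting user are candidates.
    if doc.get("user") != user_id or "date" not in doc:
        return False
    # Dates are assumed to be stored in the same DD_MM_YYYY form as the IDs.
    doc_day = datetime.strptime(doc["date"], "%d_%m_%Y").date()
    if doc.get("type") == "impact_factor":
        # Only the Impact Factor document for the current day.
        return doc_day == today
    if doc.get("type") == "subjective":
        # Subjective data for the past 14 days, today included.
        return timedelta(0) <= today - doc_day < timedelta(days=14)
    # Objective data and anything else stays on the server.
    return False
```

Because the server evaluates a check like this against every change in the
shared database for each of the 100+ phones, the cost grows with both the
number of users and the total number of documents.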

The problem is that this filter function takes too long (more than 10
minutes).
    1). To test whether the filter function is crashing, I replicated the
entire DB to another machine with no load, and there it seems to run just
fine. (Well, it takes about 2.5 minutes, but it doesn't crash.)
    2). I've tried rewriting the filter function in Erlang, but haven't
managed to get it working.

And besides, I suspect that the way the DB is structured is just not suited
to the job.

So, to summarize...
    - Android client phones produce new CouchDB docs and replicate them to
the server.
    - One central CouchDB holds all users.
    - Both individual and group data are served to web pages.
    - A data mining algorithm processes this data on a per-user basis.
    - Subjective data and Impact Factor data documents are replicated from
the server to each client phone.

Is there a way to structure the DB so that users can replicate without the
need for filters, but which preserves the ability of clinicians to see an
overview of all users? (It's my understanding that views can't be run
*across* databases.)
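
One possible direction that keeps the single shared database, sketched
here as an untested idea: since the document IDs are built from the user
ID, type and date, the IDs each phone needs can be computed in advance and
listed in the "doc_ids" field of a replication document, which may avoid
running a filter function at all. The ID layout, "local_db" and the "_id"
naming are assumptions; "source", "target" and "doc_ids" are standard
"_replicator" fields.

```python
from datetime import date, timedelta


def pull_doc_ids(user_id: str, today: date, days: int = 14) -> list:
    # Assumed ID layout: <user>_<type>_<DD_MM_YYYY>.
    # Subjective documents for the past `days` days, today included.
    ids = [
        f"{user_id}_subjective_{(today - timedelta(days=n)).strftime('%d_%m_%Y')}"
        for n in range(days)
    ]
    # Plus the Impact Factor document for the current day.
    ids.append(f"{user_id}_impact_factor_{today.strftime('%d_%m_%Y')}")
    return ids


def make_pull_replication_doc(user_id: str, server_url: str, today: date) -> dict:
    # "local_db" and the "_id" naming are hypothetical placeholders.
    return {
        "_id": f"pull_{user_id}_{today.isoformat()}",
        "source": server_url,
        "target": "local_db",
        "doc_ids": pull_doc_ids(user_id, today),
    }
```

With this approach each phone asks for at most 15 named documents per day,
and the clinicians' views continue to run over the one central database.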

Well, as before, any suggestions or pointers would be much appreciated.

Cheers,
Tim.

--047d7b34413ca6752804d37ad8d1--