To: user@couchdb.apache.org
From: Nicolas Jessus
Subject: Re: Forcing document reindex
Date: Wed, 17 Nov 2010 18:00:00 +0000 (UTC)
Hello Cliff,

> I am not sure if I fully understand your use case (however it does sound
> intriguing and unusual).

Sorry, I'll try to be clearer. I should have started with a real case; I just didn't want to be needlessly verbose (failed!).

Consider five types of documents:

  type: Meeting
  _id: M1
  meetingProposalID: MP1
  date: 2010-09-09

  type: MeetingProposal
  _id: MP1
  projectPartID: PP1
  date: 2010-10-10

  type: ProjectPart
  _id: PP1
  projectID: P1

  type: Project
  _id: P1
  clientID: C1

  type: Client
  _id: C1
  name: John

ProjectPart could be denormalised into Project, but let's ignore that.

Say I would like to know the average time between a meeting proposal and the actual meeting, per client, to see what kind of delay I should expect. This is a simple report; others are much more complex, so I'm really looking to solve the general-case problem.

Naively, the view key should be something like [clientName, dateMP1, dateM1], or maybe [clientName] with a value of [dateMP1, dateM1]. There can be hundreds of thousands of meetings. The problem is generating that key triplet when there is no common ID shared across the documents.

> I assume that you are getting data out of your legacy MySQL system using
> complex joins?

Yes, although the joins aren't complex; the data model is pretty straightforward, with docs mostly in a chain.

> Have you considered totally denormalising your data and input data to
> couchdb based on the output of your MySQL reports?
Yes, but that would not really work: each document can still be updated on its own, with maybe a few thousand updates a day. That is not much, but it is enough to cause massive locking if the documents are massively denormalised.

> Perhaps couchdb-lucene (or my current fav of the moment, elasticsearch,
> which is also based on lucene) would be useful?

I have already set it up, and modified it to make simple doc joins without fuss, which is good enough for run-of-the-mill searching. It wouldn't solve the million-doc-pull problem, though, and joining is obviously pretty slow.

But thanks for the proposals :)
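As an aside, here is roughly what the chain walk behind that [clientName, dateMP1, dateM1] triplet looks like when done client-side in plain JavaScript. Since a CouchDB map function only ever sees one document at a time, this join can't happen inside a view; the docs array and the byId/keyForMeeting helpers below are purely illustrative, using the example documents from earlier.

```javascript
// Example documents from the thread, held in memory for illustration.
const docs = [
  { type: "Meeting",         _id: "M1",  meetingProposalID: "MP1", date: "2010-09-09" },
  { type: "MeetingProposal", _id: "MP1", projectPartID: "PP1",     date: "2010-10-10" },
  { type: "ProjectPart",     _id: "PP1", projectID: "P1" },
  { type: "Project",         _id: "P1",  clientID: "C1" },
  { type: "Client",          _id: "C1",  name: "John" },
];

// Index documents by _id for O(1) chain lookups.
const byId = {};
for (const doc of docs) byId[doc._id] = doc;

// Walk Meeting -> MeetingProposal -> ProjectPart -> Project -> Client
// and build the [clientName, proposalDate, meetingDate] key triplet.
function keyForMeeting(meeting) {
  const proposal = byId[meeting.meetingProposalID];
  const part     = byId[proposal.projectPartID];
  const project  = byId[part.projectID];
  const client   = byId[project.clientID];
  return [client.name, proposal.date, meeting.date];
}

const keys = docs.filter(d => d.type === "Meeting").map(keyForMeeting);
console.log(keys); // [["John", "2010-10-10", "2010-09-09"]]
```

With hundreds of thousands of meetings this is exactly the million-doc pull the thread is about: every document in the chain has to be fetched before a single key can be emitted.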