Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Received-SPF: pass (athena.apache.org: domain of b.candler@pobox.com
 designates 208.72.237.25 as permitted sender)
Date: Mon, 22 Jun 2009 10:07:35 +0100
From: Brian Candler <B.Candler@pobox.com>
To: Daniel =?iso-8859-1?Q?Tr=FCmper?= <truemped@googlemail.com>
Cc: user@couchdb.apache.org
Subject: Re: 'Grouping' documents so that a set of documents is passed to the
 view function
Message-ID: <20090622090735.GB8538@uk.tiscali.com>
References: <234B2543-875F-47DB-B870-B583D2E2B3B7@googlemail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <234B2543-875F-47DB-B870-B583D2E2B3B7@googlemail.com>
User-Agent: Mutt/1.5.17+20080114 (2008-01-14)

On Fri, Jun 19, 2009 at 09:43:31AM +0200, Daniel Trümper wrote:
> Hi,
>
> I am somewhat new to CouchDB but have been doing some stuff with it and 
> this is my first post to the list so pardon if I am wrong :)
>
>
>> It would be really cool if there were some way to pass all the docs  
>> with a value of 1 for group_key to a single map function call, so I  
>> could do computation across those related documents and emit the  
>> results ...  I'm just using the magic group_key attribute as an  
>> example, if such a feature were to actually be made I'd think you'd  
>> define a javascript function which returned a single groupping k to  
>> exist I
> I think this is what the reduce function is for.

No, I'm afraid it's not.

The OP wants to calculate information across a group of related documents.
CouchDB does not guarantee that all the related documents will be passed to
the reduce function at the same time. It may pass documents (d1,d2,d3) to
the reduce function to generate Rx, then pass (d4,d5,d6) to the reduce
function to generate Ry, then (d7,d8,d9) to generate Rz, then pass
(Rx,Ry,Rz) to the re-reduce function to generate the final R value.

If the documents sharing the key were, e.g., d3 and d4, then you would not be
able to process them together, as they would not be presented to the reduce
function at the same time.
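
To make the split concrete, here is a small simulation in Python. It is only
a sketch of the behaviour described above, not CouchDB code: the chunk size
of 3, the group keys and the counting reduce function are all assumptions
chosen to match the d1..d9 example. It records what each reduce call is
given and checks whether any single call ever sees d3 and d4 together.

```python
# Hypothetical documents as (doc id, group_key); d3 and d4 share group "b".
docs = [("d1", "a"), ("d2", "a"), ("d3", "b"),
        ("d4", "b"), ("d5", "c"), ("d6", "c"),
        ("d7", "d"), ("d8", "d"), ("d9", "d")]

calls = []  # every (rereduce, values) pair the reduce function was given


def reduce_fn(values, rereduce):
    """Count documents; on rereduce, add up the partial counts."""
    calls.append((rereduce, list(values)))
    return sum(values) if rereduce else len(values)


def run_view(docs, chunk_size=3):
    """Reduce fixed-size chunks, then rereduce the partial results.

    The chunk size is an assumption for illustration; the real boundaries
    are chosen by the database, not by the view author.
    """
    partials = [reduce_fn(docs[i:i + chunk_size], False)
                for i in range(0, len(docs), chunk_size)]
    return reduce_fn(partials, True)


total = run_view(docs)

# Did any plain reduce call receive both d3 and d4?
seen_together = any(
    not rereduce and {"d3", "d4"} <= {doc_id for doc_id, _ in values}
    for rereduce, values in calls
)
print(total, seen_together)
```

The total comes out right, because counting survives being split, but no
call ever holds d3 and d4 at once, so any logic that needs to compare them
has nowhere to run.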

Using a grouped reduce query (i.e. group=true) is better, but a large set of
documents sharing the same group key is still likely to be split into
several reductions followed by a re-reduce. The OP was talking about ~100
documents sharing this key, so they may well be split this way.

Furthermore, CouchDB optimises its reductions by storing, in each B-tree
node, the reduced value for all the documents within that node. For example,
suppose you have

   +-------------+  +-------------+  +-------------+
   | d1 d2 d3 Rx |  | d4 d5 d6 Ry |  | d7 d8 d9 Rz |
   +-------------+  +-------------+  +-------------+

Then you make a reduce query for the key range which includes documents d2
to d8 inclusive (or a grouped query where d2 to d8 share the same group
key). CouchDB will calculate:

  R1 = Reduce(d2,d3)
  R2 = Reduce(d7,d8)
  R  = Rereduce(R1,Ry,R2)

That is: the already-reduced value of Ry=Reduce(d4,d5,d6) is reused without
recomputation. So the reduce function doesn't see documents d4 to d6 again.
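
The same thing as a Python sketch, again only a toy model under stated
assumptions: three nodes of three documents each, a counting reduce, and a
hypothetical query() helper that reuses a node's stored value whenever the
whole node falls inside the requested range. None of these names come from
CouchDB itself.

```python
# Toy model of the three B-tree nodes in the diagram above.
nodes = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]

reduce_calls = []  # the document lists passed to reduce_fn


def reduce_fn(docs):
    """Count documents, remembering what we were shown."""
    reduce_calls.append(list(docs))
    return len(docs)


def rereduce_fn(values):
    """Combine already-reduced values."""
    return sum(values)


# Index time: each node stores its own reduction (Rx, Ry, Rz).
cached = [reduce_fn(node) for node in nodes]
reduce_calls.clear()  # from here on, only track query-time calls


def query(start, end):
    """Reduce over the inclusive range start..end (hypothetical helper)."""
    partials = []
    for node, stored in zip(nodes, cached):
        inside = [d for d in node if start <= d <= end]
        if len(inside) == len(node):
            partials.append(stored)             # whole node: reuse Ry as-is
        elif inside:
            partials.append(reduce_fn(inside))  # partial node: reduce afresh
    return rereduce_fn(partials)


result = query("d2", "d8")
print(result, reduce_calls)
```

At query time reduce_fn is only called for (d2,d3) and (d7,d8); the middle
node contributes its stored value and its documents are never shown to the
function again.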

So in summary: you cannot rely on the reduce function to be able to process
adjacent documents. You *must* do this sort of processing client-side.
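
A minimal sketch of the client-side approach in Python, assuming a map-only
view that emits the group key as the row key, so that rows arrive sorted by
key. The rows below are made-up sample data in the usual id/key/value shape,
and process_groups() and spread() are hypothetical names for illustration;
fetching the rows over HTTP is left out.

```python
from itertools import groupby

# Made-up rows, as a map-only view emitting (doc.group_key, doc.amount)
# might return them: already sorted by key.
rows = [
    {"id": "d1", "key": 1, "value": 10},
    {"id": "d2", "key": 1, "value": 30},
    {"id": "d3", "key": 2, "value": 5},
    {"id": "d4", "key": 2, "value": 7},
    {"id": "d5", "key": 2, "value": 9},
]


def process_groups(rows, fn):
    """Apply fn to the complete list of values for each key.

    Relies on rows being sorted by key, because groupby only merges
    adjacent items with equal keys.
    """
    results = {}
    for key, group in groupby(rows, key=lambda row: row["key"]):
        results[key] = fn([row["value"] for row in group])
    return results


def spread(values):
    """Example computation that needs the whole group at once."""
    return max(values) - min(values)


results = process_groups(rows, spread)
print(results)
```

Here the client, not the database, decides where each group starts and ends,
so fn always receives every document for a key in one call.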

HTH,

Brian.