Date: Mon, 30 Mar 2009 09:47:28 +0100
From: Brian Candler
To: Tom McNulty
Cc: user@couchdb.apache.org
Subject: Re: Reduce Assumptions
Message-ID: <20090330084727.GA7913@uk.tiscali.com>
In-Reply-To: <9B391D82-C05A-4DC5-AEEA-3038AD5C7941@cetiforge.com>

On Sat, Mar 28, 2009 at 07:38:24PM -0600, Tom McNulty wrote:
> my map function produces output like:
>
> [X, Y, 0] -> Object_A
> [X, Y, 1] -> Object_B1
> [X, Y, 1] -> Object_B1
> [X, Y, 1] -> Object_B1
> [Z, Q, 0] ....
>
> Here I apply group_level=2, and use a ranged query ( [X, 0] to [X, [] ] )
> since Y >= 0

Aside: you can use [X,null] to [X,{}] and then it doesn't matter about the
value of Y

> Now during the reduce phase, I combine together Object_A's and
> associated Object_B's. Can I assume that the first of the values sent to
> 'reduce' is Object_A?

I think not, because on a large database objects to be reduced will be sent
to your reduce function in batches, and these batches will be broken up on
B-tree boundaries, which may occur in arbitrary places.

e.g. your reduce function may receive [Object_A, Object_B1] and then in a
separate invocation [Object_B1, Object_B1]

Furthermore: due to reduce optimisations, you may only receive some of the
blocks to be reduced. Example: take these three Btree nodes:

  [a b c d e f g] [h i j k l m n] [o p q r s t u]
        R1              R2              R3

The reduce value of all the items in each Btree node is stored within each
node, e.g. [a b c d e f g] reduces to R1.

Now suppose someone asks for a reduce value across a key range:

           key range
           <----------------------------->
  [a b c d e f g] [h i j k l m n] [o p q r s t u]

As I understand it, CouchDB will call your reduce function to calculate a
value for [e f g] and for [o p q r], but will use the existing
stored/calculated value of R2 across the middle block.
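
To make the key-range example concrete, here is a toy model in Python of the behaviour described above. It is only a sketch of my understanding, not CouchDB's actual implementation: the names leaves, stored, reduce_fn and range_reduce are made up for illustration, and the reduce function simply counts rows.

```python
# Three hypothetical B-tree nodes; each letter stands for one view row.
leaves = [list("abcdefg"), list("hijklmn"), list("opqrstu")]

calls = []  # records every batch of raw rows passed to the reduce function


def reduce_fn(values, rereduce=False):
    """An order-independent reduce: count the rows."""
    if rereduce:
        return sum(values)       # values are partial counts
    calls.append("".join(values))
    return len(values)           # values are raw rows


# Index time: one reduction is stored per node (R1, R2, R3).
stored = [reduce_fn(leaf) for leaf in leaves]
calls.clear()  # from here on, only record query-time calls


def range_reduce(start, end):
    """Reduce over start..end, reusing stored values for fully covered nodes."""
    partials = []
    for leaf, stored_value in zip(leaves, stored):
        hits = [v for v in leaf if start <= v <= end]
        if not hits:
            continue
        if len(hits) == len(leaf):
            partials.append(stored_value)     # whole node in range: reuse R2
        else:
            partials.append(reduce_fn(hits))  # partial node: reduce the hits
    return reduce_fn(partials, rereduce=True)


total = range_reduce("e", "r")
print(total, calls)
```

In this model the query-time calls see only the batches "efg" and "opqr"; the middle node is never passed to the reduce function again, so nothing in it can be inspected or cross-referenced at query time.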
Therefore, it is wrong to attempt to maintain any sort of state in your
reduce function between invocations. And because the Btree node boundaries
can appear in any place, it is wrong to attempt to cross-reference adjacent
documents too.

So I believe this sort of processing needs to take place in the client, not
in a reduce function.

Regards,

Brian.
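
A minimal sketch of the client-side approach, assuming the view is queried with reduce=false and the rows arrive as key/value pairs shaped like the map output quoted above. The sample rows and the combine helper are hypothetical; "X", "Y" and the object names are placeholders.

```python
from collections import defaultdict

# Hypothetical rows, shaped like the "key"/"value" pairs a view query returns.
rows = [
    {"key": ["X", "Y", 0], "value": "Object_A"},
    {"key": ["X", "Y", 1], "value": "Object_B1"},
    {"key": ["X", "Y", 1], "value": "Object_B1"},
    {"key": ["X", "Y", 1], "value": "Object_B1"},
]


def combine(rows):
    """Group rows by the first two key elements, pairing each A with its Bs."""
    groups = defaultdict(lambda: {"a": None, "bs": []})
    for row in rows:
        *prefix, kind = row["key"]      # kind is 0 for Object_A, 1 for Object_B
        group = groups[tuple(prefix)]
        if kind == 0:
            group["a"] = row["value"]
        else:
            group["bs"].append(row["value"])
    return dict(groups)


combined = combine(rows)
print(combined)
```

Because the grouping looks at the third key element rather than at the position of a row, it does not depend on how the rows happen to be batched or ordered.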