From: "Jason Eggleston" <jason@eggnet.com>
To: couchdb-user@incubator.apache.org
Subject: Re: Map Reduce aggregate queries
Date: Wed, 30 Apr 2008 19:26:02 -0700
Message-ID: <1d579a700804301926w72f5d58eye1060e06e8285c5f@mail.gmail.com>
In-Reply-To: <72DC24A3-B9B9-46D0-9CE2-4A5EA2222375@apache.org>
Jan,

Thanks for replying. Just to be clear (and I'm 90% sure you see my point as it was intended), the functionality I'm asking about is, I'm pretty sure, outside both the Google paper and Damien's papers on map / reduce / combine. In that sense, combine() was a poor choice of words on my part; I just didn't know what else to call it. Maybe aggregate() is better. Essentially, the aggregate() function would in all likelihood be the same as reduce(), except that the "key" parameter would be invalid and couldn't be used in the function.

The reduce function in Damien's and Google's papers restricts computation to a single key. My main point is that this is a good thing, but since you have this b-tree index and an associative function that can take its own output as input, why not do both: do what reduce does now, but additionally keep applying the function all the way up the tree, so that you could extremely rapidly compute arbitrary aggregations over any range of keys. That's why I suggested a new function to do this additional aggregation, since the key itself shouldn't be an argument of the new function. It would be used to compute arbitrary aggregations of already-reduced data.

Here is an additional example. Assume the b-tree index has two layers: a root node and three leaf nodes. The root node would have three entries, and each of those entries would store the output of this new function applied to all of that entry's children. When querying the aggregate function, you'd have to process an aggregate for a fractional b-tree node at the start and end conditions, but for all of the nodes in between you could use the pre-computed aggregates, then compute the aggregate of the result. The aggregate function itself would generally be exactly the same as the reduce() function.
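To make the "takes its own output as input" property concrete, here is a minimal JavaScript sketch. The names (sumHours, the day variables) are illustrative only, not CouchDB's API; the point is just that an associative reduction can be re-applied to already-reduced values.

```javascript
// Illustrative only: an associative reduction whose output can be fed
// back in as input, which is what lets it be applied "up the tree".
function sumHours(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Leaf-level reduction per key (hours logged per day)...
var day1 = sumHours([1, 4]); // 5
var day2 = sumHours([7, 3]); // 10

// ...and the same function applied again to already-reduced values:
var both = sumHours([day1, day2]); // 15
console.log(both);
```

Because the function is associative, it makes no difference whether it sees raw values or previously reduced partial sums, which is exactly the property an interior b-tree node would rely on.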
Continuing with the simplistic hours-per-day reduction from my original example (but with better data), the b-tree index might look like this.

Interior and root node entries: {child keys are < this value, aggregate for child node, pointer to child node}
Leaf node entries: {key, value}

Root:
  2008-01-05, 41, Leaf 1
  2008-01-08, 24, Leaf 2
  null,       43, Leaf 3

Leaf 1:
  2008-01-01, 10
  2008-01-02, 14
  2008-01-03, 14
  2008-01-04, 3

Leaf 2:
  2008-01-05, 3
  2008-01-06, 12
  2008-01-07, 9

Leaf 3:
  2008-01-08, 1
  2008-01-09, 16
  2008-01-10, 17
  2008-01-11, 9

If we wanted the aggregate of 2008-01-02 to 2008-01-10:

  aggregate(
    aggregate(2008-01-02 to 2008-01-04 of Leaf 1),
    (all of Leaf 2 -- the answer is pre-computed in the parent node, i.e. the root),
    aggregate(2008-01-08 to 2008-01-10 of Leaf 3)
  )

Only the beginning and end of the range would potentially have to be computed on the fly, plus a final aggregate over everything. I believe it would be a rapid operation and could be done on the fly, but I could be wrong.

The problem with doing this client side is that the client doesn't have access to the b-tree index itself. You could work around this by creating multiple map / reduce indexes at different granularities to similar effect (i.e., have monthly, yearly, and decade-level views), but it seems a bit awkward.

I realize my suggestion here might shoot down this idea altogether :) but if this can be done in the index itself efficiently, I think there is a real win here.

Thanks,
-Jason

On Wed, Apr 30, 2008 at 4:06 AM, Jan Lehnardt wrote:
>
> On Apr 30, 2008, at 04:36, Jason Eggleston wrote:
>
> > Hello,
> >
> > I recently tried explaining the power of map / reduce to someone,
> > using a simple example of adding up hours spent on a project over a
> > period of time.
> >
> > The output of the mapping function would be:
> >
> > key, value
> >
> > 4/28/2008, 1
> > 4/28/2008, 4
> > 4/29/2008, 7
> > 4/29/2008, 3
> > ...
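The lookup described above can be sketched in JavaScript. This is a toy model, not CouchDB internals: sum stands in for the reduce/aggregate function, and the tree is just the three leaves and pre-computed root aggregates from the example.

```javascript
// Toy sketch of the proposal: each root entry carries a pre-computed
// aggregate of its child leaf, so a range query only re-aggregates the
// partially covered leaves at each end of the range.
function sum(values) { return values.reduce(function (a, b) { return a + b; }, 0); }

var leaves = [
  [["2008-01-01", 10], ["2008-01-02", 14], ["2008-01-03", 14], ["2008-01-04", 3]],
  [["2008-01-05", 3],  ["2008-01-06", 12], ["2008-01-07", 9]],
  [["2008-01-08", 1],  ["2008-01-09", 16], ["2008-01-10", 17], ["2008-01-11", 9]]
];

// Pre-computed aggregate per leaf, as stored in the root entries above.
var rootAggregates = leaves.map(function (leaf) {
  return sum(leaf.map(function (kv) { return kv[1]; }));
}); // [41, 24, 43]

function rangeAggregate(startKey, endKey) {
  var parts = [];
  leaves.forEach(function (leaf, i) {
    var first = leaf[0][0], last = leaf[leaf.length - 1][0];
    if (last < startKey || first > endKey) return; // leaf entirely outside range
    if (first >= startKey && last <= endKey) {
      parts.push(rootAggregates[i]);               // fully covered: use pre-computed value
    } else {
      parts.push(sum(leaf                          // fractional leaf: aggregate on the fly
        .filter(function (kv) { return kv[0] >= startKey && kv[0] <= endKey; })
        .map(function (kv) { return kv[1]; })));
    }
  });
  return sum(parts);                               // final aggregate over the pieces
}

console.log(rangeAggregate("2008-01-02", "2008-01-10")); // 89
```

Only Leaf 1 and Leaf 3 are scanned here; Leaf 2 is answered entirely from the root entry, which is the claimed win for wide ranges.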
> > The output of the reduce function would be:
> >
> > key, value
> >
> > 4/28/2008, 5
> > 4/29/2008, 10
> > ...
> > etc.
> >
> > Shouldn't I then be able to issue a query for an arbitrary summation
> > of some sequential portion of the output of reduce? For example, a
> > sum of all of the hours between two dates / keys? It's the same
> > reduction function, except applied across multiple keys; it could be
> > annotated in the b-tree index the same way the reduction per key is
> > done now (or will be done).
> >
> > /db/_view/sum_hours/count?start_combine=4/28/2008&end_combine=4/29/2009
> >
> > If it isn't safe to use the reduce function as the combine function,
> > it could be explicit:
> >
> > "views": {
> >   "all": "function (doc) {...snip...}",
> >   "count": {
> >     "map": "function (doc) {...snip...}",
> >     "reduce": "function (tag, counts) {...snip...}",
> >     "combine": "function (counts) {...snip...}"
> >   }
> > }
> >
> > It seems very powerful to me, if this functionality were available.
>
> If I understand Damien's "whitepaper(s)" correctly, the implementation
> of our reduce would be functionally equivalent to the combine you
> ask for. What I haven't seen is how to run reduce again on its own
> output. Damien?
>
> See http://damienkatz.net/2008/02/incremental_map.html and
> http://damienkatz.net/2008/02/incremental_map_1.html and the
> comments for details.
>
> Cheers
> Jan
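For illustration, the design-document sketch quoted above could be written out concretely as follows. The function bodies here are hypothetical fill-ins (the original "...snip..." bodies are not in the message), and "combine" is the proposed addition, not part of CouchDB's shipped view API.

```javascript
// Illustrative only: a filled-in version of the quoted design document.
// All function bodies are hypothetical; the original message elides them.
var designDoc = {
  views: {
    count: {
      map: "function (doc) { emit(doc.date, doc.hours); }",
      reduce: "function (tag, counts) { return sum(counts); }",
      // Proposed: key-free aggregation over already-reduced values.
      combine: "function (counts) { return sum(counts); }"
    }
  }
};

// The contract being proposed: combine must accept reduce's output
// (and its own output) as input, i.e. it is associative over partials.
function sum(values) { return values.reduce(function (a, b) { return a + b; }, 0); }
var combine = function (counts) { return sum(counts); };
console.log(combine([combine([5]), combine([10])])); // 15
```

Note that if combine satisfies this contract, it is "safe to use the reduce function as the combine function" in the sense the message asks about, which is why the two would generally be the same function.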