From: "Jason Eggleston" <jason@eggnet.com>
To: couchdb-user@incubator.apache.org
Subject: Re: Map Reduce aggregate queries
Date: Wed, 30 Apr 2008 19:26:02 -0700
Message-ID: <1d579a700804301926w72f5d58eye1060e06e8285c5f@mail.gmail.com>
In-Reply-To: <72DC24A3-B9B9-46D0-9CE2-4A5EA2222375@apache.org>
Jan,

Thanks for replying. Just to be clear (and I'm 90% sure you see my point as it was intended), the functionality I'm asking about is, I'm pretty sure, outside both the Google paper and Damien's papers on map / reduce / combine. In that sense, combine() was a poor choice of words on my part; I just didn't know what else to call it. Maybe aggregate() is better. Essentially, the aggregate() function would in all likelihood be the same as reduce(), except that the "key" parameter would be invalid and couldn't be used in the function.

The reduce function in Damien's and Google's papers restricts computation to a single key. My main point is that this is a good thing, but since you have this b-tree index and an associative function that can take its own output as input, why not do both: do what reduce does now, but additionally keep applying the function all the way up the tree, so that you could extremely rapidly compute arbitrary aggregations over any range of keys. That's why I suggested a new function to do this additional aggregation, since the key itself shouldn't be an argument of the new function. It would be used to compute arbitrary aggregations of already-reduced data.

Here is an additional example. Assume the b-tree index has two layers: a root node and three leaf nodes. The root node would have three entries, and each of those entries would store the output of this new function applied to all of that entry's children. When querying the aggregate function, you'd have to process an aggregate for a fractional b-tree node at the start and end conditions, but for all of the nodes in between you could use the pre-computed aggregates, then compute the aggregate of the result. The aggregate function itself would generally be exactly the same as the reduce() function.
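To make the "takes its own output as input" property concrete, here is a minimal JavaScript sketch. The names (sumHours, the day variables) are illustrative only, not CouchDB's API; the point is just that an associative reduction can be re-applied to already-reduced values.

```javascript
// Illustrative only: an associative reduction whose output can be fed
// back in as input, which is what lets it be applied "up the tree".
function sumHours(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Leaf-level reduction per key (hours logged per day)...
var day1 = sumHours([1, 4]); // 5
var day2 = sumHours([7, 3]); // 10

// ...and the same function applied again to already-reduced values:
var both = sumHours([day1, day2]); // 15
console.log(both);
```

Because the function is associative, it makes no difference whether it sees raw values or previously reduced partial sums, which is exactly the property an interior b-tree node would rely on.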
Continuing with the simplistic hours-per-day reduction from my original example (but with better data), the b-tree index might look like this.

Interior and root node entries: {child keys are < this value, aggregate for child node, pointer to child node}
Leaf node entries: {key, value}

Root:
  2008-01-05, 41, Leaf 1
  2008-01-08, 24, Leaf 2
  null,       43, Leaf 3

Leaf 1:
  2008-01-01, 10
  2008-01-02, 14
  2008-01-03, 14
  2008-01-04, 3

Leaf 2:
  2008-01-05, 3
  2008-01-06, 12
  2008-01-07, 9

Leaf 3:
  2008-01-08, 1
  2008-01-09, 16
  2008-01-10, 17
  2008-01-11, 9

If we wanted the aggregate of 2008-01-02 to 2008-01-10:

  aggregate(
    aggregate(2008-01-02 to 2008-01-04 of Leaf 1),
    (all of Leaf 2 -- the answer is pre-computed in the parent node, i.e. the root),
    aggregate(2008-01-08 to 2008-01-10 of Leaf 3)
  )

Only the beginning and end of the range would potentially have to be computed on the fly, plus a final aggregate over everything. I believe it would be a rapid operation and could be done on the fly, but I could be wrong.

The problem with doing this client side is that the client doesn't have access to the b-tree index itself. You could work around this by creating multiple map / reduce indexes at different granularities to similar effect (i.e., have monthly, yearly, and decade-level views), but it seems a bit awkward.

I realize my suggestion here might shoot down this idea altogether :) but if this can be done in the index itself efficiently, I think there is a real win here.

Thanks,
-Jason

On Wed, Apr 30, 2008 at 4:06 AM, Jan Lehnardt wrote:
>
> On Apr 30, 2008, at 04:36, Jason Eggleston wrote:
>
> > Hello,
> >
> > I recently tried explaining the power of map / reduce to someone,
> > using a simple example of adding up hours spent on a project over a
> > period of time.
> >
> > The output of the mapping function would be:
> >
> > key, value
> >
> > 4/28/2008, 1
> > 4/28/2008, 4
> > 4/29/2008, 7
> > 4/29/2008, 3
> > ...
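The lookup described above can be sketched in JavaScript. This is a toy model, not CouchDB internals: sum stands in for the reduce/aggregate function, and the tree is just the three leaves and pre-computed root aggregates from the example.

```javascript
// Toy sketch of the proposal: each root entry carries a pre-computed
// aggregate of its child leaf, so a range query only re-aggregates the
// partially covered leaves at each end of the range.
function sum(values) { return values.reduce(function (a, b) { return a + b; }, 0); }

var leaves = [
  [["2008-01-01", 10], ["2008-01-02", 14], ["2008-01-03", 14], ["2008-01-04", 3]],
  [["2008-01-05", 3],  ["2008-01-06", 12], ["2008-01-07", 9]],
  [["2008-01-08", 1],  ["2008-01-09", 16], ["2008-01-10", 17], ["2008-01-11", 9]]
];

// Pre-computed aggregate per leaf, as stored in the root entries above.
var rootAggregates = leaves.map(function (leaf) {
  return sum(leaf.map(function (kv) { return kv[1]; }));
}); // [41, 24, 43]

function rangeAggregate(startKey, endKey) {
  var parts = [];
  leaves.forEach(function (leaf, i) {
    var first = leaf[0][0], last = leaf[leaf.length - 1][0];
    if (last < startKey || first > endKey) return; // leaf entirely outside range
    if (first >= startKey && last <= endKey) {
      parts.push(rootAggregates[i]);               // fully covered: use pre-computed value
    } else {
      parts.push(sum(leaf                          // fractional leaf: aggregate on the fly
        .filter(function (kv) { return kv[0] >= startKey && kv[0] <= endKey; })
        .map(function (kv) { return kv[1]; })));
    }
  });
  return sum(parts);                               // final aggregate over the pieces
}

console.log(rangeAggregate("2008-01-02", "2008-01-10")); // 89
```

Only Leaf 1 and Leaf 3 are scanned here; Leaf 2 is answered entirely from the root entry, which is the claimed win for wide ranges.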
> > The output of the reduce function would be:
> >
> > key, value
> >
> > 4/28/2008, 5
> > 4/29/2008, 10
> > ...
> > etc.
> >
> > Shouldn't I then be able to issue a query for an arbitrary summation
> > of some sequential portion of the output of reduce? For example, a
> > sum of all of the hours between two dates / keys? It's the same
> > reduction function, except applied across multiple keys; it could be
> > annotated in the b-tree index the same way the reduction per key is
> > done now (or will be done).
> >
> > /db/_view/sum_hours/count?start_combine=4/28/2008&end_combine=4/29/2009
> >
> > If it isn't safe to use the reduce function as the combine function,
> > it could be explicit:
> >
> > "views": {
> >   "all": "function (doc) {...snip...}",
> >   "count": {
> >     "map": "function (doc) {...snip...}",
> >     "reduce": "function (tag, counts) {...snip...}",
> >     "combine": "function (counts) {...snip...}"
> >   }
> > }
> >
> > It seems very powerful to me, if this functionality were available.
>
> If I understand Damien's "whitepaper(s)" correctly, the implementation
> of our reduce would be functionally equivalent to the combine you
> ask for. What I haven't seen is how to run reduce again on its own
> output. Damien?
>
> See http://damienkatz.net/2008/02/incremental_map.html and
> http://damienkatz.net/2008/02/incremental_map_1.html and the
> comments for details.
>
> Cheers
> Jan
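For illustration, the design-document sketch quoted above could be written out concretely as follows. The function bodies here are hypothetical fill-ins (the original "...snip..." bodies are not in the message), and "combine" is the proposed addition, not part of CouchDB's shipped view API.

```javascript
// Illustrative only: a filled-in version of the quoted design document.
// All function bodies are hypothetical; the original message elides them.
var designDoc = {
  views: {
    count: {
      map: "function (doc) { emit(doc.date, doc.hours); }",
      reduce: "function (tag, counts) { return sum(counts); }",
      // Proposed: key-free aggregation over already-reduced values.
      combine: "function (counts) { return sum(counts); }"
    }
  }
};

// The contract being proposed: combine must accept reduce's output
// (and its own output) as input, i.e. it is associative over partials.
function sum(values) { return values.reduce(function (a, b) { return a + b; }, 0); }
var combine = function (counts) { return sum(counts); };
console.log(combine([combine([5]), combine([10])])); // 15
```

Note that if combine satisfies this contract, it is "safe to use the reduce function as the combine function" in the sense the message asks about, which is why the two would generally be the same function.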