incubator-couchdb-user mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: how do I do different reduce operations on the same map
Date Tue, 10 Feb 2009 22:51:58 GMT
On Tue, Feb 10, 2009 at 5:31 PM, James Marca
<jmarca@translab.its.uci.edu> wrote:
> I have a situation where I want to run two different reduce functions
> on the output of a single map function.  Like suppose I want one
> reduce function to get the count of all objects in each group (for
> example, documents with or without attachments), and another reduce to
> compute some other aggregate, like the average and standard deviation
> of a value, (like the average size of attached documents).  (Yes, I
> know this is a stupid example, as the averaging reduce function will
> also have the count, but my real case is too complicated to write
> easily).
>
> Should one strive for a minimal set of reduce functions per map (one
> reduce for all three count, average, std deviation), or does it make
> sense to identically copy the maps and make multiple reduce functions
> (one reduce _each_ for count, mean, std dev)?  (again, ignore the fact
> that you compute  count and mean when computing std dev, etc)
>
> I have a feeling from reading the various docs that identical map
> functions are only executed once in CouchDB.  If that is true, then is
> it _also_ true that having lots of reduce functions for one map is not
> any more expensive (in terms of space and computational speed) than
> trying for a minimal set of map-reduce pairs.  Any advice on this?
>
> Thanks in advance,
> James
>

Your reading of the docs is spot on. If you have byte-identical map
functions, only a single btree is used for both maps. At the moment,
the only way to reuse a single btree with multiple reduce functions is
to do exactly what you suggested: copy your map into each view and
attach a different reduce function to each copy.
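
As a rough illustration of the copy-the-map approach, here is a minimal sketch of a design document with two views that share one map source. The design document name, the view names, and the attachment-size logic are all hypothetical; the map is held in a single string so the two copies stay byte identical.

```javascript
// Hypothetical map source, kept in one string so both views get
// exactly the same bytes.
var sharedMap = [
  "function (doc) {",
  "  // Key: does the doc have attachments? Value: their total size.",
  "  var size = 0, hasAttachments = false;",
  "  for (var name in (doc._attachments || {})) {",
  "    hasAttachments = true;",
  "    size += doc._attachments[name].length;",
  "  }",
  "  emit(hasAttachments, size);",
  "}"
].join("\n");

// Hypothetical design document: same map twice, a different reduce each.
var designDoc = {
  _id: "_design/attachments",
  language: "javascript",
  views: {
    count: {
      map: sharedMap,
      // Count rows; on rereduce, add up the partial counts.
      reduce: "function (keys, values, rereduce) { return rereduce ? sum(values) : values.length; }"
    },
    total_size: {
      map: sharedMap,
      // Summing is the same operation for reduce and rereduce.
      reduce: "function (keys, values, rereduce) { return sum(values); }"
    }
  }
};

// Print the JSON body that would be saved as the design document.
console.log(JSON.stringify(designDoc, null, 2));
```

Building both views from the same string is just one way to guard against a stray whitespace difference, which would make the maps no longer byte identical.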

Before I go on, I should mention that the best way to figure this out
would be to set up a couple of benchmarks and measure whether there's
any noticeable difference between having multiple reduce functions
and one complex one.

That said, each additional reduce function adds a round-trip through
the view server every time a reduce is called. I would cautiously
lean towards thinking that this isn't going to be as much overhead as
you might think. I.e., I find it more likely that view generation is
going to be dominated by other things.

The space requirement should roughly track the size of the output
that either method produces. I.e., having multiple reduce functions
isn't in and of itself going to cause you problems. The only overhead
I can think of is a bit extra for the Erlang serialization of a
slightly different term format in each case.
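
For comparison, the "one complex reduce" alternative might look something like the sketch below. It assumes the map emits a plain number as its value; the function names and the {count, sum, sumsqr} output shape are illustrative choices, not anything CouchDB prescribes. The reduce output is a small fixed-size object, so the stored result stays compact however many rows are reduced.

```javascript
// Sketch of a combined reduce: count, sum and sum of squares in one pass.
function statsReduce(keys, values, rereduce) {
  var out = { count: 0, sum: 0, sumsqr: 0 };
  for (var i = 0; i < values.length; i++) {
    var v = values[i];
    if (rereduce) {
      // values are partial results returned by earlier reduce calls
      out.count += v.count;
      out.sum += v.sum;
      out.sumsqr += v.sumsqr;
    } else {
      // values are the raw numbers emitted by the map
      out.count += 1;
      out.sum += v;
      out.sumsqr += v * v;
    }
  }
  return out;
}

// Derive the mean and population standard deviation on the client side.
function summarize(r) {
  var mean = r.sum / r.count;
  var variance = r.sumsqr / r.count - mean * mean;
  return { count: r.count, mean: mean, stddev: Math.sqrt(Math.max(variance, 0)) };
}

// Local check without CouchDB: reduce two chunks, then rereduce them.
var a = statsReduce(null, [2, 4, 4, 4], false);
var b = statsReduce(null, [5, 5, 7, 9], false);
var total = statsReduce(null, [a, b], true);
console.log(summarize(total));
```

Only statsReduce would go into the view; summarize is meant to run in whatever client reads the view result.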

HTH,
Paul Davis
