incubator-couchdb-user mailing list archives

From "Jeff Hinrichs - DM&T" <dunde...@gmail.com>
Subject Re: reduce/rereduce confusion
Date Thu, 22 Jan 2009 00:25:12 GMT
On Wed, Jan 21, 2009 at 10:56 AM, Adam Wolff <awolff@gmail.com> wrote:
> Ok great. Thanks for the clarification. So, as far as the implementation
> goes, if I ask for a key range from a view with a reduce function, is the
> value recomputed for that key range every time, or is it cached somehow? The
> detailed answer will probably make my eyes glaze over; I'm just trying to
> understand if it's constant time, or it's a complex algorithm but cached, or
> whatever.
> And then, just to state it explicitly, since reduce can be called with
> arbitrary key ranges, is there any meaning to the way the key ranges get
> broken up? Are there any guarantees about where the boundaries in the key
> ranges will be?
>
> Finally, why isn't my reduce function always run a final time with
> rereduce=true to produce final, consolidated output? Does rereduce sometimes
> run at query time?

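rereduce = True happens when couch calls your reduce function on
values that your reduce function itself returned earlier, rather than
on the values emitted by your map function. It only matters to your
code when those results are not the same type as the values originally
sent to the reduce.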

If your reduce function goes something like:

def reduceFn(keys,values,rereduce):
  return sum(values)  #values are a list of integers

you sum them up and return an integer result.  Because the output type
of reduceFn matches the original input type of the values, the function
never needs to check rereduce: summing partial sums is the same
operation as summing the original values.
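
To make the sum case concrete, here is a minimal sketch in Python. The
simulate_view_reduce helper is hypothetical: it only imitates a server that
reduces the mapped rows in chunks and then combines the partial results with
rereduce set to True. It is not CouchDB's actual code, and for simplicity the
keys are passed as plain key strings.

```python
def reduceFn(keys, values, rereduce):
    # Same body for both passes: partial sums are integers, like the inputs.
    return sum(values)


def simulate_view_reduce(reduce_fn, rows, chunk_size):
    """Hypothetical stand-in for the server: reduce in chunks, then rereduce."""
    partials = []
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        keys = [key for key, _ in chunk]
        values = [value for _, value in chunk]
        partials.append(reduce_fn(keys, values, False))
    if len(partials) == 1:
        # Everything fit in one chunk, so no rereduce pass is needed.
        return partials[0]
    # Combine the partial results; keys are not available on this pass.
    return reduce_fn(None, partials, True)


rows = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
total_chunked = simulate_view_reduce(reduceFn, rows, chunk_size=2)
total_single = simulate_view_reduce(reduceFn, rows, chunk_size=10)
print(total_chunked, total_single)
```

Both runs give the same total, whether or not a rereduce pass was involved.
With a single chunk the last call is made with rereduce=False, which may be
why a log can show the final call without rereduce.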

Now say you want to do something fancy with the reduce -- I don't have
a good example but bear with me.

def reduceFn(keys,values,rereduce):
  if not rereduce:
    # keys has a list of keys and values has a list of integers
    # do some combining and return a structure -- say a dictionary or a hash
    # (the sum/count fields are only an illustration)
    myDict = {"sum": sum(values), "count": len(values)}
    return myDict
  else:
    # rereduce is now true and couch is sending the function the
    # dictionaries/hashes returned by the not-rereduce logic
    # keys is empty and values is a list of your objects
    # do some further processing and return results
    return {"sum": sum(v["sum"] for v in values),
            "count": sum(v["count"] for v in values)}

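So in short, if your reduce function returns an aggregate that is the
same type as the values in the values list, and combining aggregates is
the same operation as combining the original values (as with a sum),
you can ignore rereduce. Otherwise you need a separate rereduce branch.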
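
One caveat worth sketching: a count also returns integers, yet it still needs
a rereduce branch, because counting partial counts is not the same as adding
them. The snippet below is a hand-rolled two-pass run under assumed chunk
boundaries, not a trace of what CouchDB does. The names naive_count and count
are made up for the illustration, and keys are passed as None throughout
because neither function uses them.

```python
def naive_count(keys, values, rereduce):
    # Ignores rereduce: on the second pass this counts partial results,
    # not the original rows.
    return len(values)


def count(keys, values, rereduce):
    if rereduce:
        # values holds counts returned by earlier calls, so add them up.
        return sum(values)
    # values holds the values emitted by the map function.
    return len(values)


# Five mapped values, split (by assumption) into chunks of 2, 2 and 1.
chunks = [[10, 20], [30, 40], [50]]

naive_partials = [naive_count(None, chunk, False) for chunk in chunks]
naive_total = naive_count(None, naive_partials, True)

partials = [count(None, chunk, False) for chunk in chunks]
total = count(None, partials, True)

print(naive_total, total)
```

The naive version reports the number of chunks (3), while the version with a
rereduce branch reports the number of rows (5).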

I hope I haven't butchered this too badly. If so, try reading
through the message below and see if it does a better job.
http://www.mail-archive.com/couchdb-user@incubator.apache.org/msg01309.html

Regards,

Jeff
sorry for writing python in the examples, but I've been doing it all
day and I just couldn't get the js to kick in ;)

Some time ago, I used to right code and let the compiler sort it out,
Today I write code and let my editor<g>.

>
> Thanks again,
> A
>
> On Tue, Jan 20, 2009 at 9:13 PM, Chris Anderson <jchris@apache.org> wrote:
>
>> On Tue, Jan 20, 2009 at 8:05 PM, Adam Wolff <awolff@gmail.com> wrote:
>> > After looking at this more, let me restate. I would totally get all of this
>> > if the signature of reduce was:
>> > reduce: function(key, values, rereduce)
>> >
>> > What I don't get is: why does reduce get called with an arbitrarily long
>> > list of keys? I thought reduce was precisely for reducing all of the mapped
>> > inputs that are indexed under the *same* key. I think if I can get that, the
>> > rest will come clear.
>>
>> The thing that makes CouchDB's reduce different from, say, the Hadoop
>> implementation, is that it does not group by key at computation time.
>>
>> Instead, a reduce function should aim to return a single value for the
>> entire view. E.g., 15,346, which could be the total number of posts in
>> your view.
>>
>> CouchDB allows you to query for reduction values for any arbitrary key
>> range very efficiently. So depending on your key structure, if you
>> want the total number of posts by jchris in January, you could ask for
>> reduce for all keys between
>>
>> ["jchris",[2009,0]] and ["jchris",[2009,0,{}]]
>>
>> and get a result of, say, 14.
>>
>> For details about specifying start and end keys see
>> http://wiki.apache.org/couchdb/View_collation
>>
>> The group=true and group_level parameters may seem confusing at first,
>> but once you understand that they are just macros for running a series
>> of reduce queries (where CouchDB will pick key ranges for you), they
>> aren't so mysterious.
>>
>> >
>> > Thanks again,
>> > A
>> >
>> > On Tue, Jan 20, 2009 at 7:52 PM, Adam Wolff <awolff@gmail.com> wrote:
>> >
>> >> Thanks for the reply!
>> >> I'd seen all of this, though I re-read the wikipedia entry carefully.
>> >> Damien's blog entries don't appear to match the APIs in the version I'm
>> >> running, which is 0.8.1
>> >> The wikipedia entry suggests that reduce is called only with values that
>> >> match a single key. Using the log() function in CouchDB, I can see that's
>> >> not the case for its reduce function -- it's called with multiple different
>> >> keys, though it does appear that the input values are *ordered* by matching
>> >> keys.
>> >>
>> >> Anyway, I totally get how re-reduce (or "combine") works in conventional
>> >> map/reduce, but I'm hazy on the details w/r/t CouchDB. I'm starting to
>> >> understand the answer to #1, but I'm really unclear on #2 (how/why rereduce
>> >> is run.)
>> >>
>> >> Thanks again,
>> >> A
>> >>
>> >>
>> >> On Tue, Jan 20, 2009 at 6:50 PM, Jeff Hinrichs - DM&T <dundeemt@gmail.com> wrote:
>> >>
>> >>> On Tue, Jan 20, 2009 at 7:47 PM, Adam Wolff <awolff@gmail.com> wrote:
>> >>> > Hi everyone, I'm really excited about CouchDB and I've started playing with
>> >>> > it. I get all of it, except for reduce, and especially re-reduce.
>> >>> >
>> >>> > My first question is: how does CouchDB maintain all the separate output for
>> >>> > a given key from the map function? I mean: given a simple reduce that just
>> >>> > sums results, how does couch maintain separate results for each possible
>> >>> > key/key range that can be given as input to that view?
>> >>> >
>> >>> > My second question: when and why does rereduce get called? Is this simply to
>> >>> > allow the server to chunk the processing, or is there semantic meaning to
>> >>> > it? I had assumed the former -- it's just a way of limiting the size of the
>> >>> > input to the reduce function -- but then this really confused me: if I log
>> >>> > each time my reduce function gets called, I see that the last time it's
>> >>> > called, it's with rereduce=false. How is this possible? Don't all the
>> >>> > results have to be funneled through rereduce to produce a single result
>> >>> > value?
>> >>> >
>> >>> > Any help here would be much appreciated. If there's a resource on the web I
>> >>> > should look at, please send it my way. Thanks!
>> >>> >
>> >>> > A
>> >>> Being that I just went through the learning process on reduce, I'll
>> >>> point you here:
>> >>> http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
>> >>> "Reduce Functions"
>> >>>
>> >>> As a good place to start.
>> >>> Also, the mailing list is an excellent resource.
>> >>>
>> >>> http://mail-archives.apache.org/mod_mbox/couchdb-user/200901.mbox/%3c61B374C7-34D7-45C3-9F8B-F11EFD77303D@apache.org%3e
>> >>>
>> >>> along with:
>> >>> http://en.wikipedia.org/wiki/MapReduce
>> >>> http://labs.google.com/papers/mapreduce.html
>> >>> and
>> >>> http://damienkatz.net/2008/02/incremental_map.html
>> >>>
>> >>> Regards,
>> >>>
>> >>> Jeff
>> >>>
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Chris Anderson
>> http://jchris.mfdz.com
>>
>
