incubator-couchdb-user mailing list archives

From "Jeff Hinrichs - DM&T" <dunde...@gmail.com>
Subject Re: reduce/rereduce confusion
Date Thu, 22 Jan 2009 00:28:11 GMT
On Wed, Jan 21, 2009 at 6:25 PM, Jeff Hinrichs - DM&T
<dundeemt@gmail.com> wrote:
> On Wed, Jan 21, 2009 at 10:56 AM, Adam Wolff <awolff@gmail.com> wrote:
>> Ok great. Thanks for the clarification. So, as far as the implementation
>> goes, if I ask for a key range from a view with a reduce function, is the
>> value recomputed for that key range every time, or is it cached somehow? The
>> detailed answer will probably make my eyes glaze over; I'm just trying to
>> understand if it's constant time, or it's a complex algorithm but cached, or
>> whatever.
>> And then, just to state it explicitly, since reduce can be called with
>> arbitrary key ranges, is there any meaning to the way the key ranges get
>> broken up? Are there any guarantees about where the boundaries in the key
>> ranges will be?
>>
>> Finally, why isn't my reduce function always run a final time with
>> rereduce=true to produce final, consolidated output? Does rereduce sometimes
>> run at query time?
>
> rereduce = True happens when couch calls your reduce function and
> is feeding it results that your reduce function returned earlier, rather
> than the values originally emitted by the map.
>
> If your reduce function goes something like:
>
> def reduceFn(keys,values,rereduce):
>  return sum(values)  #values are a list of integers
>
> you sum them up and return an integer result.  Since the output type of
> reduceFn matches the original input type of the values, the same code also
> works when rereduce is True, so there is no need to check it.
>
> Now say you want to do something fancy with the reduce -- I don't have
> a good example but bear with me.
>
> def reduceFn(keys,values,rereduce):
>  if not rereduce:
>    # keys has a list of keys and values has a list of integers
>    # ... do some combining and build a structure -- say a dictionary or a hash
>    return myDict
>  else:
>    # rereduce is now true and couch is sending the function the
>    # dictionaries/hashes returned by the not-rereduce logic
>    # keys is None (null) and values is a list of your objects
>    # ... do some further processing
>    return results
>
> So in short, if your reduce function returns an aggregate that is the
> same type as the values in the values list and can be combined the same
> way (as with sum), you don't need to check rereduce.

The above should continue:
If your reduce returns results that are not the same type that was fed to it,
couch will eventually feed them back to you in that different type and
use rereduce=True to tell you to eat your own dog food.
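
To make that concrete, here is a minimal Python sketch of a reduce that
returns a different type than it is fed. The count/total dictionary, the
name reduce_fn, and the small harness at the bottom are my own illustration
and not anything CouchDB ships; a real view server calls the function for
you, and CouchDB reduce functions are normally written in JavaScript.

```python
def reduce_fn(keys, values, rereduce):
    """Reduce integers to a {"count", "total"} dict (hypothetical shape)."""
    if not rereduce:
        # First pass: values are the integers emitted by the map function.
        return {"count": len(values), "total": sum(values)}
    # Rereduce pass: values are dicts returned by earlier calls to this
    # same function, so they have to be combined field by field.
    return {
        "count": sum(v["count"] for v in values),
        "total": sum(v["total"] for v in values),
    }


# Illustrative harness (an assumption, not CouchDB code): pretend the
# server reduced two chunks separately and then combined the results.
chunk_a = reduce_fn(["a", "a"], [1, 2], False)
chunk_b = reduce_fn(["b"], [3], False)
combined = reduce_fn(None, [chunk_a, chunk_b], True)
print(combined)  # {'count': 3, 'total': 6}
```

The rereduce branch is needed here because calling len() or sum() directly
on a list of dictionaries would give the wrong answer or fail outright.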

> I hope I haven't butchered this too badly.    If so, try reading
> through the below and see if it does a better job.
> http://www.mail-archive.com/couchdb-user@incubator.apache.org/msg01309.html
>
> Regards,
>
> Jeff
> sorry for writing python in the examples, but I've been doing it all
> day and I just couldn't get the js to kick in ;)
>
> Some time ago, I used to write code and let the compiler sort it out,
> Today I write code and let my editor<g>.
>
>>
>> Thanks again,
>> A
>>
>> On Tue, Jan 20, 2009 at 9:13 PM, Chris Anderson <jchris@apache.org> wrote:
>>
>>> On Tue, Jan 20, 2009 at 8:05 PM, Adam Wolff <awolff@gmail.com> wrote:
>>> > After looking at this more, let me restate. I would totally get all of
>>> > this if the signature of reduce was:
>>> > reduce: function(key, values, rereduce)
>>> >
>>> > What I don't get is: why does reduce get called with an arbitrarily long
>>> > list of keys? I thought reduce was precisely for reducing all of the
>>> > mapped inputs that are indexed under the *same* key. I think if I can get
>>> > that, the rest will come clear.
>>>
>>> The thing that makes CouchDB's reduce different from, say, the Hadoop
>>> implementation, is that it does not group by key at computation time.
>>>
>>> Instead, a reduce function should aim to return a single value for the
>>> entire view, e.g. 15,346, which could be the total number of posts in
>>> your view.
>>>
>>> CouchDB allows you to query for reduction values for any arbitrary key
>>> range very efficiently. So depending on your key structure, if you
>>> want the total number of posts by jchris in January, you could ask for
>>> the reduce value for all keys between
>>>
>>> ["jchris",[2009,0]] and ["jchris",[2009,0,{}]]
>>>
>>> and get a result of, say, 14.
>>>
>>> For details about specifying start and end keys see
>>> http://wiki.apache.org/couchdb/View_collation
>>>
>>> The group=true and group_level parameters may seem confusing at first,
>>> but once you understand that they are just macros for running a series
>>> of reduce queries (where CouchDB will pick key ranges for you), they
>>> aren't so mysterious.
>>>
>>> >
>>> > Thanks again,
>>> > A
>>> >
>>> > On Tue, Jan 20, 2009 at 7:52 PM, Adam Wolff <awolff@gmail.com> wrote:
>>> >
>>> >> Thanks for the reply!
>>> >> I'd seen all of this, though I re-read the Wikipedia entry carefully.
>>> >> Damien's blog entries don't appear to match the APIs in the version I'm
>>> >> running, which is 0.8.1.
>>> >> The Wikipedia entry suggests that reduce is called only with values that
>>> >> match a single key. Using the log() function in CouchDB, I can see that's
>>> >> not the case for its reduce function -- it's called with multiple
>>> >> different keys, though it does appear that the input values are *ordered*
>>> >> by matching keys.
>>> >>
>>> >> Anyway, I totally get how re-reduce (or "combine") works in conventional
>>> >> map/reduce, but I'm hazy on the details w/r/t CouchDB. I'm starting to
>>> >> understand the answer to #1, but I'm really unclear on #2 (how/why
>>> >> rereduce is run).
>>> >>
>>> >> Thanks again,
>>> >> A
>>> >>
>>> >>
>>> >> On Tue, Jan 20, 2009 at 6:50 PM, Jeff Hinrichs - DM&T
>>> >> <dundeemt@gmail.com> wrote:
>>> >>
>>> >>> On Tue, Jan 20, 2009 at 7:47 PM, Adam Wolff <awolff@gmail.com> wrote:
>>> >>> > Hi everyone, I'm really excited about CouchDB and I've started playing
>>> >>> > with it. I get all of it, except for reduce, and especially re-reduce.
>>> >>> >
>>> >>> > My first question is: how does CouchDB maintain all the separate output
>>> >>> > for a given key from the map function? I mean: given a simple reduce that
>>> >>> > just sums results, how does couch maintain separate results for each
>>> >>> > possible key/key range that can be given as input to that view?
>>> >>> >
>>> >>> > My second question: when and why does rereduce get called? Is this
>>> >>> > simply to allow the server to chunk the processing, or is there semantic
>>> >>> > meaning to it? I had assumed the former -- it's just a way of limiting
>>> >>> > the size of the input to the reduce function -- but then this really
>>> >>> > confused me: if I log each time my reduce function gets called, I see
>>> >>> > that the last time it's called, it's with rereduce=false. How is this
>>> >>> > possible? Don't all the results have to be funneled through rereduce to
>>> >>> > produce a single result value?
>>> >>> >
>>> >>> > Any help here would be much appreciated. If there's a resource on the
>>> >>> > web I should look at, please send it my way. Thanks!
>>> >>> >
>>> >>> > A
>>> >>> Being that I just went through the learning process on reduce, I'll
>>> >>> point you here:
>>> >>> http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
>>> >>> "Reduce Functions"
>>> >>>
>>> >>> As a good place to start.
>>> >>> Also, the mailing list is an excellent resource.
>>> >>>
>>> >>>
>>> >>> http://mail-archives.apache.org/mod_mbox/couchdb-user/200901.mbox/%3c61B374C7-34D7-45C3-9F8B-F11EFD77303D@apache.org%3e
>>> >>>
>>> >>> along with:
>>> >>> http://en.wikipedia.org/wiki/MapReduce
>>> >>> http://labs.google.com/papers/mapreduce.html
>>> >>> and
>>> >>> http://damienkatz.net/2008/02/incremental_map.html
>>> >>>
>>> >>> Regards,
>>> >>>
>>> >>> Jeff
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Chris Anderson
>>> http://jchris.mfdz.com
>>>
>>
>
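
One way to picture Chris's point in the quoted message, that group=true and
group_level are macros for running a series of reduce queries: sort the rows
by key, cut the sorted rows wherever the first group_level elements of the
key change, and reduce each slice. This is only a sketch. The rows, the
flat [author, year, month] key layout, the helper name grouped_reduce, and
the plain sum are all assumptions for illustration, and Python's list
ordering stands in for CouchDB's view collation, which has its own rules.

```python
from itertools import groupby


def grouped_reduce(rows, group_level):
    """Sum values per key prefix, mimicking a group_level query (sketch)."""

    def prefix(row):
        # The first group_level elements of the key identify the group.
        return row[0][:group_level]

    # groupby only merges adjacent rows, so sort by full key first.
    ordered = sorted(rows, key=lambda row: row[0])
    return [
        (key_prefix, sum(value for _, value in group))
        for key_prefix, group in groupby(ordered, key=prefix)
    ]


# Hypothetical map output: key is [author, year, month], value is 1 per post.
rows = [
    (["jchris", 2009, 0], 1),
    (["jchris", 2009, 0], 1),
    (["jchris", 2009, 1], 1),
    (["adam", 2009, 0], 1),
]
print(grouped_reduce(rows, 1))  # one total per author
print(grouped_reduce(rows, 3))  # one total per author, year and month
```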



-- 
Jeff Hinrichs
Dundee Media & Technology, Inc
jeffh@dundeemt.com
402.218.1473
web: www.dundeemt.com
blog: inre.dundeemt.com
