couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeff Hinrichs - DM&T" <dunde...@gmail.com>
Subject Re: on Reduce w/ Python view server
Date Tue, 20 Jan 2009 13:11:21 GMT
On Mon, Jan 19, 2009 at 10:44 PM, Paul Davis
<paul.joseph.davis@gmail.com> wrote:
> On Mon, Jan 19, 2009 at 6:15 PM, Jeff Hinrichs - DM&T
> <dundeemt@gmail.com> wrote:
>> On Mon, Jan 19, 2009 at 8:24 PM, Jeff Hinrichs - DM&T
>> <dundeemt@gmail.com> wrote:
>>>
>>>
>>> On Mon, Jan 19, 2009 at 6:25 PM, Paul Davis <paul.joseph.davis@gmail.com>
wrote:
>>>>
>>>> On Mon, Jan 19, 2009 at 7:15 PM, Jeff Hinrichs - DM&T
>>>> <jeffh@dundeemt.com> wrote:
>>>> > Using couchdb-python 0.5 and couchdb 0.9.0a735191-incubating
>>>> >
>>>> > I am working on a "rosetta stone" for javascript and python views. (nothing
>>>> > like rewriting in a different language to lean something<g>) I've
worked
>>>> > though the simple map only views.  I then did a simple reduce(keys,vals)
>>>> > which worked out fine.  But I am stumped on rereduce when the function
>>>> > signature includes keys,vals,rereduce.  I've been trying to work through
the
>>>> > js code from "Top N Tags" from the snippets page,
>>>> > http://wiki.apache.org/couchdb/View_Snippets
>>>> >
>>>> > In particular, I am confused as to what is passed to the rereduce function
>>>> > when rereduce=True.  The javascript is returning a complex structure
instead
>>>> > of a simple scalar, I see that it has to do with getting state back
from a
>>>> > previous reduce operation, but I'm confused. The unpacking of the previous
>>>> > results, I think has me befuddled ;(
>>>> >
>>>>
>>>> The reduce function is always returning a structure of the form:
>>>>
>>>> {
>>>>    "tag1": N1,
>>>>    "tag2": N2,
>>>>    ...
>>>> }
>>>>
>>>> When you get to rereduce=true, then the values array is an array of
>>>> the structures that your code needs to combine.
>>>>
>>>> And for berevity, the last bit of code outside the if statement is
>>>> just discarding all tags below the top N so that the growth of data
>>>> doesn't exceed the log(num_rows) rule.
I was unaware of this log(num_rows) rule, could you point me to a
reference about it in the docs?  Not being flip, as it would appear to
make sense from an optimization stand point.  However, since N must be
known and set by the developer at coding time, it doesn't appear to be
that useful.  I don't see where the map/reduce feature of couch, or in
general, ever tells the map/reduce function the total size of the
input space.  I thought you would always assume that it was unknown.

>>>>
>>>> HTH,
>>>> Paul Davis
>>>>
>>> That make sense from what I am reading.  thanks.
>>> Any idea about the python view server from couchdb-python?  The values tuple
being returned works just fine however, the keys tuple, which appears to be a tuple of tuples
is causing my reduce function to fail silently whenever I try to access an element by index.
>>>
>>> keys looks like [[tag, object_id],[tag, object_id],...]
>>>
>>> However when I try to access keys[0] (should be [tag, object_id]) or keys[0][0]
(should be 'tag') my reduce script silently fails.  If I just ignore tags and sum the values
in a simple map/reduce I get the correct counts as simple vectors or atleast the same answer
as the javascript equivalent.  I'm hoping CMLenz will join the discussion or someone point
me in the proper direction.
>>>
>>> Regards,
>>>
>>> Jeff
>>
>> Ok, answering my own post -- you bang on your keyboard long enough and
>> you just might keep up with the monkeys<g>
>>
>> I'm getting closer --
>> def reduce(keys,vals,rereduce):
>>  tags = {}
>>  if not rereduce:
>>    for i in range(len(keys)):
>>      tags[keys[i][0]] = tags.get(keys[i][0],0) + vals[i]
>>
>>  else:
>>    tags = vals[0]
>>    for i in range(1,len(vals)):
>>      return vals
>>      #tags[val[i][0]] += tags.get(val[i][0],0) + val[i][1]
>>
>>  return tags
>>
>
> Never used the python view server, but I think what you're wanting is
> something along the lines of:
>
> def reduce(keys, vals, rereduce):
>    N = 5 # N top tags to return
>    tags = {}
>    if(!rereduce):
>        for k, v in zip(keys, vals):
>            tags[k] += v
>    else:
>        tags = vals[0]
>        for obj in vals[1:]:
>            for k, v in obj.iteritems():
>                tags[k] = tags.get(k, 0) + v
>
>    keepers = list(tags.keys())
>    keepers.sort(key=lambda x: tags.get(x), reverse=True)
>    return dict([(k, tags[k]) for k in keepers[:N]);
>
You were quite close, here is my working code

def reduce(keys,vals,rereduce):
  tags = {}
  if not rereduce:
    for ks,v in zip(keys,vals): # zip([[tag0, object_id0],...], [v0,...])
      k0,k1 = ks  #ks=[tag, object_id]
      tags[k0] = tags.get(k0,0) + v

  else: # is a rereduce run
    for val in vals:  #keys is None on rereduce, vals=[iteralableObj0,...]
      for k,v in val.iteritems():
        tags[k] = tags.get(k,0) + v

  return tags

Some things I've noticed.
1) The javascript reduce code is not affected by changing the MAX
value.  The same results are returned for MAX=2 as MAX=32767, so the
above code is functionally the same as the js in the wiki. (I could be
misinterpreting what is intended by the js code.)
2) If the js reduce is suppose to return the Top N elements, from what
I currently know, I don't believe you can limit the number of results
returned as is intended since couchdb expects a  result for each view
element. Using the python from above, if you delete tags[k] in the
rereduce section
...
  else: # is a rereduce run
    for val in vals:  #keys is None on rereduce, vals=[iteralableObj0,...]
      for k,v in val.iteritems():
        tags[k] = tags.get(k,0) + v
        del tags[k]
...

This, as an experiment, would delete any element that required
re-reducing, results in the following output:
-key-             -value-
"cool"           {cool: 49}
"couchdb"   {couchdb: 46}
"hopeful"     {hopeful: 1}
"hot"             {}
"neat"          {neat: 50}
"python"	     {}

Showing that the number of view keys returned by couchdb is unaffected
by these types of attempted manipulations.  Only the view values are
being affected.  If the number of view elements can not be affected,
then retrieving the TopN results would require further processing by
the client.  If this is the case, then the simpler

def reduce(keys,values):
  return sum(values)

Would be more efficient.  However, what I am seeing may be a result of
the python view server implementation. But after reading, Damien's
blog post, http://damienkatz.net/2008/02/incremental_map_1.html, it
appears to be how it is.  (not that I have a problem with it, just
attempting to get my mental picture right ;))


So when, rereduce is True, you are operating on the output of your
rereduce=False operation(s)??
This being important when you return a value of a different type than
passed to the original primary reduce operation.??

When you are outputting the same type of value as you are getting in,
you don't care about the rereduce param ??


somewhat-more-enlightened-but-still-fog'ly,

Jeff

Mime
View raw message