Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
Received-SPF: pass (nike.apache.org: domain of truemped@googlemail.com
 designates 72.14.220.153 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=googlemail.com; s=gamma;
        h=content-type:mime-version:subject:from:in-reply-to:date
         :content-transfer-encoding:message-id:references:to:x-mailer;
        b=JAFN6sRW/2mQGgeB62cXa45Q10JTupdMC9/Ep6U4eBZBm6jwTq5GcONGOifc7sZBMP
         YpGNOSeZPIRVApACJMOXxqawfHtn+0NRUh9NWJuwDwUE94mTIm9h6qXqIFV0d/jAwhcr
         w54dmRf/YXBWgBRy0Zt5CBd+YZdM7DRg3EWGI=
Content-Type: text/plain; charset=us-ascii; format=flowed; delsp=yes
Mime-Version: 1.0 (Apple Message framework v1076)
Subject: Re: Sorting items by number of votes
From: Daniel Truemper <truemped@googlemail.com>
In-Reply-To: <254CF962-CC02-4FEA-9A60-ECF2E71CA2D2@devonweller.com>
Date: Thu, 5 Nov 2009 18:43:17 +0100
Content-Transfer-Encoding: 7bit
Message-Id: <B9B4BFDF-EA7A-4D1D-B156-FD6ED652B38E@googlemail.com>
References: <3E00739B-2973-4A13-AF4D-9992A39A6527@devonweller.com>
 <ea28b75b0911050604p20dcd1f2ldf28dc61a719d96c@mail.gmail.com>
 <254CF962-CC02-4FEA-9A60-ECF2E71CA2D2@devonweller.com>
To: user@couchdb.apache.org

Hi,

> Using a list is an interesting idea.  Although, I suspect that  
> method would become inefficient for things like "give me the 10  
> resources with the most votes" when there are 10,000 resources in  
> the database.
hm, I don't think that a list is the appropriate way to go here since  
a list is based on a view again (AFAIK).

> I think my solution will be to create a map reduce which just counts  
> the votes by resource_id.  And then use that information to do a  
> bulk request for the top 10 documents by ID.

> Regarding the example in the wiki:
>
> As a new user, I just followed an example from the wiki: http://wiki.apache.org/couchdb/View_Snippets#aggregate_sum
>
> Is that example an incorrect way of using a CouchDB view?  Should it  
> be removed?
No, it is not. If you step through it you will notice what happens.  
Here is the code again:

// Map function
function(doc) {
   if (doc.Type == "customer") {
     emit([doc._id, 0], doc);
   } else if (doc.Type == "order") {
     emit([doc.customer_id, 1], doc);
   }
}

// Reduce function
// Only produces meaningful output.customer_details if group_level >= 1
function(keys, values, rereduce) {
   var output = {};
   if (rereduce) {
     for (idx in values) {
       if (values[idx].total !== undefined) {
         output.total += values[idx].total;
       } else if (values[idx].customer_details !== undefined) {
         output.customer_details = values[idx].customer_details;
       }
     }
   } else {
     for (idx in values) {
       if (values[idx].Type == "customer") output.customer_details =  
doc;
       else if (values[idx].Type == "order") output.total += 1;
     }
   }
   return output;
}


1. The map function emits a compound key that is [doc._id, 0] for  
customers and [doc.customer_id, 1] for orders. So in the case of 2  
customers with 3 orders each the resulting tree looks like the  
following:

key, value

[ 1, 0 ], customer1
[ 1, 1 ], customer1order
[ 1, 1 ], customer1order
[ 1, 1 ], customer1order
[ 2, 0 ], customer2
[ 2, 1 ], customer2order
[ 2, 1 ], customer2order
[ 2, 1 ], customer2order

with customer1 having id 1 and customer2 having id 2.


2. In the reduce function are 2 different kinds of reduction. A small  
note: you should call the view with group_level=1 otherwise you won't  
get aggregated results!

The basic operation of the reduce function is the second part:

     for (idx in values) {
       if (values[idx].Type == "customer") output.customer_details =  
doc;
       else if (values[idx].Type == "order") output.total += 1;
     }

What happens is that it iterates through all key/value pairs where the  
key begins with [1]. So:

[ 1, 0 ], customer1
[ 1, 1 ], customer1order
[ 1, 1 ], customer1order
[ 1, 1 ], customer1order

So for the first entry (a customer) the output object is filled with  
the customer doc. For all orders a counter inside the output object is  
increased. So in the end the following would be returned:

{
   customer_details = customer1,
   total = 3
}

So: input is a list of 4 values, the output only contains 1.


3. The second case of the reduce function deals with the phase where  
there are so many orders that CouchDB internally stores mid-values of  
not all orders with customers in the same tree bucket.

Example:

            top
         /         \
       / \          / \
      /   \        /   \
   a     b     c     d

a = [ 1, 0 ], customer1
b = [ 1, 1 ], customer1order
c = [ 1, 1 ], customer1order
d = [ 1, 1 ], customer1order

So if CouchDB internally stores mid-values of a+b and c+d you will  
have the two output objects:

{
   customer_details = customer1,
   total = 1
}

{
   total = 2
}

These two values are now used to rereduce:

     for (idx in values) {
       if (values[idx].total !== undefined) {
         output.total += values[idx].total;
       } else if (values[idx].customer_details !== undefined) {
         output.customer_details = values[idx].customer_details;
       }
     }

So in the end you again have the above value:

{
   customer_details = customer1,
   total = 3
}


>>> Here is my reduce function:
>>>
>>> function(keys, values, rereduce) {
>>>      var score = 0;
>>>      var output_doc = {};
>>>
>>>      for (var i=0; i < values.length; i++) {
>>>              if (values[i].type == 'vote') {
>>>                      ++score;
>>>              } else if (values[i].type == 'resource') {
>>>                      output_doc = values[i];
>>>              }
>>>      }
>>>
>>>      return {doc:output_doc, score:score};
>>> }

First I think you need to implement the rereduce phase, otherwise you  
will get wrong numbers with large amounts of data.

 From looking at your reduce function I seem to remember that the  
error message is based on some byte length difference between the  
incoming and outgoing value of the reduce function. So if the incoming  
values only contain one very large resource document and several  
smaller votes, the fact that you are returning the resource document  
might get in your way here. So I think it would be better if you would  
only store the document id and get that document in a separate call to  
the db.


And a little side note: at the moment you cannot order the view based  
on the value! Ordering is only done by keys!

You could however write another type of document (VoteCount) into your  
database containing the resource and the number of votes. Then  
emitting as key something like [ #votes, resource ] will give you an  
ordered view based on the number of votes. You could trigger the view  
update from the client each time a vote is made (i.e. add a vote  
document, call the view, update the VoteCount document and call the  
new view to get the ordered votes). You could also do this  
automatically on the CouchDB using update notifiers and simple Bash/ 
Python/Perl/whatever scripts...

HTH
Daniel