couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wout Mertens <wout.mert...@gmail.com>
Subject Re: Proposal: Review DBs
Date Tue, 28 Apr 2009 12:13:58 GMT
On Apr 28, 2009, at 11:42 AM, Brian Candler wrote:

> On Sun, Apr 26, 2009 at 11:26:33PM +0200, Wout Mertens wrote:
>> In a nutshell, I'm hoping that:
>> * A review is a new sort of view that has an "inputs" array in its
>> definition.
>> * Only MR views are allowed as inputs, no KV duplication allowed.
>> * It builds a persistent index of the incoming views when those get
>> updated.
>> * That index is then used to build the view index for the review when
>> the review gets updated.
>> * I think I covered the most important algorithms needed to implement
>> this in my original proposal.
>>
>> Does this sound feasible? If so I'll update my proposal accordingly.
>
> Could you define a bit more where the "inputs" array comes from?

Oh right, I didn't explain that well, and I never said how to add the  
group_level. Here's an example for counting the number of unique names  
in a database:

{
   "_id": "_design/foo",
   "views": {
       "unique_names": {
           "map": "function(doc) {emit(doc.name, 1);}",
           "reduce": "function(keys, values, rereduce) {return  
sum(values);}"
      }
       "unique_names_count": {
           "inputs": {"unique_names":1},
           "map": "function(doc) {emit(doc.key, 1);}",
           "reduce": "function(keys, values, rereduce) {return  
sum(values);}"
      }
   }
}

In this case, unique_names_count is the review db. It gets input from  
the unique_names view with group_level set to 1. Note that it's a hash  
since each view only gets added once and each view has a unique name.

(BTW, I can't figure out what to set group_level to if I want the  
equivalent of group=true. group_level=exact doesn't work although  
group_level=1000 might be good enough. How about allowing  
group_level=-1?)

(Also, in this case just "unique_names_count": {"inputs":  
{"unique_names":1}} would suffice since getting that view with limit=0  
will give you the total_rows number which is what was needed)

> Furthermore, if I understand rightly, these grouped reduce values  
> are *not*
> persisted anywhere (*). That is: every time you do a  
> reduce=true&group=true
> query then the entire map index is scanned for distinct keys and  
> then each
> set of values with equal keys is passed afresh to the reduce function.
>
> This means that it won't be possible to do incremental changes to  
> the review
> DB, since these grouped keys and reduce values aren't associated  
> with the
> docids they came from. You'd have to calculate it all from scratch  
> every
> time, in which case you might as well just get the client to do it.  
> AFAICS,
> CouchDB could cache the result only when the database is not  
> receiving any
> updates.

You're correct in that group reduce values are not persisted, however  
the algorithm I proposed can do incremental updates nonetheless:
=========
Group key values are calculated on-the-fly from the view result b- 
tree. So whenever a reduce call results in a new value for a b-tree  
node, AND that node is the upper node of a subtree that is completely  
part of a group key, that group key needs to be marked for  
recalculation.

Likewise, if deletion/addition of a b-tree node results in the removal/ 
creation of the sole upper node of a group key subtree, that group key  
needs to be marked for removal/addition.

This is the algorithm:
- When a reducing view gets updated, and it is part of a Review DB,  
use the 2 paragraphs above to keep a list of group keys that need  
handling
- After updating the reduce() results, for each of the marked group  
keys:
- If a group key gets removed:
   - look up doc with key=group key in review db. If exists:
     - delete doc
- If a group key gets added:
   - look up doc with key=group key in review db. If exists:
     - set doc.value to the row value
   - else
     - create doc with id=group key in string form, key=group key,  
value=value
- If a group key gets updated:
   - look up doc with key=group key in review db. If exists:
     - set doc.value to the row value
   - else
     - create doc with id=group key in string form, key=group key,  
value=value
========
Is that explanation clear?

I'm imagining that at any point on the b-tree where a reduce value is  
received, CouchDB can see if that node is one of the top nodes for a  
group key and can mark that group key somehow. I think that should be  
feasible (although I admit I've never tried to grok the code).

Cheers,

Wout.

Mime
View raw message