couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Kocoloski <kocol...@apache.org>
Subject Re: Proposal: Review DBs
Date Tue, 28 Apr 2009 13:46:54 GMT
On Apr 28, 2009, at 8:13 AM, Wout Mertens wrote:

> On Apr 28, 2009, at 11:42 AM, Brian Candler wrote:
>
>> Furthermore, if I understand rightly, these grouped reduce values  
>> are *not*
>> persisted anywhere (*). That is: every time you do a  
>> reduce=true&group=true
>> query then the entire map index is scanned for distinct keys and  
>> then each
>> set of values with equal keys is passed afresh to the reduce  
>> function.
>>
>> This means that it won't be possible to do incremental changes to  
>> the review
>> DB, since these grouped keys and reduce values aren't associated  
>> with the
>> docids they came from. You'd have to calculate it all from scratch  
>> every
>> time, in which case you might as well just get the client to do it.  
>> AFAICS,
>> CouchDB could cache the result only when the database is not  
>> receiving any
>> updates.
>
> You're correct in that group reduce values are not persisted,  
> however the algorithm I proposed can do incremental updates  
> nonetheless:
> =========
> Group key values are calculated on-the-fly from the view result b- 
> tree. So whenever a reduce call results in a new value for a b-tree  
> node, AND that node is the upper node of a subtree that is  
> completely part of a group key, that group key needs to be marked  
> for recalculation.
>
> Likewise, if deletion/addition of a b-tree node results in the  
> removal/creation of the sole upper node of a group key subtree, that  
> group key needs to be marked for removal/addition.
>
> This is the algorithm:
> - When a reducing view gets updated, and it is part of a Review DB,  
> use the 2 paragraphs above to keep a list of group keys that need  
> handling
> - After updating the reduce() results, for each of the marked group  
> keys:
> - If a group key gets removed:
>  - look up doc with key=group key in review db. If exists:
>    - delete doc
> - If a group key gets added:
>  - look up doc with key=group key in review db. If exists:
>    - set doc.value to the row value
>  - else
>    - create doc with id=group key in string form, key=group key,  
> value=value
> - If a group key gets updated:
>  - look up doc with key=group key in review db. If exists:
>    - set doc.value to the row value
>  - else
>    - create doc with id=group key in string form, key=group key,  
> value=value
> ========
> Is that explanation clear?
>
> I'm imagining that at any point on the b-tree where a reduce value  
> is received, CouchDB can see if that node is one of the top nodes  
> for a group key and can mark that group key somehow. I think that  
> should be feasible (although I admit I've never tried to grok the  
> code).
>
> Cheers,
>
> Wout.

I'm not an expert on the btree code, but I think this statement

> So whenever a reduce call results in a new value for a b-tree node,  
> AND that node is the upper node of a subtree that is completely part  
> of a group key, that group key needs to be marked for recalculation.


misses some updates.  I thought things looked something like

           root
             |
      ---------------
     kp1           kp2
      |             |
  ---------     ---------
foo foo bar   bar bar bar

where foo and bar are emitted keys from a map (values not show for the  
sake of brevity).  kp1 and kp2 hold reduce(foo,foo,bar) and  
reduce(bar,bar,bar) respectively, and root holds rereduce(kp1,kp2).

When we request a view with group=true, Couch is smart enough to use  
the reduction stored in kp2 directly, but it has to call  
reduce(foo,foo) and reduce(bar) on the fly for the nodes underneath  
kp1 (and then rereduce the bar reductions from kp1 and kp2 to get the  
result for bar).  If I interpreted Wout's statement correctly, it  
ignores the case where any of the nodes under kp1 change, since kp1 is  
not "the upper node of a subtree that is completely part of a group  
key".

I think the problem of tracking which group keys to update can be made  
simpler.  We really only need to see the key updates coming out of the  
incremental map.  Couch knows the lists of keys to add and keys to  
remove at that point; the set of all unique keys in those two lists  
(the "changeset") is the set of group keys that would need to be  
updated in the Review DB.  Here I'm using "update" in the general  
sense of add/remove/change; the way I see it, we could just query the  
view for all the keys in the changeset.  If the MR view has no results  
for a given key, that obviously means delete the associated document  
from the Review DB.

group_levels are a straightforward extension -- we just check what  
group_level we'll be using in the Review DB when we calculate the  
changeset.

Regards, Adam


Mime
View raw message