couchdb-dev mailing list archives

From Zachary Zolton <zachary.zol...@gmail.com>
Subject Re: Proposal: Review DBs
Date Wed, 22 Apr 2009 14:11:18 GMT
This also sounds related to the "View Intersections" proposal
discussed earlier...

http://mail-archives.apache.org/mod_mbox/couchdb-dev/200904.mbox/%3c011A0D62-06C2-490B-A4C4-7EEF0203B6C3@gmail.com%3e


On Wed, Apr 22, 2009 at 9:07 AM, Zachary Zolton
<zachary.zolton@gmail.com> wrote:
> I would definitely +1 something like this.
>
> I'm essentially doing something like this "manually", with an update
> notifier script, and I'd love to see it become a first-class feature,
> instead of my special-case version of it.
>
> On Wed, Apr 22, 2009 at 7:40 AM, Wout Mertens <wout.mertens@gmail.com> wrote:
>> Intro
>> =====
>> How do you sort by reduce value? How do you join views? How do you get
>> unique view results? How do you cache group key reduces?
>>
>> I think that with the solution proposed below, all of the above and more
>> are possible. The general idea is to store view results and run map/reduce
>> on them. There have been some discussions about this, but they went
>> nowhere. I've been thinking about this issue a bit and I think it can be done.
>>
>> I'd like to call this feature a Review DB.
>>
>> Use cases
>> =========
>> - Suppose you want to know what tags are most popular on your blog. Simply
>> get:
>>
>>  http://couchdb/db/_design/myblog/_review/tags_by_count/_view/sort_by_value
>>
>> Where tags_by_count is a Review DB that gets input from the tagcount view
>> and then runs the sort_by_value view on it, a map() function that simply
>> emits (value,key).
>>
>> Likewise, show pages in order of popularity, whereby user can vote up (+1)
>> or down (-1):
>>
>>  http://couchdb/db/_design/mywiki/_review/pagevotes/_view/sort_by_value
>>
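
A minimal sketch of what the sort_by_value view would compute, written
in Python rather than as a real CouchDB view. The sample tag counts and
the run_view helper are made up for illustration, and Python's sorted()
only stands in for CouchDB's view collation. The "key"/"value" document
fields are the ones described under Implementation in the proposal.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Row = Tuple[Any, Any]


def sort_by_value_map(doc: dict) -> Iterator[Row]:
    """Map step: swap key and value so the index is ordered by value."""
    yield (doc["value"], doc["key"])


def run_view(docs: Iterable[dict], map_fn: Callable[[dict], Iterator[Row]]) -> List[Row]:
    """Hypothetical stand-in for a view index: collect emitted rows, sort by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])


# Made-up contents of the tags_by_count Review DB (one doc per group key).
review_docs = [
    {"_id": "couchdb", "key": "couchdb", "value": 12},
    {"_id": "erlang", "key": "erlang", "value": 5},
    {"_id": "json", "key": "json", "value": 8},
]

rows = run_view(review_docs, sort_by_value_map)
for count, tag in rows:
    print(count, tag)
```

The most popular tag ends up last, so a descending query over such an
index would return tags in order of popularity.
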
>> - Suppose you have documents with attributes title, date and tags, and
>> you'd like to know the minimum date and a breakdown of tags by count for
>> every title. Normally you'd use 2 map+reduce views, minimum_date_by_title and
>> tagcount_by_title, which you would then query separately. With a Review DB,
>> you can let both views insert their results in the database and then run a
>> view that combines the results in one view:
>>
>>  http://couchdb/db/_design/mybookstore/_review/mybooks/_view/aggregate_book_data
>>
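
A rough sketch of how aggregate_book_data could combine both inputs,
again in Python. The proposal does not say what the keys of the two
incoming views look like, so this assumes minimum_date_by_title emits
[title] and tagcount_by_title emits [title, tag] (which also keeps the
keys of the two views distinct). The book data is invented.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def aggregate_map(doc: dict) -> Iterator[Tuple[str, dict]]:
    """Map step: emit one partial record per Review DB doc, keyed by title."""
    key = doc["key"]
    if len(key) == 1:  # assumed shape of a minimum_date_by_title row
        yield key[0], {"min_date": doc["value"], "tags": {}}
    else:  # assumed shape of a tagcount_by_title row: [title, tag]
        title, tag = key
        yield title, {"min_date": None, "tags": {tag: doc["value"]}}


def aggregate_reduce(values: List[dict]) -> dict:
    """Reduce step: merge partial records for one title."""
    merged: Dict[str, Any] = {"min_date": None, "tags": {}}
    for v in values:
        d = v["min_date"]
        if d is not None and (merged["min_date"] is None or d < merged["min_date"]):
            merged["min_date"] = d
        merged["tags"].update(v["tags"])
    return merged


def run_view(docs: Iterable[dict]) -> Dict[str, dict]:
    """Hypothetical stand-in for querying the view with group=true."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for doc in docs:
        for k, v in aggregate_map(doc):
            grouped[k].append(v)
    return {k: aggregate_reduce(vs) for k, vs in grouped.items()}


# Made-up contents of the mybooks Review DB.
review_docs = [
    {"key": ["Dune"], "value": "1965-08-01"},
    {"key": ["Dune", "scifi"], "value": 2},
    {"key": ["Dune", "classic"], "value": 1},
    {"key": ["Emma"], "value": "1815-12-23"},
]

result = run_view(review_docs)
print(result)
```
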
>> - This is not a way to run an on-the-fly map/reduce on a subset of a view,
>> like if you want to find the median popularity score of restaurants with
>> "Tony" in their name that are close to you.
>>
>> Implementation
>> ==============
>> A Review DB is a hidden database maintained by CouchDB with these fields:
>> - _id of document is the string representation of the key
>> - "key" is the key of the incoming view row (unique)
>> - "value" is the value of the incoming view row
>>
>> I hope that this is sufficiently like a normal view that it can be stored as
>> a normal view. _id is just there to make it doc-compliant; it would be much
>> better if "key" were the actual key.
>>
>> A Review DB is defined in a design document like normal views. Each review
>> is an entry in the "reviews" hash, and has an "incoming_views" array that
>> lists all the views that should insert results in the review db plus the
>> group level, as well as a normal "views" hash for further map/reduce of the
>> review db (and perhaps another "reviews" hash for further result
>> processing?).
>>
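
To make the layout concrete, here is one way such a design document
might look, built as a Python dict and serialized to JSON. Only the
"reviews", "incoming_views" and "views" names come from the proposal.
The shape of each incoming_views entry (a "view" name plus a
"group_level") and the tagcount map/reduce sources are assumptions.

```python
import json

design_doc = {
    "_id": "_design/myblog",
    "views": {
        # Ordinary map+reduce view that feeds the Review DB (assumed source).
        "tagcount": {
            "map": "function(doc) { (doc.tags || []).forEach(function(t) { emit(t, 1); }); }",
            "reduce": "function(keys, values) { return sum(values); }",
        }
    },
    "reviews": {
        "tags_by_count": {
            # Assumed entry shape: which view inserts results, at which group level.
            "incoming_views": [{"view": "tagcount", "group_level": 1}],
            # Normal "views" hash, run over the Review DB documents.
            "views": {
                "sort_by_value": {
                    "map": "function(doc) { emit(doc.value, doc.key); }"
                }
            },
        }
    },
}

print(json.dumps(design_doc, indent=2))
```
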
>> Maintaining a database of results means that results have to be updated or
>> even removed when documents change. I tried to make this work (in theory)
>> for map-only views, but the resulting requirements are quite messy. You
>> either need to cache the previous results of a view for each document, or
>> you have to have an old version of the document available to regenerate
>> those results.
>>
>> Therefore, a Review DB only accepts results from one or more map+reduce
>> views. You define beforehand what the group_level of the keys is that will
>> be inserted.
>>
>> Furthermore, a Review DB disallows (but doesn't enforce) having 2 views that
>> generate the same keys. Otherwise, refcounting would need to be used and
>> while that's not difficult, I think there's limited value in allowing this.
>>
>> The Review DB needs updating every time the reduction for a group key of one
>> of the participating views gets updated. Even though each map+reduce view
>> has unique keys, two views could still produce the same key; without a
>> refcount, whoever inserts its value last wins.
>>
>> There is a slight complication: group key values are calculated on-the-fly
>> from the view result b-tree. So whenever a reduce call results in a new
>> value for a b-tree node, AND that node is the upper node of a subtree that
>> is completely part of a group key, that group key needs to be marked for
>> recalculation.
>>
>> Likewise, if deletion/addition of a b-tree node results in the
>> removal/creation of the sole upper node of a group key subtree, that group
>> key needs to be marked for removal/addition.
>>
>> This is the algorithm:
>> - When a reducing view gets updated, and it is part of a Review DB, use the
>> 2 paragraphs above to keep a list of group keys that need handling
>> - After updating the reduce() results, for each of the marked group keys:
>>  - If a group key gets removed:
>>   - look up doc with key=group key in review db. If exists:
>>     - delete doc
>>  - If a group key gets added:
>>   - look up doc with key=group key in review db. If exists:
>>     - set doc.value to the row value
>>   - else
>>     - create doc with id=group key in string form, key=group key,
>> value=value
>>  - If a group key gets updated:
>>   - look up doc with key=group key in review db. If exists:
>>     - set doc.value to the row value
>>   - else
>>     - create doc with id=group key in string form, key=group key,
>> value=value
>>
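
The update steps above can be sketched against an in-memory dict that
stands in for the Review DB. This is an illustration only: the change
tuples, key_to_id and apply_changes are hypothetical names, JSON is
just one possible "string form" of a group key, and the lookup by
key is done through the derived _id.

```python
import json
from typing import Any, Dict, Iterable, Tuple

# (kind, group_key, value); kind is "removed", "added" or "updated".
Change = Tuple[str, Any, Any]


def key_to_id(group_key: Any) -> str:
    """Assumed string form of a group key: its JSON encoding."""
    return json.dumps(group_key, sort_keys=True)


def apply_changes(review_db: Dict[str, dict], changes: Iterable[Change]) -> None:
    """Apply the marked group-key changes to the review db."""
    for kind, group_key, value in changes:
        doc_id = key_to_id(group_key)
        if kind == "removed":
            # Delete the doc if it exists, otherwise do nothing.
            review_db.pop(doc_id, None)
        elif kind in ("added", "updated"):
            doc = review_db.get(doc_id)
            if doc is not None:
                doc["value"] = value  # existing doc: overwrite the value
            else:
                review_db[doc_id] = {"_id": doc_id, "key": group_key, "value": value}
        else:
            raise ValueError("unknown change kind: %r" % (kind,))


review_db: Dict[str, dict] = {}
apply_changes(review_db, [("added", ["couchdb"], 3), ("added", ["erlang"], 1)])
apply_changes(review_db, [("updated", ["couchdb"], 4), ("removed", ["erlang"], None)])
print(review_db)
```

Added and updated keys take the same path here, which mirrors the two
identical branches in the list above.
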
>> As you can see, this is something CouchDB should do since it knows when it's
>> updating group key reduction values and it knows if this was a delete,
>> update or addition.
>>
>> View updates are done when the view is called; Review updates are done at
>> this time as well. Views on Review DBs are updated when they are called.
>>
>> Summary
>> =======
>> Review DBs are a sort of view index that CouchDB can maintain with little
>> overhead. They cache group key results and allow chained map+reduce
>> calculations using mostly existing frameworks.
>>
>> I think this would be a very useful feature for CouchDB to have. Requests
>> for storing view results in a database for post-processing come up
>> regularly on the mailing lists.
>>
>> I'm not saying this is a trivial change but it doesn't seem technically
>> impossible to me either. (unless I missed something again; this is the 5th
>> iteration of this proposal. Anyway I know *I* wouldn't be able to code this
>> :-) )
>>
>> What do you think, oh dear devs?
>>
>> Wout.
>>
>
