couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zachary Zolton <zachary.zol...@gmail.com>
Subject Re: Proposal: Review DBs
Date Wed, 22 Apr 2009 14:53:37 GMT
Such as having view definition, in the design doc, contain an array of
objects, each with the map/reduce function pair attributes?

On Wed, Apr 22, 2009 at 9:48 AM, Adam Kocoloski <kocolosk@apache.org> wrote:
> Hi Wout, thanks for writing this up.
>
> One comment about the map-only views:  I think you'll find that Couch has
> already done a good bit of the work needed to support them, too.  Couch
> maintains a btree for each design doc keyed on docid that stores all the
> view keys emitted by the maps over each document.  When a document is
> updated and then analyzed, Couch has to consult that btree, purge all the
> KVs associated with the old version of the doc from each view, and then
> insert the new KVs.  So the tracking information correlating docids and view
> keys is already available.
>
> You'd still be left with the problem of generating unique docids for the
> documents in the Review DB, but I think that's a problem that needs to be
> solved.  The restriction to only MR views with no duplicate keys across
> views seems too strong to me.
>
> With that said, I'd prefer to spend my time extending the view engine to
> handle chainable MR workflows in a single shot.  Especially in the simple
> sort_by_value case it just seems like a cleaner way to go about things.
>  Cheers,
>
> Adam
>
> On Apr 22, 2009, at 8:40 AM, Wout Mertens wrote:
>
>> Intro
>> =====
>> How do you sort by reduce value? How do you join views? How do you get
>> unique view results? How do you cache group key reduces?
>>
>> I think that with the below proposed solution all the above and more are
>> possible. The general idea is to store view results and run map/reduce on
>> them. There's been some discussions about this but they went nowhere. I've
>> been thinking about this issue a bit and I think it can be done.
>>
>> I'd like to call this feature a Review DB.
>>
>> Use cases
>> =========
>> - Suppose you want to know what tags are most popular on your blog. Simply
>> get:
>>
>>
>>  http://couchdb/db/_design/myblog/_review/tags_by_count/_view/sort_by_value
>>
>> Where tags_by_count is a Review DB that gets input from the tagcount view
>> and then runs the sort_by_value view on it, a map() function that simply
>> emits (value,key).
>>
>> Likewise, show pages in order of popularity, whereby user can vote up (+1)
>> or down (-1):
>>
>>  http://couchdb/db/_design/mywiki/_review/pagevotes/_view/sort_by_value
>>
>> - Given documents with attributes title, date and tags. You'd like to know
>> the minimum value of date and a breakdown by count for tags, for every
>> title. Normally you'd use 2 map+reduce views, minimum_date_by_title and
>> tagcount_by_title, which you would then query separately. With a Review DB,
>> you can let both views insert their results in the database and then run a
>> view that combines the results in one view:
>>
>>
>>  http://couchdb/db/_design/mybookstore/_review/mybooks/_view/aggregate_book_data
>>
>> - This is not a way to run an on-the-fly map/reduce on a subset of a view,
>> like if you want to find the median popularity score of restaurants with
>> "Tony" in their name that are close to you.
>>
>> Implementation
>> ==============
>> A Review DB is a hidden database maintained by CouchDB with these fields:
>> - _id of document is the string representation of the key
>> - "key" is the key of the incoming view row (unique)
>> - "value" is the value of the incoming view row
>>
>> I hope that this is sufficiently like a normal view that it can be stored
>> as a normal view. _id is just there to make it doc-compliant, it would be
>> much better if "key" were the actual key.
>>
>> A Review DB is defined in a design document like normal views. Each review
>> is an entry in the "reviews" hash, and has a "incoming_views" array that
>> lists all the views that should insert results in the review db plus the
>> group level, as well as a normal "views" hash for further map/reduce of the
>> review db (and perhaps another "reviews" hash for further result
>> processing?).
>>
>> Maintaining a database of results means that results have to be updated or
>> even removed when documents change. I tried to make this work (in theory)
>> for map-only views, but the resulting requirements are quite messy. You
>> either need to cache the previous results of a view for each document, or
>> you have to have an old version of the document available to regenerate
>> those results.
>>
>> Therefore, a Review DB only accepts results from one or more map+reduce
>> views. You define beforehand what the group_level of the keys is that will
>> be inserted.
>>
>> Furthermore, a Review DB disallows (but doesn't enforce) having 2 views
>> that generate the same keys. Otherwise, refcounting would need to be used
>> and while that's not difficult, I think there's limited value in allowing
>> this.
>>
>> The Review DB needs updating every time the reduction for a group key of
>> one of the participating views gets updated. Even though a map+reduce view
>> has unique keys, we need a refcount since we have multiple views. Whoever
>> got to insert its value last wins.
>>
>> There is a slight complication: group key values are calculated on-the-fly
>> from the view result b-tree. So whenever a reduce call results in a new
>> value for a b-tree node, AND that node is the upper node of a subtree that
>> is completely part of a group key, that group key needs to be marked for
>> recalculation.
>>
>> Likewise, if deletion/addition of a b-tree node results in the
>> removal/creation of the sole upper node of a group key subtree, that group
>> key needs to be marked for removal/addition.
>>
>> This is the algorithm:
>> - When a reducing view gets updated, and it is part of a Review DB, use
>> the 2 paragraphs above to keep a list of group keys that need handling
>> - After updating the reduce() results, for each of the marked group keys:
>> - If a group key gets removed:
>>  - look up doc with key=group key in review db. If exists:
>>    - delete doc
>> - If a group key gets added:
>>  - look up doc with key=group key in review db. If exists:
>>    - set doc.value to the row value
>>  - else
>>    - create doc with id=group key in string form, key=group key,
>> value=value
>> - If a group key gets updated:
>>  - look up doc with key=group key in review db. If exists:
>>    - set doc.value to the row value
>>  - else
>>    - create doc with id=group key in string form, key=group key,
>> value=value
>>
>> As you can see, this is something CouchDB should do since it knows when
>> it's updating group key reduction values and it knows if this was an delete,
>> update or addition.
>>
>> View updates are done when the view is called; Review updates are done at
>> this time as well. Views on Review DBs are done when they are called.
>>
>> Summary
>> =======
>> Review DBs are a sort of view index that CouchDB can maintain with little
>> overhead. It caches group key results and allows chained map+reduce
>> calculations using mostly existing frameworks.
>>
>> I think this would be a very useful feature for CouchDB to have. There are
>> regularly requests for storing view results in a database for
>> post-processing on the mailing lists.
>>
>> I'm not saying this is a trivial change but it doesn't seem technically
>> impossible to me either. (unless I missed something again; this is the 5th
>> iteration of this proposal. Anyway I know *I* wouldn't be able to code this
>> :-) )
>>
>> What do you think, oh dear devs?
>>
>> Wout.
>
>

Mime
View raw message