couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Kocoloski <kocol...@apache.org>
Subject Re: Proposal: Review DBs
Date Wed, 22 Apr 2009 18:31:29 GMT
Hi Zachary, something like that.  The more I think about the problem  
the more I converge on a solution like what Wout has proposed.  Some  
quick thoughts:

* Remapping the output of a map isn't terribly useful.  All the power  
is in remapping the output of a reduction.

* Incremental generation of some complex multi-MR view still requires  
persisting the output of each step individually, even if you're only  
interested in the final result.  At least, I don't yet see a clever  
way around it.

* Dumping the output of the first MR(s) into a Review DB is an easy  
way to take advantage of code that's already written, but it's a bit  
wasteful.  We could just take the results directly from the view  
btree(s) and send them to the next step in the workflow.

* I'm not yet sold on the HTTP API in Wout's proposal.  I think I'd  
prefer to keep the existing API, and in the _design doc specify the  
full workflow required to generate a given view.

Cheers, Adam

On Apr 22, 2009, at 10:53 AM, Zachary Zolton wrote:

> Such as having view definition, in the design doc, contain an array of
> objects, each with the map/reduce function pair attributes?
>
> On Wed, Apr 22, 2009 at 9:48 AM, Adam Kocoloski  
> <kocolosk@apache.org> wrote:
>> Hi Wout, thanks for writing this up.
>>
>> One comment about the map-only views:  I think you'll find that  
>> Couch has
>> already done a good bit of the work needed to support them, too.   
>> Couch
>> maintains a btree for each design doc keyed on docid that stores  
>> all the
>> view keys emitted by the maps over each document.  When a document is
>> updated and then analyzed, Couch has to consult that btree, purge  
>> all the
>> KVs associated with the old version of the doc from each view, and  
>> then
>> insert the new KVs.  So the tracking information correlating docids  
>> and view
>> keys is already available.
>>
>> You'd still be left with the problem of generating unique docids  
>> for the
>> documents in the Review DB, but I think that's a problem that needs  
>> to be
>> solved.  The restriction to only MR views with no duplicate keys  
>> across
>> views seems too strong to me.
>>
>> With that said, I'd prefer to spend my time extending the view  
>> engine to
>> handle chainable MR workflows in a single shot.  Especially in the  
>> simple
>> sort_by_value case it just seems like a cleaner way to go about  
>> things.
>>  Cheers,
>>
>> Adam
>>
>> On Apr 22, 2009, at 8:40 AM, Wout Mertens wrote:
>>
>>> Intro
>>> =====
>>> How do you sort by reduce value? How do you join views? How do you  
>>> get
>>> unique view results? How do you cache group key reduces?
>>>
>>> I think that with the below proposed solution all the above and  
>>> more are
>>> possible. The general idea is to store view results and run map/ 
>>> reduce on
>>> them. There's been some discussions about this but they went  
>>> nowhere. I've
>>> been thinking about this issue a bit and I think it can be done.
>>>
>>> I'd like to call this feature a Review DB.
>>>
>>> Use cases
>>> =========
>>> - Suppose you want to know what tags are most popular on your  
>>> blog. Simply
>>> get:
>>>
>>>
>>>  http://couchdb/db/_design/myblog/_review/tags_by_count/_view/sort_by_value
>>>
>>> Where tags_by_count is a Review DB that gets input from the  
>>> tagcount view
>>> and then runs the sort_by_value view on it, a map() function that  
>>> simply
>>> emits (value,key).
>>>
>>> Likewise, show pages in order of popularity, whereby user can vote  
>>> up (+1)
>>> or down (-1):
>>>
>>>  http://couchdb/db/_design/mywiki/_review/pagevotes/_view/sort_by_value
>>>
>>> - Given documents with attributes title, date and tags. You'd like  
>>> to know
>>> the minimum value of date and a breakdown by count for tags, for  
>>> every
>>> title. Normally you'd use 2 map+reduce views,  
>>> minimum_date_by_title and
>>> tagcount_by_title, which you would then query separately. With a  
>>> Review DB,
>>> you can let both views insert their results in the database and  
>>> then run a
>>> view that combines the results in one view:
>>>
>>>
>>>  http://couchdb/db/_design/mybookstore/_review/mybooks/_view/aggregate_book_data
>>>
>>> - This is not a way to run an on-the-fly map/reduce on a subset of  
>>> a view,
>>> like if you want to find the median popularity score of  
>>> restaurants with
>>> "Tony" in their name that are close to you.
>>>
>>> Implementation
>>> ==============
>>> A Review DB is a hidden database maintained by CouchDB with these  
>>> fields:
>>> - _id of document is the string representation of the key
>>> - "key" is the key of the incoming view row (unique)
>>> - "value" is the value of the incoming view row
>>>
>>> I hope that this is sufficiently like a normal view that it can be  
>>> stored
>>> as a normal view. _id is just there to make it doc-compliant, it  
>>> would be
>>> much better if "key" were the actual key.
>>>
>>> A Review DB is defined in a design document like normal views.  
>>> Each review
>>> is an entry in the "reviews" hash, and has a "incoming_views"  
>>> array that
>>> lists all the views that should insert results in the review db  
>>> plus the
>>> group level, as well as a normal "views" hash for further map/ 
>>> reduce of the
>>> review db (and perhaps another "reviews" hash for further result
>>> processing?).
>>>
>>> Maintaining a database of results means that results have to be  
>>> updated or
>>> even removed when documents change. I tried to make this work (in  
>>> theory)
>>> for map-only views, but the resulting requirements are quite  
>>> messy. You
>>> either need to cache the previous results of a view for each  
>>> document, or
>>> you have to have an old version of the document available to  
>>> regenerate
>>> those results.
>>>
>>> Therefore, a Review DB only accepts results from one or more map 
>>> +reduce
>>> views. You define beforehand what the group_level of the keys is  
>>> that will
>>> be inserted.
>>>
>>> Furthermore, a Review DB disallows (but doesn't enforce) having 2  
>>> views
>>> that generate the same keys. Otherwise, refcounting would need to  
>>> be used
>>> and while that's not difficult, I think there's limited value in  
>>> allowing
>>> this.
>>>
>>> The Review DB needs updating every time the reduction for a group  
>>> key of
>>> one of the participating views gets updated. Even though a map 
>>> +reduce view
>>> has unique keys, we need a refcount since we have multiple views.  
>>> Whoever
>>> got to insert its value last wins.
>>>
>>> There is a slight complication: group key values are calculated on- 
>>> the-fly
>>> from the view result b-tree. So whenever a reduce call results in  
>>> a new
>>> value for a b-tree node, AND that node is the upper node of a  
>>> subtree that
>>> is completely part of a group key, that group key needs to be  
>>> marked for
>>> recalculation.
>>>
>>> Likewise, if deletion/addition of a b-tree node results in the
>>> removal/creation of the sole upper node of a group key subtree,  
>>> that group
>>> key needs to be marked for removal/addition.
>>>
>>> This is the algorithm:
>>> - When a reducing view gets updated, and it is part of a Review  
>>> DB, use
>>> the 2 paragraphs above to keep a list of group keys that need  
>>> handling
>>> - After updating the reduce() results, for each of the marked  
>>> group keys:
>>> - If a group key gets removed:
>>>  - look up doc with key=group key in review db. If exists:
>>>    - delete doc
>>> - If a group key gets added:
>>>  - look up doc with key=group key in review db. If exists:
>>>    - set doc.value to the row value
>>>  - else
>>>    - create doc with id=group key in string form, key=group key,
>>> value=value
>>> - If a group key gets updated:
>>>  - look up doc with key=group key in review db. If exists:
>>>    - set doc.value to the row value
>>>  - else
>>>    - create doc with id=group key in string form, key=group key,
>>> value=value
>>>
>>> As you can see, this is something CouchDB should do since it knows  
>>> when
>>> it's updating group key reduction values and it knows if this was  
>>> an delete,
>>> update or addition.
>>>
>>> View updates are done when the view is called; Review updates are  
>>> done at
>>> this time as well. Views on Review DBs are done when they are  
>>> called.
>>>
>>> Summary
>>> =======
>>> Review DBs are a sort of view index that CouchDB can maintain with  
>>> little
>>> overhead. It caches group key results and allows chained map+reduce
>>> calculations using mostly existing frameworks.
>>>
>>> I think this would be a very useful feature for CouchDB to have.  
>>> There are
>>> regularly requests for storing view results in a database for
>>> post-processing on the mailing lists.
>>>
>>> I'm not saying this is a trivial change but it doesn't seem  
>>> technically
>>> impossible to me either. (unless I missed something again; this is  
>>> the 5th
>>> iteration of this proposal. Anyway I know *I* wouldn't be able to  
>>> code this
>>> :-) )
>>>
>>> What do you think, oh dear devs?
>>>
>>> Wout.
>>
>>


Mime
View raw message