couchdb-dev mailing list archives

From Zachary Zolton <zachary.zol...@gmail.com>
Subject Re: Proposal: Review DBs
Date Wed, 22 Apr 2009 14:07:53 GMT
I would definitely +1 something like this.

I'm essentially doing something like this "manually", with an update
notifier script, and I'd love to see it become a first-class feature
instead of my special-case version of it.

On Wed, Apr 22, 2009 at 7:40 AM, Wout Mertens <wout.mertens@gmail.com> wrote:
> Intro
> =====
> How do you sort by reduce value? How do you join views? How do you get
> unique view results? How do you cache group key reduces?
>
> I think that with the below proposed solution all the above and more are
> possible. The general idea is to store view results and run map/reduce on
> them. There have been some discussions about this but they went nowhere. I've
> been thinking about this issue a bit and I think it can be done.
>
> I'd like to call this feature a Review DB.
>
> Use cases
> =========
> - Suppose you want to know what tags are most popular on your blog. Simply
> get:
>
>  http://couchdb/db/_design/myblog/_review/tags_by_count/_view/sort_by_value
>
> Where tags_by_count is a Review DB that gets input from the tagcount view
> and then runs the sort_by_value view on it, a map() function that simply
> emits (value,key).
>
> Likewise, show pages in order of popularity, where users can vote up (+1)
> or down (-1):
>
>  http://couchdb/db/_design/mywiki/_review/pagevotes/_view/sort_by_value
>
> - Given documents with attributes title, date and tags. You'd like to know
> the minimum value of date and a breakdown by count for tags, for every
> title. Normally you'd use 2 map+reduce views, minimum_date_by_title and
> tagcount_by_title, which you would then query separately. With a Review DB,
> you can let both views insert their results in the database and then run a
> view that combines the results in one view:
>
>  http://couchdb/db/_design/mybookstore/_review/mybooks/_view/aggregate_book_data
>
> - This is not a way to run an on-the-fly map/reduce on a subset of a view,
> like if you want to find the median popularity score of restaurants with
> "Tony" in their name that are close to you.
>
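A minimal sketch of the first use case, written in Python rather than as a CouchDB JavaScript map function. The row data and the helper names (run_view, sort_by_value_map) are made up for illustration; the only thing taken from the proposal is that the map emits (value, key).

```python
# Hypothetical rows, shaped like the grouped output of a tag-count reduce view.
tagcount_rows = [
    {"key": "couchdb", "value": 12},
    {"key": "erlang", "value": 7},
    {"key": "json", "value": 9},
]


def sort_by_value_map(doc):
    """Map over one Review DB doc: swap key and value, as the proposal describes."""
    yield doc["value"], doc["key"]


def run_view(docs, map_fn):
    """Stand-in for a view index: apply map_fn to every doc, sort rows by emitted key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows)


# Rows come back in ascending order of count; a descending query would reverse them.
tags_by_count = run_view(tagcount_rows, sort_by_value_map)
```

Because the emitted key is the count, the ordinary view ordering is what produces the "sorted by reduce value" result.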
> Implementation
> ==============
> A Review DB is a hidden database maintained by CouchDB with these fields:
> - _id of document is the string representation of the key
> - "key" is the key of the incoming view row (unique)
> - "value" is the value of the incoming view row
>
> I hope that this is sufficiently like a normal view that it can be stored as
> a normal view. _id is just there to make it doc-compliant, it would be much
> better if "key" were the actual key.
>
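To make the document shape concrete, here is a small Python sketch. It assumes that the "string representation" of a key is its JSON encoding, which the proposal does not specify; make_review_doc is a made-up name.

```python
import json


def make_review_doc(key, value):
    """Build a Review DB document from one incoming view row.

    Assumption: the _id is the JSON encoding of the key.
    """
    return {"_id": json.dumps(key), "key": key, "value": value}


# Example row with an array key, as a grouped reduce view might produce.
doc = make_review_doc(["couchdb", 2009], 12)
```

Any stable, unique encoding of the key would do for _id; JSON is just convenient here.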
> A Review DB is defined in a design document like normal views. Each review
> is an entry in the "reviews" hash, and has an "incoming_views" array that
> lists all the views that should insert results in the review db plus the
> group level, as well as a normal "views" hash for further map/reduce of the
> review db (and perhaps another "reviews" hash for further result
> processing?).
>
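For illustration, this is roughly how I picture such a design document, built as a Python dict and dumped to JSON. Only "reviews", "incoming_views" and "views" come from the proposal; the "view" and "group_level" field names inside "incoming_views" are my guess, and no CouchDB release understands any of this.

```python
import json

# Hypothetical design document; the "reviews" section is the proposed feature.
design_doc = {
    "_id": "_design/myblog",
    "views": {
        "tagcount": {
            "map": "function(doc) { (doc.tags || []).forEach(function(t) { emit(t, 1); }); }",
            "reduce": "function(keys, values) { return sum(values); }",
        }
    },
    "reviews": {
        "tags_by_count": {
            # Assumed entry shape: which view feeds the review, at which group level.
            "incoming_views": [{"view": "tagcount", "group_level": 1}],
            "views": {
                "sort_by_value": {
                    "map": "function(doc) { emit(doc.value, doc.key); }"
                }
            },
        }
    },
}

design_doc_json = json.dumps(design_doc, indent=2)
```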
> Maintaining a database of results means that results have to be updated or
> even removed when documents change. I tried to make this work (in theory)
> for map-only views, but the resulting requirements are quite messy. You
> either need to cache the previous results of a view for each document, or
> you have to have an old version of the document available to regenerate
> those results.
>
> Therefore, a Review DB only accepts results from one or more map+reduce
> views. You define beforehand what the group_level of the keys is that will
> be inserted.
>
> Furthermore, a Review DB does not support (but doesn't check for) 2 views that
> generate the same keys. Otherwise, refcounting would need to be used and
> while that's not difficult, I think there's limited value in allowing this.
>
> The Review DB needs updating every time the reduction for a group key of one
> of the participating views gets updated. Even though a single map+reduce view
> has unique keys, several views may produce the same key. Since no refcount is
> kept, whoever inserts its value last wins.
>
> There is a slight complication: group key values are calculated on-the-fly
> from the view result b-tree. So whenever a reduce call results in a new
> value for a b-tree node, AND that node is the upper node of a subtree that
> is completely part of a group key, that group key needs to be marked for
> recalculation.
>
> Likewise, if deletion/addition of a b-tree node results in the
> removal/creation of the sole upper node of a group key subtree, that group
> key needs to be marked for removal/addition.
>
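A simplified Python sketch of the marking step. It does not model the b-tree at all: it only shows which group keys a set of changed row keys maps to, using the usual group_level rule of truncating array keys. The function names are made up.

```python
def group_key(key, group_level):
    """Truncate an array key to its first group_level elements.

    Non-array keys are treated as a group of their own.
    """
    if isinstance(key, list):
        return key[:group_level]
    return key


def affected_group_keys(changed_keys, group_level):
    """Return each distinct group key touched by the changed rows, in first-seen order."""
    marked = []
    for key in changed_keys:
        gk = group_key(key, group_level)
        if gk not in marked:
            marked.append(gk)
    return marked


# Hypothetical changed rows from a view keyed by [title, tag].
marked = affected_group_keys(
    [["Dune", "scifi"], ["Dune", "classic"], ["Emma", "classic"]], 1
)
```

Whether a marked key ends up as an addition, update or removal still has to come from the b-tree bookkeeping described in the quoted paragraphs.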
> This is the algorithm:
> - When a reducing view gets updated, and it is part of a Review DB, use the
> 2 paragraphs above to keep a list of group keys that need handling
> - After updating the reduce() results, for each of the marked group keys:
>  - If a group key gets removed:
>   - look up doc with key=group key in review db. If exists:
>     - delete doc
>  - If a group key gets added:
>   - look up doc with key=group key in review db. If exists:
>     - set doc.value to the row value
>   - else
>     - create doc with id=group key in string form, key=group key, value=value
>  - If a group key gets updated:
>   - look up doc with key=group key in review db. If exists:
>     - set doc.value to the row value
>   - else
>     - create doc with id=group key in string form, key=group key, value=value
>
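The steps above can be sketched in Python like this. The Review DB is modelled as a plain dict from _id to doc, the _id is assumed to be the JSON encoding of the group key, and the (action, key, value) tuples are a made-up way of representing the marked group keys.

```python
import json


def apply_group_key_changes(review_db, changes):
    """Apply marked group-key changes to an in-memory Review DB.

    review_db: dict mapping _id -> doc
    changes:   iterable of (action, key, value), action being
               "added", "updated" or "removed"
    """
    for action, key, value in changes:
        doc_id = json.dumps(key)  # assumed string form of the group key
        if action == "removed":
            # Delete the doc if it exists; a missing doc is not an error.
            review_db.pop(doc_id, None)
        elif action in ("added", "updated"):
            doc = review_db.get(doc_id)
            if doc is not None:
                doc["value"] = value
            else:
                review_db[doc_id] = {"_id": doc_id, "key": key, "value": value}
        else:
            raise ValueError("unknown action: %r" % (action,))
    return review_db


review_db = {}
apply_group_key_changes(review_db, [("added", "couchdb", 1), ("added", "erlang", 3)])
apply_group_key_changes(review_db, [("updated", "couchdb", 2), ("removed", "erlang", None)])
```

Added and updated keys take the same path, which matches the listing: both amount to an upsert of the row value.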
> As you can see, this is something CouchDB should do since it knows when it's
> updating group key reduction values and it knows if this was a delete,
> update or addition.
>
> View updates are done when the view is called; Review updates are done at
> this time as well. Views on Review DBs are likewise updated when they are
> called.
>
> Summary
> =======
> Review DBs are a sort of view index that CouchDB can maintain with little
> overhead. They cache group key results and allow chained map+reduce
> calculations using mostly existing frameworks.
>
> I think this would be a very useful feature for CouchDB to have. There are
> regularly requests for storing view results in a database for
> post-processing on the mailing lists.
>
> I'm not saying this is a trivial change but it doesn't seem technically
> impossible to me either. (unless I missed something again; this is the 5th
> iteration of this proposal. Anyway I know *I* wouldn't be able to code this
> :-) )
>
> What do you think, oh dear devs?
>
> Wout.
>
