couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Proposal: Review DBs
Date Thu, 23 Apr 2009 00:52:10 GMT
Durring #couchhack we kicked around some ideas on how to get chainable
map reduce. I came in angling on a full work flow specified in a
single design doc that would basically be the current view API but on
sterroids.

The other idea that was presented fairly convincingly was this idea of
persisting the output of views into a second DB and then running more
map/reduce operations on the derived database. The main characteristic
of this method would be that it could be implemented in an external
library (though for efficiency would probably be Erlang). The main
idea was that there would be little if any alteration to the current
view mechanism.

Wout's proposal definitely shares alot of characteristics with the
second proposal. I would argue as Adam does that changing the HTTP API
or even adding to it just creates more confusion. With the recent
release of Hovercraft we have a great point to start adding the
necessary view bits to actually do all of this persistence.

I'm still kicking around ideas on how I might approach such an
implementation. There are some obvious downfalls that Wout has come up
on. Specifically, some obvious implementations would place some
arduous constraints on emitted keys (namely: uniqueness, type must be
string, string must not start with an underscore).

So there are some things to contemplate, but I think the general idea
is pretty solid. Also as we start putting some serious effort into
clustering CouchDB there's the eyebrow raising aspect that if we
persist to DB's we might be able to leverage a lot of that for some
added awesomeness.

Anyway, those are just some raw thoughts.

Paul Davis

On Wed, Apr 22, 2009 at 2:31 PM, Adam Kocoloski <kocolosk@apache.org> wrote:
> Hi Zachary, something like that.  The more I think about the problem the
> more I converge on a solution like what Wout has proposed.  Some quick
> thoughts:
>
> * Remapping the output of a map isn't terribly useful.  All the power is in
> remapping the output of a reduction.
>
> * Incremental generation of some complex multi-MR view still requires
> persisting the output of each step individually, even if you're only
> interested in the final result.  At least, I don't yet see a clever way
> around it.
>
> * Dumping the output of the first MR(s) into a Review DB is an easy way to
> take advantage of code that's already written, but it's a bit wasteful.  We
> could just take the results directly from the view btree(s) and send them to
> the next step in the workflow.
>
> * I'm not yet sold on the HTTP API in Wout's proposal.  I think I'd prefer
> to keep the existing API, and in the _design doc specify the full workflow
> required to generate a given view.
>
> Cheers, Adam
>
> On Apr 22, 2009, at 10:53 AM, Zachary Zolton wrote:
>
>> Such as having view definition, in the design doc, contain an array of
>> objects, each with the map/reduce function pair attributes?
>>
>> On Wed, Apr 22, 2009 at 9:48 AM, Adam Kocoloski <kocolosk@apache.org>
>> wrote:
>>>
>>> Hi Wout, thanks for writing this up.
>>>
>>> One comment about the map-only views:  I think you'll find that Couch has
>>> already done a good bit of the work needed to support them, too.  Couch
>>> maintains a btree for each design doc keyed on docid that stores all the
>>> view keys emitted by the maps over each document.  When a document is
>>> updated and then analyzed, Couch has to consult that btree, purge all the
>>> KVs associated with the old version of the doc from each view, and then
>>> insert the new KVs.  So the tracking information correlating docids and
>>> view
>>> keys is already available.
>>>
>>> You'd still be left with the problem of generating unique docids for the
>>> documents in the Review DB, but I think that's a problem that needs to be
>>> solved.  The restriction to only MR views with no duplicate keys across
>>> views seems too strong to me.
>>>
>>> With that said, I'd prefer to spend my time extending the view engine to
>>> handle chainable MR workflows in a single shot.  Especially in the simple
>>> sort_by_value case it just seems like a cleaner way to go about things.
>>>  Cheers,
>>>
>>> Adam
>>>
>>> On Apr 22, 2009, at 8:40 AM, Wout Mertens wrote:
>>>
>>>> Intro
>>>> =====
>>>> How do you sort by reduce value? How do you join views? How do you get
>>>> unique view results? How do you cache group key reduces?
>>>>
>>>> I think that with the below proposed solution all the above and more are
>>>> possible. The general idea is to store view results and run map/reduce
>>>> on
>>>> them. There's been some discussions about this but they went nowhere.
>>>> I've
>>>> been thinking about this issue a bit and I think it can be done.
>>>>
>>>> I'd like to call this feature a Review DB.
>>>>
>>>> Use cases
>>>> =========
>>>> - Suppose you want to know what tags are most popular on your blog.
>>>> Simply
>>>> get:
>>>>
>>>>
>>>>
>>>>  http://couchdb/db/_design/myblog/_review/tags_by_count/_view/sort_by_value
>>>>
>>>> Where tags_by_count is a Review DB that gets input from the tagcount
>>>> view
>>>> and then runs the sort_by_value view on it, a map() function that simply
>>>> emits (value,key).
>>>>
>>>> Likewise, show pages in order of popularity, whereby user can vote up
>>>> (+1)
>>>> or down (-1):
>>>>
>>>>  http://couchdb/db/_design/mywiki/_review/pagevotes/_view/sort_by_value
>>>>
>>>> - Given documents with attributes title, date and tags. You'd like to
>>>> know
>>>> the minimum value of date and a breakdown by count for tags, for every
>>>> title. Normally you'd use 2 map+reduce views, minimum_date_by_title and
>>>> tagcount_by_title, which you would then query separately. With a Review
>>>> DB,
>>>> you can let both views insert their results in the database and then run
>>>> a
>>>> view that combines the results in one view:
>>>>
>>>>
>>>>
>>>>  http://couchdb/db/_design/mybookstore/_review/mybooks/_view/aggregate_book_data
>>>>
>>>> - This is not a way to run an on-the-fly map/reduce on a subset of a
>>>> view,
>>>> like if you want to find the median popularity score of restaurants with
>>>> "Tony" in their name that are close to you.
>>>>
>>>> Implementation
>>>> ==============
>>>> A Review DB is a hidden database maintained by CouchDB with these
>>>> fields:
>>>> - _id of document is the string representation of the key
>>>> - "key" is the key of the incoming view row (unique)
>>>> - "value" is the value of the incoming view row
>>>>
>>>> I hope that this is sufficiently like a normal view that it can be
>>>> stored
>>>> as a normal view. _id is just there to make it doc-compliant, it would
>>>> be
>>>> much better if "key" were the actual key.
>>>>
>>>> A Review DB is defined in a design document like normal views. Each
>>>> review
>>>> is an entry in the "reviews" hash, and has a "incoming_views" array that
>>>> lists all the views that should insert results in the review db plus the
>>>> group level, as well as a normal "views" hash for further map/reduce of
>>>> the
>>>> review db (and perhaps another "reviews" hash for further result
>>>> processing?).
>>>>
>>>> Maintaining a database of results means that results have to be updated
>>>> or
>>>> even removed when documents change. I tried to make this work (in
>>>> theory)
>>>> for map-only views, but the resulting requirements are quite messy. You
>>>> either need to cache the previous results of a view for each document,
>>>> or
>>>> you have to have an old version of the document available to regenerate
>>>> those results.
>>>>
>>>> Therefore, a Review DB only accepts results from one or more map+reduce
>>>> views. You define beforehand what the group_level of the keys is that
>>>> will
>>>> be inserted.
>>>>
>>>> Furthermore, a Review DB disallows (but doesn't enforce) having 2 views
>>>> that generate the same keys. Otherwise, refcounting would need to be
>>>> used
>>>> and while that's not difficult, I think there's limited value in
>>>> allowing
>>>> this.
>>>>
>>>> The Review DB needs updating every time the reduction for a group key of
>>>> one of the participating views gets updated. Even though a map+reduce
>>>> view
>>>> has unique keys, we need a refcount since we have multiple views.
>>>> Whoever
>>>> got to insert its value last wins.
>>>>
>>>> There is a slight complication: group key values are calculated
>>>> on-the-fly
>>>> from the view result b-tree. So whenever a reduce call results in a new
>>>> value for a b-tree node, AND that node is the upper node of a subtree
>>>> that
>>>> is completely part of a group key, that group key needs to be marked for
>>>> recalculation.
>>>>
>>>> Likewise, if deletion/addition of a b-tree node results in the
>>>> removal/creation of the sole upper node of a group key subtree, that
>>>> group
>>>> key needs to be marked for removal/addition.
>>>>
>>>> This is the algorithm:
>>>> - When a reducing view gets updated, and it is part of a Review DB, use
>>>> the 2 paragraphs above to keep a list of group keys that need handling
>>>> - After updating the reduce() results, for each of the marked group
>>>> keys:
>>>> - If a group key gets removed:
>>>>  - look up doc with key=group key in review db. If exists:
>>>>   - delete doc
>>>> - If a group key gets added:
>>>>  - look up doc with key=group key in review db. If exists:
>>>>   - set doc.value to the row value
>>>>  - else
>>>>   - create doc with id=group key in string form, key=group key,
>>>> value=value
>>>> - If a group key gets updated:
>>>>  - look up doc with key=group key in review db. If exists:
>>>>   - set doc.value to the row value
>>>>  - else
>>>>   - create doc with id=group key in string form, key=group key,
>>>> value=value
>>>>
>>>> As you can see, this is something CouchDB should do since it knows when
>>>> it's updating group key reduction values and it knows if this was an
>>>> delete,
>>>> update or addition.
>>>>
>>>> View updates are done when the view is called; Review updates are done
>>>> at
>>>> this time as well. Views on Review DBs are done when they are called.
>>>>
>>>> Summary
>>>> =======
>>>> Review DBs are a sort of view index that CouchDB can maintain with
>>>> little
>>>> overhead. It caches group key results and allows chained map+reduce
>>>> calculations using mostly existing frameworks.
>>>>
>>>> I think this would be a very useful feature for CouchDB to have. There
>>>> are
>>>> regularly requests for storing view results in a database for
>>>> post-processing on the mailing lists.
>>>>
>>>> I'm not saying this is a trivial change but it doesn't seem technically
>>>> impossible to me either. (unless I missed something again; this is the
>>>> 5th
>>>> iteration of this proposal. Anyway I know *I* wouldn't be able to code
>>>> this
>>>> :-) )
>>>>
>>>> What do you think, oh dear devs?
>>>>
>>>> Wout.
>>>
>>>
>
>

Mime
View raw message