couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Proposal: Review DBs
Date Mon, 27 Apr 2009 16:28:02 GMT
Zachary,

Awesome. For non-incremental updates, the basic algorithm is simply to 
watch for updates to the view and, on each update, delete the review DB, 
create a new one, and dump the new data into it. I wouldn't try too hard 
to optimize updates at this point.
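
A minimal sketch of that delete/create/dump cycle, written in Python for 
illustration rather than as the Ruby script mentioned in this thread. 
Everything named here is an assumption: `FakeServer` is a hypothetical 
in-memory stand-in for a CouchDB server, `fetch_rows` stands for whatever 
returns the current view rows, and the sequence numbers passed to 
`sync_once` stand for whatever signal the update notifier uses to detect 
a change.

```python
# Hypothetical in-memory stand-in for a CouchDB server. A real script would
# issue HTTP requests instead (DELETE /db, PUT /db, POST /db/_bulk_docs).
class FakeServer:
    def __init__(self):
        self.dbs = {}

    def delete_db(self, name):
        self.dbs.pop(name, None)  # deleting a missing DB is a no-op here

    def create_db(self, name):
        self.dbs[name] = {}

    def bulk_insert(self, name, docs):
        for doc in docs:
            self.dbs[name][doc["_id"]] = doc


def rebuild_review_db(server, review_db, view_rows):
    """Throw away the review DB and refill it from the current view rows."""
    server.delete_db(review_db)
    server.create_db(review_db)
    # Assumption: a running counter is good enough as _id, as suggested
    # elsewhere in this thread; only key and value are carried over.
    docs = [
        {"_id": str(seq), "key": row["key"], "value": row["value"]}
        for seq, row in enumerate(view_rows)
    ]
    server.bulk_insert(review_db, docs)
    return len(docs)


def sync_once(server, review_db, fetch_rows, current_seq, last_seq):
    """Rebuild only if the source changed since the last rebuild."""
    if current_seq == last_seq:
        return last_seq
    rebuild_review_db(server, review_db, fetch_rows())
    return current_seq


if __name__ == "__main__":
    server = FakeServer()
    rows = [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
    last = sync_once(server, "review", lambda: rows, current_seq=5, last_seq=None)
    print(last, sorted(server.dbs["review"]))
```

The rebuild is deliberately wasteful: every change to the source throws 
away the whole review DB, which keeps the script simple at the cost of 
redoing all the work each time.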

Getting a ruby script out to show the basics should probably be the 
first step. Beyond that we'll have to take it a step at a time.

HTH,
Paul Davis


Zachary Zolton wrote:
> @jchris et al,
>
> if you had any pointers on how to implement this, i have a strong
> motivation to try my hand at it.
>
> i have a janky ruby script running as an update notifier that looks
> for certain criteria, idiomatic to my data, that puts docs into a
> derived database. but i'm not terribly happy with my current
> implementation...
>
> is there a general-purpose algorithm for dealing with updates?
>
>
> cheers,
>
> zach
>
>
> On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <jchris@gmail.com> wrote:
>   
>> Sent from my iPhone
>>
>> On Apr 26, 2009, at 2:26 PM, Wout Mertens <wout.mertens@gmail.com> wrote:
>>
>>     
>>> Hi Adam,
>>>
>>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>>
>>>       
>>>> Hi Wout, thanks for writing this up.
>>>>
>>>> One comment about the map-only views:  I think you'll find that Couch has
>>>> already done a good bit of the work needed to support them, too.  Couch
>>>> maintains a btree for each design doc keyed on docid that stores all the
>>>> view keys emitted by the maps over each document.  When a document is
>>>> updated and then analyzed, Couch has to consult that btree, purge all the
>>>> KVs associated with the old version of the doc from each view, and then
>>>> insert the new KVs.  So the tracking information correlating docids and view
>>>> keys is already available.
>>>>         
>>> See I did not know that :-) Although I should have guessed.
>>>
>>> However, in the mail before this one I argued that it doesn't make sense
>>> to combine or chain map-only views since you can always write a map function
>>> that does it in one step. Do you agree?
>>>
>>> You might also know the answer to this: is it possible to make the Review
>>> DB be a sort of view index on the current database? All it needs are JSON
>>> keys and values, no other fields.
>>>
>>>       
>>>> You'd still be left with the problem of generating unique docids for the
>>>> documents in the Review DB, but I think that's a problem that needs to be
>>>> solved.  The restriction to only MR views with no duplicate keys across
>>>> views seems too strong to me.
>>>>         
>>> Well, since the Review DB is a local(*) hidden database that's handled a
>>> bit specially, I think the easiest is to assign _id a sequence number and
>>> create a default view that indexes the documents by doc.key (for updating
>>> the value for that key). There will never be contention and we're only
>>> interested in the key index.
>>>       
>> We discussed this a little at CouchHack and I argued that the simplest
>> solution is actually good for a few reasons.
>>
>> The simple solution: provide a mechanism to copy the rows of a grouped
>> reduce function to a new database.
>>
>> Good because it is most like Hadoop/Google style map reduce. In that
>> paradigm, the output of a map/reduce job is not incremental, and it is
>> persisted in a way that allows for multiple later reduce stages to be run on
>> it. It's common in Hadoop to chain many m/r stages, and to try a few
>> iterations of each stage while developing code.
>>
>> I like this also because it provides the needed functionality without adding
>> any new primitives to CouchDB.
>>
>> The only downside of this approach is that it is not incremental. I'm not
>> sure that incremental chainability has much promise, as the index management
>> could be a pain, especially if you have branching chains.
>>
>> Another upside is that by reducing to a db, you give the user power to do
>> things like use replication to merge multiple data sets before applying more
>> views.
>>
>> I don't want to discourage anyone from experimenting with code, just want to
>> point out this simple solution which would be Very Easy to implement.
>>
>>     
>>> (*)local: I'm assuming that views are not replicated and need to be
>>> recalculated for each CouchDB node. If they are replicated somehow, I think
>>> it would still work but we'd have to look at it a little more.
>>>
>>>       
>>>> With that said, I'd prefer to spend my time extending the view engine to
>>>> handle chainable MR workflows in a single shot.  Especially in the simple
>>>> sort_by_value case it just seems like a cleaner way to go about things.
>>>>         
>>> Yes, that seems to be the gist of all repliers and I agree :-)
>>>
>>> In a nutshell, I'm hoping that:
>>> * A review is a new sort of view that has an "inputs" array in its
>>> definition.
>>> * Only MR views are allowed as inputs, no KV duplication allowed.
>>> * It builds a persistent index of the incoming views when those get
>>> updated.
>>> * That index is then used to build the view index for the review when the
>>> review gets updated.
>>> * I think I covered the most important algorithms needed to implement this
>>> in my original proposal.
>>>
>>> Does this sound feasible? If so I'll update my proposal accordingly.
>>>
>>> Wout.
>>>       

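For reference, the bookkeeping Adam describes in the quoted text above (an 
index keyed on docid that records which view keys each document emitted, 
so the old rows can be purged before the new ones are inserted) can be 
sketched as follows. This is an illustrative Python model only, not 
CouchDB's actual implementation: `MapViewIndex` and its methods are 
hypothetical names, plain dicts stand in for the btrees, and repeated 
emits of the same key by one document are collapsed into a single row.

```python
class MapViewIndex:
    def __init__(self, map_fn):
        self.map_fn = map_fn   # doc -> iterable of (key, value) pairs
        self.by_docid = {}     # docid -> keys emitted by the current revision
        self.rows = {}         # (key, docid) -> value

    def update(self, docid, doc):
        """Re-index one document; pass doc=None to model a deletion."""
        # Purge every row emitted by the previous revision of this doc.
        for key in self.by_docid.pop(docid, []):
            self.rows.pop((key, docid), None)
        if doc is None:
            return
        # Insert the new rows and remember which keys they used.
        keys = []
        for key, value in self.map_fn(doc):
            self.rows[(key, docid)] = value
            keys.append(key)
        self.by_docid[docid] = keys

    def sorted_rows(self):
        # Assumes all keys are mutually comparable (e.g. all strings).
        return sorted((k, d, v) for (k, d), v in self.rows.items())


if __name__ == "__main__":
    idx = MapViewIndex(lambda doc: [(tag, 1) for tag in doc["tags"]])
    idx.update("d1", {"tags": ["x", "y"]})
    idx.update("d1", {"tags": ["z"]})  # old rows for d1 are purged first
    print(idx.sorted_rows())
```

The `by_docid` mapping is what makes the update incremental: without it, 
finding the stale rows for a changed document would mean scanning the 
whole view.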