couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Proposal: Review DBs
Date Mon, 27 Apr 2009 18:44:41 GMT
Zachary,

No worries, the rough outline I'd do here is something like:

1. Figure out some member structure in the _design document that will 
represent your data flow. For the moment I would do something extremely 
simple, as in:

Assume:
db_name = "base_db"

{
    "_id": "_design/base",
    "views": {
        "stage-1": {
            "map": "function(doc) ...",
            "reduce": "function(keys, values, rereduce) ..."
        }
    },
    "review": [
        {"map": "function(doc) ...", "reduce": "function(keys, vals, rereduce) ..."},
        {"map": "function(doc) ...", "reduce": "function(keys, vals, rereduce) ..."}
    ]
}

So the review member becomes the stages in your data flow. I'm avoiding 
any forking or merging in this example in honor of the "make it work, 
make it not suck" development flow.

Now the basic algorithm would be something like:

For each array element in the "review" member, create a db, something like:

base_db-stage-1 with a design document that contains a view built from 
the first element of the "review" member, base_db-stage-2 with the 
second element, and so on.
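
Concretely, a rough Ruby sketch of this setup step might look something 
like the following (assuming CouchDB on http://localhost:5984 and the 
names from the example above; the stage db and design doc names here are 
just illustrative, not a fixed convention):

require 'net/http'
require 'json'
require 'uri'

couch = URI("http://localhost:5984")

Net::HTTP.start(couch.host, couch.port) do |http|
  # Fetch the base design doc and read its "review" member.
  ddoc = JSON.parse(http.get("/base_db/_design/base").body)

  ddoc["review"].each_with_index do |stage, i|
    db = "base_db-stage-#{i + 1}"
    # Create the stage database (a 409 here just means it already exists).
    http.put("/#{db}", "")
    # Install a design doc whose only view is this stage's map/reduce.
    stage_ddoc = {
      "_id"   => "_design/stage",
      "views" => {
        "stage" => {"map" => stage["map"], "reduce" => stage["reduce"]}
      }
    }
    http.put("/#{db}/_design/stage", stage_ddoc.to_json,
             "Content-Type" => "application/json")
  end
end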

Then your script can check the view status in each database, either from 
a cron job or from an update_notifier. To do so, you can just:

HEAD /base_db/_design/base/_view/stage-1

And then check the returned ETag. For the moment this is exactly 
equivalent to checking the database's update_seq because of how the ETag 
is calculated, but in the future, when we track the last update_seq for 
each view change, this will be a free upgrade. Plus it's a bit more 
logical to check "view state" instead of "db state".
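
A rough sketch of that check in Ruby might be (the stage URL and the 
little file used to remember the last ETag are just placeholders for 
illustration):

require 'net/http'
require 'uri'

uri = URI("http://localhost:5984/base_db/_design/base/_view/stage-1")

res = Net::HTTP.start(uri.host, uri.port) do |http|
  http.head(uri.request_uri)   # HEAD request: we only care about the headers
end

etag = res["ETag"]
last = File.exist?("stage-1.etag") ? File.read("stage-1.etag") : nil

if etag != last
  # The view output changed since we last looked; remember the new ETag
  # and rebuild the next stage db as described below.
  File.open("stage-1.etag", "w") { |f| f.write(etag) }
end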

When the ETags don't match, you can just drop the next db in the flow, 
create it, and then copy the view output. The drop/create just makes the 
algorithm easy to implement for now. In the future there could be some 
extra logic to only update the new view as far as it needs to go, by 
iterating over the two views and doing a merge-sort-ish type of thing. I 
think... Sounds like there should be a way, at least.
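
The copy step might look something like this rough Ruby sketch, again 
with made-up names: drop and recreate base_db-stage-2, read the grouped 
reduce rows of stage-1, and bulk-load them as docs (error handling and 
the choice of _id glossed over):

require 'net/http'
require 'json'
require 'uri'

couch = URI("http://localhost:5984")

Net::HTTP.start(couch.host, couch.port) do |http|
  # Drop and recreate the downstream database.
  http.delete("/base_db-stage-2")
  http.put("/base_db-stage-2", "")

  # Read the grouped reduce output of the upstream view.
  res  = http.get("/base_db/_design/base/_view/stage-1?group=true")
  rows = JSON.parse(res.body)["rows"]

  # Turn each reduce row into a doc and bulk-load it into the new db.
  docs = rows.map { |r| {"key" => r["key"], "value" => r["value"]} }
  http.post("/base_db-stage-2/_bulk_docs", {"docs" => docs}.to_json,
            "Content-Type" => "application/json")
end

A cron job (or an update_notifier) can then run the ETag check and, when 
it fires, this copy step for each stage in order.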

Once that works we can look at bolting on different fancy things like 
having forking map/reduce mechanisms and my current pet idea of adding 
in the merge stuff that has been talked about.

This is actually starting to sound like a fun little project....

HTH,
Paul Davis

Zachary Zolton wrote:
> paul
>
> alright... you've gotta give me the remedial explanation of what you
> meant here! (sorry, i'm still noob-ish)
>
> so, are you saying that i shouldn't even check for individual doc
> updates, but instead just recreate the entire database? that sounds
> like a job for cron, more so than the update notifier, right?
>
> i'd put up my current ruby script, but it deals with update
> notifications in a way that's very specific to my data (probably very
> naïvely, to boot!)
>
> zach
>
> On Mon, Apr 27, 2009 at 11:28 AM, Paul Davis
> <paul.joseph.davis@gmail.com> wrote:
>   
>> Zachary,
>>
>> Awesome. The thing with non-incremental updates is that the basic algorithm
>> would be to just look for updates to the view and on update, delete the
>> review DB, create a new one, and then dump the new data into it. I wouldn't
>> try too hard to optimize updates at this point in time.
>>
>> Getting a ruby script out to show the basics should probably be the first
>> step. Beyond that we'll have to take it a step at a time.
>>
>> HTH,
>> Paul Davis
>>
>>
>> Zachary Zolton wrote:
>>     
>>> @jchris et al,
>>>
>>> if you have any pointers on how to implement this, i have a strong
>>> motivation to try my hand at it.
>>>
>>> i have a janky ruby script running as an update notifier that looks
>>> for certain criteria, idiomatic to my data, that puts docs into a
>>> derived database. but i'm not terribly happy with my current
>>> implementation...
>>>
>>> is there a general-purpose algorithm for dealing with updates?
>>>
>>>
>>> cheers,
>>>
>>> zach
>>>
>>>
>>> On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <jchris@gmail.com> wrote:
>>>
>>>       
>>>> Sent from my iPhone
>>>>
>>>> On Apr 26, 2009, at 2:26 PM, Wout Mertens <wout.mertens@gmail.com> wrote:
>>>>
>>>>
>>>>         
>>>>> Hi Adam,
>>>>>
>>>>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>>>>
>>>>>
>>>>>           
>>>>>> Hi Wout, thanks for writing this up.
>>>>>>
>>>>>> One comment about the map-only views:  I think you'll find that Couch
>>>>>> has already done a good bit of the work needed to support them, too.
>>>>>> Couch maintains a btree for each design doc keyed on docid that stores
>>>>>> all the view keys emitted by the maps over each document.  When a
>>>>>> document is updated and then analyzed, Couch has to consult that
>>>>>> btree, purge all the KVs associated with the old version of the doc
>>>>>> from each view, and then insert the new KVs.  So the tracking
>>>>>> information correlating docids and view keys is already available.
>>>>>>
>>>>>>             
>>>>> See I did not know that :-) Although I should have guessed.
>>>>>
>>>>> However, in the mail before this one I argued that it doesn't make sense
>>>>> to combine or chain map-only views since you can always write a map
>>>>> function
>>>>> that does it in one step. Do you agree?
>>>>>
>>>>> You might also know the answer to this: is it possible to make the
>>>>> Review
>>>>> DB be a sort of view index on the current database? All it needs are
>>>>> JSON
>>>>> keys and values, no other fields.
>>>>>
>>>>>
>>>>>           
>>>>>> You'd still be left with the problem of generating unique docids for
>>>>>> the documents in the Review DB, but I think that's a problem that
>>>>>> needs to be solved.  The restriction to only MR views with no
>>>>>> duplicate keys across views seems too strong to me.
>>>>>>
>>>>>>             
>>>>> Well, since the Review DB is a local(*) hidden database that's handled
>>>>> a bit specially, I think the easiest is to assign _id a sequence number
>>>>> and create a default view that indexes the documents by doc.key (for
>>>>> updating the value for that key). There will never be contention and
>>>>> we're only interested in the key index.
>>>>>
>>>>>           
>>>> We discussed this a little at CouchHack and I argued that the simplest
>>>> solution is actually good for a few reasons.
>>>>
>>>> The simple solution: provide a mechanism to copy the rows of a grouped
>>>> reduce function to a new database.
>>>>
>>>> Good because it is most like Hadoop/Google style map reduce. In that
>>>> paradigm, the output of a map/reduce job is not incremental, and it is
>>>> persisted in a way that allows for multiple later reduce stages to be run
>>>> on
>>>> it. It's common in Hadoop to chain many m/r stages, and to try a few
>>>> iterations of each stage while developing code.
>>>>
>>>> I like this also because it provides the needed functionality without
>>>> adding
>>>> any new primitives to CouchDB.
>>>>
>>>> The only downside of this approach is that it is not incremental. I'm not
>>>> sure that incremental chainability has much promise, as the index
>>>> management
>>>> could be a pain, especially if you have branching chains.
>>>>
>>>> Another upside is that by reducing to a db, you give the user power to do
>>>> things like use replication to merge multiple data sets before applying
>>>> more
>>>> views.
>>>>
>>>> I don't want to discourage anyone from experimenting with code, just want
>>>> to
>>>> point out this simple solution which would be Very Easy to implement.
>>>>
>>>>
>>>>         
>>>>> (*)local: I'm assuming that views are not replicated and need to be
>>>>> recalculated for each CouchDB node. If they are replicated somehow, I
>>>>> think
>>>>> it would still work but we'd have to look at it a little more.
>>>>>
>>>>>
>>>>>           
>>>>>> With that said, I'd prefer to spend my time extending the view engine
>>>>>> to
>>>>>> handle chainable MR workflows in a single shot.  Especially in the
>>>>>> simple
>>>>>> sort_by_value case it just seems like a cleaner way to go about things.
>>>>>>
>>>>>>             
>>>>> Yes, that seems to be the gist of all repliers and I agree :-)
>>>>>
>>>>> In a nutshell, I'm hoping that:
>>>>> * A review is a new sort of view that has an "inputs" array in its
>>>>> definition.
>>>>> * Only MR views are allowed as inputs, no KV duplication allowed.
>>>>> * It builds a persistent index of the incoming views when those get
>>>>> updated.
>>>>> * That index is then used to build the view index for the review when
>>>>> the
>>>>> review gets updated.
>>>>> * I think I covered the most important algorithms needed to implement
>>>>> this
>>>>> in my original proposal.
>>>>>
>>>>> Does this sound feasible? If so I'll update my proposal accordingly.
>>>>>
>>>>> Wout.
>>>>>
>>>>>           
>>     

