couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Proposal: Review DBs
Date Mon, 27 Apr 2009 19:48:29 GMT
Zachary,

Hmm. Naming your derived databases as base_db-stage-etag doesn't sound 
like a bad idea. Though I dunno how you'd communicate to clients that 
they should start hitting the new versions, and it also doesn't tell 
the admins when to drop old indices.

The only thing that comes to mind is to stick some intermediary in 
between clients and the actual derived data, both to make the switchover 
transparent and to let you know when you can clean up old versions, etc.
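
For what it's worth, that bookkeeping could be as small as something like the following Ruby sketch. It is purely hypothetical: the class and method names are made up, and the actual HTTP calls for creating and dropping databases are left out. The idea is just that clients ask the pointer for the current name instead of hard-coding base_db-stage-etag themselves, and the rebuild script learns which stale db it may drop:

```ruby
# Hypothetical in-memory pointer from a base db to its current derived db.
# In practice this state would live somewhere shared (a doc in CouchDB, say);
# here it is kept in a Hash just to show the switchover logic.
class StagePointer
  def initialize
    @current = {}  # base db name => current derived db name
  end

  # The naming scheme discussed in the thread: base_db-stage-<n>-<etag>.
  def derived_name(base_db, stage, etag)
    "#{base_db}-stage-#{stage}-#{etag}"
  end

  # Point clients at a freshly built derived db. Returns the previous
  # (now stale) db name, or nil, so the caller knows what it can drop.
  def switch!(base_db, stage, etag)
    old = @current[base_db]
    @current[base_db] = derived_name(base_db, stage, etag)
    old
  end

  # What clients query instead of hard-coding a versioned db name.
  def current(base_db)
    @current.fetch(base_db)
  end
end
```

A rebuild script would call switch! once the new stage db is fully populated, then drop whatever name switch! returned.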

I'll keep thinking on it.

Paul

Zachary Zolton wrote:
> okay, i'm starting to get ya. my question is, if i'm constantly
> dropping/recreating/reindexing the derived database, how can i keep
> serving requests from my website?
>
> one possible solution would be to time/etag/etc-stamp the derived db
> name, but that would seem to add a number of moving parts to my
> system.
>
> hmm... any ideas of how to pull a quick switcheroo on the backend of
> my system, without too much hassle in the client code?
>
> On Mon, Apr 27, 2009 at 1:44 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
>   
>> Zachary,
>>
>> No worries, the rough outline I'd do here is something like:
>>
>> 1. Figure out some member structure in the _design document that will
>> represent your data flow. For the moment I would do something extremely
>> simple, as in:
>>
>> Assume:
>> db_name = "base_db"
>>
>> {
>>   "_id": "_design/base",
>>   "views": {
>>       "stage-1": {
>>           "map": "function(doc) ...",
>>           "reduce": "function(keys, values, rereduce) ..."
>>       }
>>   },
>>   "review": [
>>       {"map": "function(doc) ...", "reduce": "function(keys, vals, rereduce) ..."},
>>       {"map": "function(doc) ...", "reduce": "function(keys, vals, rereduce) ..."}
>>   ]
>> }
>>
>> So the review member becomes the stages in your data flow. I'm avoiding any
>> forking or merging in this example in honor of the "make it work, make it
>> not suck" development flow.
>>
>> Now the basic algorithm would be something like:
>>
>> For each array element in the "review" member, create a db. Something like:
>>
>> base_db-stage-1 with a design document that contains a view built from the
>> first element of the "review" member, base_db-stage-2 with the second
>> element, and so on.
>>
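
That splitting step could be sketched in Ruby roughly as follows. This is a hypothetical helper: the names just follow the example above, and it only computes the stage db names with their one-view design docs, without actually talking to CouchDB:

```ruby
# Hypothetical sketch: given a base db name and its parsed _design doc,
# produce one (db_name, design_doc) pair per entry in the "review" array.
# Each derived db gets a single-view design doc holding that stage's
# map/reduce functions.
def stage_databases(base_db, design_doc)
  design_doc.fetch("review", []).each_with_index.map do |stage, i|
    db_name = "#{base_db}-stage-#{i + 1}"
    ddoc = {
      "_id"   => "_design/#{base_db}",
      "views" => { "stage" => stage }
    }
    [db_name, ddoc]
  end
end
```

A driver script would then PUT each db_name and its ddoc before starting the copy loop.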
>> Then your script can check the view status in each database, either from a
>> cron job or from an update_notifier. To do so, you can just:
>>
>> HEAD /base_db/_design/base/_view/stage-1
>>
>> And then check the returned ETag. For the moment this is exactly equivalent
>> to checking the database's update_seq, because of how the ETag is calculated,
>> but in the future, when we track the last update_seq for each view change,
>> this will be a free upgrade. Plus, it's a bit more logical to check
>> "view state" instead of "db state".
>>
>> When the ETags don't match, you can just drop the next db in the flow,
>> create it, and then copy the view output. The drop/create just makes the
>> algorithm easily implementable for now. In the future there can be some
>> extra logic to only change the new view as far as it requires by iterating
>> over the two views and doing a merge sortish type of thing. I think...
>> Sounds like there should be a way at least.
>>
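
The cron-side bookkeeping described above could be sketched like this in Ruby. It is hypothetical: the helper only remembers the last ETag seen per view path and reports whether a rebuild is due; the actual HEAD request against /base_db/_design/base/_view/stage-1 and the drop/create/copy cycle are left to the calling script:

```ruby
# Hypothetical watcher for the ETag-polling algorithm: remember the last
# ETag we rebuilt against for each view path, and flag stages whose views
# have changed since then.
class StageWatcher
  def initialize
    @seen = {}  # view path => ETag at the time of the last rebuild
  end

  # True when the derived db for this view should be dropped, recreated,
  # and refilled from the view output. A never-seen path is always stale.
  def stale?(view_path, etag)
    @seen[view_path] != etag
  end

  # Record a successful drop/create/copy cycle for this view.
  def mark_rebuilt(view_path, etag)
    @seen[view_path] = etag
  end
end
```

Each cron run would HEAD the view, pass the returned ETag to stale?, and on true rebuild the stage db and call mark_rebuilt.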
>> Once that works we can look at bolting on different fancy things like having
>> forking map/reduce mechanisms and my current pet idea of adding in the merge
>> stuff that has been talked about.
>>
>> This is actually starting to sound like a fun little project....
>>
>> HTH,
>> Paul Davis
>>
>> Zachary Zolton wrote:
>>     
>>> paul
>>>
>>> alright... you've gotta give me the remedial explanation of what you
>>> meant here! (sorry, i'm still noob-ish)
>>>
>>> so, are you saying that i shouldn't even check for individual doc
>>> updates, but instead just recreate the entire database? that sounds
>>> like a job for cron, more so than the update notifier, right?
>>>
>>> i'd put up my current ruby script, but it deals with update notifications
>>> in a way that's very specific to my data — probably very naïvely, to
>>> boot!
>>>
>>> zach
>>>
>>> On Mon, Apr 27, 2009 at 11:28 AM, Paul Davis
>>> <paul.joseph.davis@gmail.com> wrote:
>>>
>>>> Zachary,
>>>>
>>>> Awesome. The thing with non-incremental updates is that the basic
>>>> algorithm would be to just look for updates to the view and, on update,
>>>> delete the review DB, create a new one, and then dump the new data into
>>>> it. I wouldn't try too hard to optimize updates at this point in time.
>>>>
>>>> Getting a ruby script out to show the basics should probably be the first
>>>> step. Beyond that we'll have to take it a step at a time.
>>>>
>>>> HTH,
>>>> Paul Davis
>>>>
>>>>
>>>> Zachary Zolton wrote:
>>>>
>>>>> @jchris et al,
>>>>>
>>>>> if you had any pointers on how to implement this, i have a strong
>>>>> motivation to try my hand at it.
>>>>>
>>>>> i have a janky ruby script running as an update notifier that looks
>>>>> for certain criteria, idiomatic to my data, and puts docs into a
>>>>> derived database. but i'm not terribly happy with my current
>>>>> implementation...
>>>>>
>>>>> is there a general-purpose algorithm for dealing with updates?
>>>>>
>>>>>
>>>>> cheers,
>>>>>
>>>>> zach
>>>>>
>>>>>
>>>>> On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <jchris@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Apr 26, 2009, at 2:26 PM, Wout Mertens <wout.mertens@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>> Hi Adam,
>>>>>>>
>>>>>>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi Wout, thanks for writing this up.
>>>>>>>>
>>>>>>>> One comment about the map-only views: I think you'll find that Couch
>>>>>>>> has already done a good bit of the work needed to support them, too.
>>>>>>>> Couch maintains a btree for each design doc, keyed on docid, that
>>>>>>>> stores all the view keys emitted by the maps over each document. When
>>>>>>>> a document is updated and then analyzed, Couch has to consult that
>>>>>>>> btree, purge all the KVs associated with the old version of the doc
>>>>>>>> from each view, and then insert the new KVs. So the tracking
>>>>>>>> information correlating docids and view keys is already available.
>>>>>>>>
>>>>>>> See I did not know that :-) Although I should have guessed.
>>>>>>>
>>>>>>> However, in the mail before this one I argued that it doesn't make
>>>>>>> sense to combine or chain map-only views, since you can always write a
>>>>>>> map function that does it in one step. Do you agree?
>>>>>>>
>>>>>>> You might also know the answer to this: is it possible to make the
>>>>>>> Review DB be a sort of view index on the current database? All it
>>>>>>> needs are JSON keys and values, no other fields.
>>>>>>>
>>>>>>>> You'd still be left with the problem of generating unique docids for
>>>>>>>> the documents in the Review DB, but I think that's a problem that
>>>>>>>> needs to be solved. The restriction to only MR views with no
>>>>>>>> duplicate keys across views seems too strong to me.
>>>>>>>>
>>>>>>> Well, since the Review DB is a local(*) hidden database that's
>>>>>>> handled a bit specially, I think the easiest is to assign _id a
>>>>>>> sequence number and create a default view that indexes the documents
>>>>>>> by doc.key (for updating the value for that key). There will never be
>>>>>>> contention, and we're only interested in the key index.
>>>>>>>
>>>>>> We discussed this a little at CouchHack and I argued that the simplest
>>>>>> solution is actually good for a few reasons.
>>>>>>
>>>>>> The simple solution: provide a mechanism to copy the rows of a grouped
>>>>>> reduce function to a new database.
>>>>>>
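
That copy step could be sketched in Ruby as below. This is hypothetical: it only turns the rows of a group=true view response into a _bulk_docs payload for the next database, and using the JSON-encoded row key as the _id is just one possible answer to the unique-docid question raised elsewhere in the thread (group=true guarantees the keys themselves are unique):

```ruby
require "json"

# Hypothetical transform: map the "rows" array of a grouped reduce view
# response (each row a {"key" => ..., "value" => ...} hash) onto the body
# of a POST to the target database's _bulk_docs endpoint.
def rows_to_bulk_docs(view_rows)
  docs = view_rows.map do |row|
    {
      "_id"   => JSON.generate(row["key"]),  # unique because group=true
      "key"   => row["key"],
      "value" => row["value"]
    }
  end
  { "docs" => docs }
end
```

A driver would GET the view with group=true, feed response["rows"] through this, and POST the result to the new db.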
>>>>>> Good because it is most like Hadoop/Google-style map/reduce. In that
>>>>>> paradigm, the output of a map/reduce job is not incremental, and it is
>>>>>> persisted in a way that allows for multiple later reduce stages to be
>>>>>> run on it. It's common in Hadoop to chain many m/r stages, and to try
>>>>>> a few iterations of each stage while developing code.
>>>>>>
>>>>>> I like this also because it provides the needed functionality without
>>>>>> adding any new primitives to CouchDB.
>>>>>>
>>>>>> The only downside of this approach is that it is not incremental. I'm
>>>>>> not sure that incremental chainability has much promise, as the index
>>>>>> management could be a pain, especially if you have branching chains.
>>>>>>
>>>>>> Another upside is that by reducing to a db, you give the user power to
>>>>>> do things like use replication to merge multiple data sets before
>>>>>> applying more views.
>>>>>>
>>>>>> I don't want to discourage anyone from experimenting with code, just
>>>>>> want to point out this simple solution which would be Very Easy to
>>>>>> implement.
>>>>>>
>>>>>>> (*)local: I'm assuming that views are not replicated and need to be
>>>>>>> recalculated for each CouchDB node. If they are replicated somehow, I
>>>>>>> think it would still work, but we'd have to look at it a little more.
>>>>>>>
>>>>>>>> With that said, I'd prefer to spend my time extending the view
>>>>>>>> engine to handle chainable MR workflows in a single shot. Especially
>>>>>>>> in the simple sort_by_value case it just seems like a cleaner way to
>>>>>>>> go about things.
>>>>>>>>
>>>>>>> Yes, that seems to be the gist of all the replies, and I agree :-)
>>>>>>>
>>>>>>> In a nutshell, I'm hoping that:
>>>>>>> * A review is a new sort of view that has an "inputs" array in
its
>>>>>>> definition.
>>>>>>> * Only MR views are allowed as inputs, no KV duplication allowed.
>>>>>>> * It builds a persistent index of the incoming views when those
get
>>>>>>> updated.
>>>>>>> * That index is then used to build the view index for the review
when
>>>>>>> the
>>>>>>> review gets updated.
>>>>>>> * I think I covered the most important algorithms needed to implement
>>>>>>> this
>>>>>>> in my original proposal.
>>>>>>>
>>>>>>> Does this sound feasible? If so I'll update my proposal accordingly.
>>>>>>>
>>>>>>> Wout.
>>>>>>>

