couchdb-dev mailing list archives

From Zachary Zolton <zachary.zol...@gmail.com>
Subject Re: Proposal: Review DBs
Date Mon, 27 Apr 2009 21:27:20 GMT
Perhaps the background job could maintain a version number in the base
DB's design doc, which clients could use to know which version of the
derived database to hit.

See? ;^) This might be simpler, as a 1st-class CouchDB feature, than a
bolted-on script.
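
For illustration, here's a minimal Ruby sketch of that idea; the
"review_version" member and the derived-DB naming scheme are
assumptions, not anything CouchDB actually defines:

require 'net/http'
require 'json'
require 'uri'

COUCH = 'http://127.0.0.1:5984'

# Read the hypothetical version pointer out of the base DB's design
# doc and build the derived-database name clients should hit.
def current_review_db(base = 'base_db')
  ddoc = JSON.parse(Net::HTTP.get(URI("#{COUCH}/#{base}/_design/base")))
  "#{base}-stage-final-v#{ddoc['review_version']}"
end

puts current_review_db   # => e.g. "base_db-stage-final-v42"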

On Mon, Apr 27, 2009 at 2:48 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
> Zachary,
>
> Hmm. Naming your derived databases as base_db-stage-etag doesn't sound like
> a bad idea. Though I dunno how you'd communicate to clients that they
> should start hitting the new versions, and it also doesn't tell the admins
> when to drop old indices.
>
> The only thing that comes to mind is to stick some intermediary in between
> clients and the actual derived data, to make the switch transparent and
> also to let you know when you can clean up old versions, etc.
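
Purely as a sketch of such an intermediary (nothing CouchDB ships, and
the "current_review_db" member is made up): a tiny Rack app that
redirects clients to whichever derived DB is current.

require 'net/http'
require 'json'
require 'uri'

# Made-up design-doc member, updated by the rebuild job.
def current_db
  ddoc = JSON.parse(Net::HTTP.get(URI('http://127.0.0.1:5984/base_db/_design/base')))
  ddoc['current_review_db']
end

# Minimal Rack app: clients always hit the proxy, never the versioned
# DB name, so old versions can be dropped once traffic moves on.
# Run from a config.ru containing these lines plus: run App
App = lambda do |env|
  [302, { 'Location' => "http://127.0.0.1:5984/#{current_db}#{env['PATH_INFO']}" }, []]
end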
>
> I'll keep thinking on it.
>
> Paul
>
> Zachary Zolton wrote:
>>
>> okay, i'm starting to get ya. my question is, if i'm constantly
>> dropping/recreating/reindexing the derived database, how can i keep
>> serving requests from my website?
>>
>> one possible solution would be to time/etag/etc-stamp the derived db
>> name, but that would seem to add a number of moving parts to my
>> system.
>>
>> hmm... any ideas of how to pull a quick switcheroo on the backend of
>> my system, without too much hassle in the client code?
>>
>> On Mon, Apr 27, 2009 at 1:44 PM, Paul Davis <paul.joseph.davis@gmail.com>
>> wrote:
>>
>>>
>>> Zachary,
>>>
>>> No worries, the rough outline I'd do here is something like:
>>>
>>> 1. Figure out some member structure in the _design document that will
>>> represent your data flow. For the moment I would do something extremely
>>> simple, as in:
>>>
>>> Assume:
>>> db_name = "base_db"
>>>
>>> {
>>>   "_id": "_design/base",
>>>   "views": {
>>>     "stage-1": {
>>>       "map": "function(doc) ...",
>>>       "reduce": "function(keys, values, rereduce) ..."
>>>     }
>>>   },
>>>   "review": [
>>>     {"map": "function(doc) ...",
>>>      "reduce": "function(keys, vals, rereduce) ..."},
>>>     {"map": "function(doc) ...",
>>>      "reduce": "function(keys, vals, rereduce) ..."}
>>>   ]
>>> }
>>>
>>> So the "review" member defines the stages in your data flow. I'm
>>> avoiding any forking or merging in this example, in honor of the "make
>>> it work, make it not suck" development flow.
>>>
>>> Now the basic algorithm would be something like:
>>>
>>> For each array element in the "review" member, create a db, something
>>> like:
>>>
>>> base_db-stage-1, with a design document containing a view built from
>>> the first element of the "review" member; base_db-stage-2 with the
>>> second element, and so on.
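
A rough Ruby sketch of that loop, assuming the design doc layout above
and a CouchDB on localhost (the per-stage view name "v" is arbitrary):

require 'net/http'
require 'json'
require 'uri'

ddoc   = JSON.parse(Net::HTTP.get(URI('http://127.0.0.1:5984/base_db/_design/base')))
stages = ddoc['review'] || []

# One derived DB per stage, each holding a design doc with that
# stage's map/reduce as its only view.
Net::HTTP.start('127.0.0.1', 5984) do |http|
  stages.each_with_index do |stage, i|
    db = "base_db-stage-#{i + 1}"
    http.send_request('PUT', "/#{db}")   # a 412 here just means it already exists
    http.send_request('PUT', "/#{db}/_design/stage",
                      JSON.generate('views' => { 'v' => stage }),
                      'Content-Type' => 'application/json')
  end
end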
>>>
>>> Then your script can check the view status in each database, either
>>> from a cron job or from an update_notifier. To do so, you can just:
>>>
>>> HEAD /base_db/_design/base/_view/stage-1
>>>
>>> And then check the returned ETag. For the moment this is exactly
>>> equivalent to checking the database's update_seq, because of how the
>>> ETag is calculated, but in the future, when we track the last
>>> update_seq for each view change, this will be a free upgrade. Plus,
>>> it's a bit more logical to check "view state" instead of "db state".
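
In Ruby, that polling check might look something like the following;
stashing the last-seen ETag in a file is just one way to remember it:

require 'net/http'

STATE = '/tmp/base_db-stage-1.etag'   # wherever the last-seen ETag lives

# HEAD the source view; a changed ETag means the next stage is stale.
def stage_changed?
  Net::HTTP.start('127.0.0.1', 5984) do |http|
    etag = http.head('/base_db/_design/base/_view/stage-1')['etag'].to_s
    last = File.exist?(STATE) ? File.read(STATE) : nil
    File.write(STATE, etag)
    etag != last
  end
end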
>>>
>>> When the ETags don't match, you can just drop the next db in the
>>> flow, create it, and then copy the view output. The drop/create just
>>> keeps the algorithm easy to implement for now. In the future there
>>> could be some extra logic to only update the new view as far as
>>> required, by iterating over the two views and doing a merge-sort type
>>> of thing. I think... sounds like there should be a way, at least.
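
And a naive take on the drop-and-copy step, again as a hedged sketch:
group=true and the row-to-doc mapping are assumptions here, and a real
script would page through the view instead of slurping it whole.

require 'net/http'
require 'json'
require 'uri'

view = URI('http://127.0.0.1:5984/base_db/_design/base/_view/stage-1?group=true')
rows = JSON.parse(Net::HTTP.get(view))['rows']

Net::HTTP.start('127.0.0.1', 5984) do |http|
  # Drop and recreate the derived DB, then bulk-load the view output,
  # using the (stringified) group key as each doc's _id.
  http.send_request('DELETE', '/base_db-stage-1')
  http.send_request('PUT', '/base_db-stage-1')

  docs = rows.map { |r| { '_id' => r['key'].to_s, 'key' => r['key'], 'value' => r['value'] } }
  http.send_request('POST', '/base_db-stage-1/_bulk_docs',
                    JSON.generate('docs' => docs),
                    'Content-Type' => 'application/json')
end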
>>>
>>> Once that works we can look at bolting on fancy things like forking
>>> map/reduce mechanisms, and my current pet idea of adding in the merge
>>> stuff that has been talked about.
>>>
>>> This is actually starting to sound like a fun little project....
>>>
>>> HTH,
>>> Paul Davis
>>>
>>> Zachary Zolton wrote:
>>>
>>>>
>>>> paul
>>>>
>>>> alright... you've gotta give me the remedial explanation of what you
>>>> meant here! (sorry, i'm still noob-ish)
>>>>
>>>> so, are you saying that i shouldn't even check for individual doc
>>>> updates, but instead just recreate the entire database? that sounds
>>>> like a job for cron, more so than the update notifier, right?
>>>>
>>>> i'd put up my current ruby script, but it deals with update
>>>> notifications in a way that's very specific to my data - probably
>>>> very naïvely, to boot!
>>>>
>>>> zach
>>>>
>>>> On Mon, Apr 27, 2009 at 11:28 AM, Paul Davis
>>>> <paul.joseph.davis@gmail.com> wrote:
>>>>>
>>>>> Zachary,
>>>>>
>>>>> Awesome. The thing with non-incremental updates is that the basic
>>>>> algorithm would just be to look for updates to the view and, on
>>>>> update, delete the review DB, create a new one, and then dump the
>>>>> new data into it. I wouldn't try too hard to optimize updates at
>>>>> this point in time.
>>>>>
>>>>> Getting a ruby script out to show the basics should probably be the
>>>>> first
>>>>> step. Beyond that we'll have to take it a step at a time.
>>>>>
>>>>> HTH,
>>>>> Paul Davis
>>>>>
>>>>>
>>>>> Zachary Zolton wrote:
>>>>>>
>>>>>> @jchris et al,
>>>>>>
>>>>>> if you had any pointers on how to implement this, i have a strong
>>>>>> motivation to try my hand at it.
>>>>>>
>>>>>> i have a janky ruby script running as an update notifier that
>>>>>> looks for certain criteria, idiomatic to my data, and puts docs
>>>>>> into a derived database. but i'm not terribly happy with my
>>>>>> current implementation...
>>>>>>
>>>>>> is there a general-purpose algorithm for dealing with updates?
>>>>>>
>>>>>>
>>>>>> cheers,
>>>>>>
>>>>>> zach
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <jchris@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Apr 26, 2009, at 2:26 PM, Wout Mertens <wout.mertens@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Adam,
>>>>>>>>
>>>>>>>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>>>>>>>>
>>>>>>>>> Hi Wout, thanks for writing this up.
>>>>>>>>>
>>>>>>>>> One comment about the map-only views: I think you'll find that
>>>>>>>>> Couch has already done a good bit of the work needed to support
>>>>>>>>> them, too. Couch maintains a btree for each design doc, keyed
>>>>>>>>> on docid, that stores all the view keys emitted by the maps
>>>>>>>>> over each document. When a document is updated and then
>>>>>>>>> analyzed, Couch has to consult that btree, purge all the KVs
>>>>>>>>> associated with the old version of the doc from each view, and
>>>>>>>>> then insert the new KVs. So the tracking information
>>>>>>>>> correlating docids and view keys is already available.
>>>>>>>>
>>>>>>>> See I did not know that :-) Although I should have guessed.
>>>>>>>>
>>>>>>>> However, in the mail before this one I argued that it doesn't
>>>>>>>> make sense to combine or chain map-only views, since you can
>>>>>>>> always write a map function that does it in one step. Do you
>>>>>>>> agree?
>>>>>>>>
>>>>>>>> You might also know the answer to this: is it possible to make
>>>>>>>> the Review DB be a sort of view index on the current database?
>>>>>>>> All it needs are JSON keys and values, no other fields.
>>>>>>>>>
>>>>>>>>> You'd still be left with the problem of generating unique
>>>>>>>>> docids for the documents in the Review DB, but I think that's a
>>>>>>>>> problem that needs to be solved. The restriction to only MR
>>>>>>>>> views with no duplicate keys across views seems too strong to
>>>>>>>>> me.
>>>>>>>>
>>>>>>>> Well, since the Review DB is a local(*) hidden database that's
>>>>>>>> handled a bit specially, I think the easiest is to assign _id a
>>>>>>>> sequence number and create a default view that indexes the
>>>>>>>> documents by doc.key (for updating the value for that key).
>>>>>>>> There will never be contention, and we're only interested in the
>>>>>>>> key index.
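
Sketched concretely in Ruby (every name below is invented for
illustration): sequence numbers as _ids, plus the default by-key view
Wout describes.

require 'json'

# Example grouped-reduce rows headed for the Review DB.
rows = [{ 'key' => 'apples', 'value' => 3 }, { 'key' => 'pears', 'value' => 7 }]

# Sequence numbers as _ids: no contention, and the docid order is
# irrelevant because lookups go through the key index instead.
seq  = 0
docs = rows.map do |r|
  { '_id' => format('%012d', seq += 1), 'key' => r['key'], 'value' => r['value'] }
end

# The default view that indexes Review DB docs by doc.key, so the
# updater can find and replace the value for a given key.
ddoc = {
  '_id'   => '_design/review',
  'views' => { 'by_key' => { 'map' => 'function(doc) { emit(doc.key, doc.value); }' } }
}

puts JSON.pretty_generate('docs' => docs + [ddoc])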
>>>>>>>
>>>>>>> We discussed this a little at CouchHack and I argued that the
>>>>>>> simplest
>>>>>>> solution is actually good for a few reasons.
>>>>>>>
>>>>>>> The simple solution: provide a mechanism to copy the rows of a
>>>>>>> grouped reduce function to a new database.
>>>>>>>
>>>>>>> Good because it is most like Hadoop/Google style map reduce. In
>>>>>>> that paradigm, the output of a map/reduce job is not incremental,
>>>>>>> and it is persisted in a way that allows for multiple later
>>>>>>> reduce stages to be run on it. It's common in Hadoop to chain
>>>>>>> many m/r stages, and to try a few iterations of each stage while
>>>>>>> developing code.
>>>>>>>
>>>>>>> I like this also because it provides the needed functionality
>>>>>>> without adding any new primitives to CouchDB.
>>>>>>>
>>>>>>> The only downside of this approach is that it is not incremental.
>>>>>>> I'm not sure that incremental chainability has much promise, as
>>>>>>> the index management could be a pain, especially if you have
>>>>>>> branching chains.
>>>>>>>
>>>>>>> Another upside is that by reducing to a db, you give the user
>>>>>>> power to do things like use replication to merge multiple data
>>>>>>> sets before applying more views.
>>>>>>>
>>>>>>> I don't want to discourage anyone from experimenting with code,
>>>>>>> just want to point out this simple solution, which would be Very
>>>>>>> Easy to implement.
>>>>>>>>
>>>>>>>> (*)local: I'm assuming that views are not replicated and need
>>>>>>>> to be recalculated for each CouchDB node. If they are replicated
>>>>>>>> somehow, I think it would still work, but we'd have to look at
>>>>>>>> it a little more.
>>>>>>>>>
>>>>>>>>> With that said, I'd prefer to spend my time extending the view
>>>>>>>>> engine to handle chainable MR workflows in a single shot.
>>>>>>>>> Especially in the simple sort_by_value case, it just seems like
>>>>>>>>> a cleaner way to go about things.
>>>>>>>>
>>>>>>>> Yes, that seems to be the gist of all the replies, and I agree
>>>>>>>> :-)
>>>>>>>>
>>>>>>>> In a nutshell, I'm hoping that:
>>>>>>>> * A review is a new sort of view that has an "inputs" array in
>>>>>>>>   its definition.
>>>>>>>> * Only MR views are allowed as inputs, with no KV duplication
>>>>>>>>   allowed.
>>>>>>>> * It builds a persistent index of the incoming views when those
>>>>>>>>   get updated.
>>>>>>>> * That index is then used to build the view index for the review
>>>>>>>>   when the review gets updated.
>>>>>>>> * I think I covered the most important algorithms needed to
>>>>>>>>   implement this in my original proposal.
>>>>>>>>
>>>>>>>> Does this sound feasible? If so I'll update my proposal accordingly.
>>>>>>>>
>>>>>>>> Wout.
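
For what it's worth, here is the shape Wout's list seems to imply,
mocked up as a Ruby hash; all the member names are guesses at an
eventual syntax, not an implemented feature.

# Hypothetical "review" definition: an inputs array of MR views feeding
# one further map/reduce step over their (key, value) rows.
review_ddoc = {
  '_id'     => '_design/sorted',
  'reviews' => {
    'by_value' => {
      'inputs' => ['base/stage-1'],   # only MR views allowed, no KV duplication
      'map'    => 'function(key, value) { emit(value, key); }',
      'reduce' => nil                 # optional further reduce stage
    }
  }
}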
