couchdb-dev mailing list archives

From Zachary Zolton <zachary.zol...@gmail.com>
Subject Re: Proposal: Review DBs
Date Mon, 27 Apr 2009 18:17:05 GMT
paul

alright... you've gotta give me the remedial explanation of what you
meant here! (sorry, i'm still noob-ish)

so, are you saying that i shouldn't even check for individual doc
updates, but instead just recreate the entire database? that sounds
like a job for cron, more so than the update notifier, right?

i'd put up my current ruby script, but it deals with update notifications
in a way that's very specific to my data (and probably very naïvely, to
boot!)

zach
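
[Editor's sketch] The non-incremental rebuild discussed in this thread (delete the review DB, create a new one, dump the grouped view rows into it) might look roughly like the following. This is a minimal sketch, not anyone's actual script: the function names, the `review_db` database name, and the sample rows are made up for illustration, and the zero-padded sequence `_id` follows Wout's suggestion quoted further down. It only builds the request plan (DELETE and PUT on the database, then a POST to `_bulk_docs`) and leaves sending the requests to whatever HTTP client you prefer.

```python
import json


def rows_to_docs(rows):
    """Turn grouped-reduce rows ({"key": ..., "value": ...}) into review docs."""
    # Assumption: _id is a zero-padded sequence number, one doc per row.
    return [
        {"_id": f"{i:08d}", "key": row["key"], "value": row["value"]}
        for i, row in enumerate(rows)
    ]


def plan_rebuild(review_db, rows):
    """Return the ordered HTTP requests for a full, non-incremental rebuild."""
    body = json.dumps({"docs": rows_to_docs(rows)})
    return [
        ("DELETE", f"/{review_db}", None),            # drop the old review DB
        ("PUT", f"/{review_db}", None),               # create an empty one
        ("POST", f"/{review_db}/_bulk_docs", body),   # dump the new rows
    ]


# Rows shaped like a view response queried with ?group=true (made-up data).
sample_rows = [{"key": "apple", "value": 3}, {"key": "pear", "value": 1}]
plan = plan_rebuild("review_db", sample_rows)
for method, path, body in plan:
    print(method, path, body or "")
```

Because the whole database is replaced on every run, nothing here depends on which individual docs changed, which is why a periodic job would do as well as an update notifier.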

On Mon, Apr 27, 2009 at 11:28 AM, Paul Davis
<paul.joseph.davis@gmail.com> wrote:
> Zachary,
>
> Awesome. The thing with non-incremental updates is that the basic algorithm
> would be to just look for updates to the view and on update, delete the
> review DB, create a new one, and then dump the new data into it. I wouldn't
> try too hard to optimize updates at this point in time.
>
> Getting a ruby script out to show the basics should probably be the first
> step. Beyond that we'll have to take it a step at a time.
>
> HTH,
> Paul Davis
>
>
> Zachary Zolton wrote:
>>
>> @jchris et al,
>>
>> if you have any pointers on how to implement this, i have a strong
>> motivation to try my hand at it.
>>
>> i have a janky ruby script running as an update notifier that looks
>> for certain criteria, idiosyncratic to my data, and puts docs into a
>> derived database. but i'm not terribly happy with my current
>> implementation...
>>
>> is there a general-purpose algorithm for dealing with updates?
>>
>>
>> cheers,
>>
>> zach
>>
>>
>> On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson <jchris@gmail.com> wrote:
>>
>>>
>>> Sent from my iPhone
>>>
>>> On Apr 26, 2009, at 2:26 PM, Wout Mertens <wout.mertens@gmail.com> wrote:
>>>
>>>
>>>>
>>>> Hi Adam,
>>>>
>>>> On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
>>>>
>>>>
>>>>>
>>>>> Hi Wout, thanks for writing this up.
>>>>>
>>>>> One comment about the map-only views:  I think you'll find that Couch
>>>>> has
>>>>> already done a good bit of the work needed to support them, too.  Couch
>>>>> maintains a btree for each design doc keyed on docid that stores all
>>>>> the
>>>>> view keys emitted by the maps over each document.  When a document is
>>>>> updated and then analyzed, Couch has to consult that btree, purge all
>>>>> the
>>>>> KVs associated with the old version of the doc from each view, and then
>>>>> insert the new KVs.  So the tracking information correlating docids and
>>>>> view keys is already available.
>>>>>
>>>>
>>>> See I did not know that :-) Although I should have guessed.
>>>>
>>>> However, in the mail before this one I argued that it doesn't make sense
>>>> to combine or chain map-only views since you can always write a map
>>>> function
>>>> that does it in one step. Do you agree?
>>>>
>>>> You might also know the answer to this: is it possible to make the
>>>> Review
>>>> DB be a sort of view index on the current database? All it needs are
>>>> JSON
>>>> keys and values, no other fields.
>>>>
>>>>
>>>>>
>>>>> You'd still be left with the problem of generating unique docids for
>>>>> the
>>>>> documents in the Review DB, but I think that's a problem that needs to
>>>>> be
>>>>> solved.  The restriction to only MR views with no duplicate keys across
>>>>> views seems too strong to me.
>>>>>
>>>>
>>>> Well, since the Review DB is a local(*) hidden database that's handled a
>>>> bit specially, I think the easiest is to assign _id a sequence number
>>>> and
>>>> create a default view that indexes the documents by doc.key (for
>>>> updating
>>>> the value for that key). There will never be contention and we're only
>>>> interested in the key index.
>>>>
>>>
>>> We discussed this a little at CouchHack and I argued that the simplest
>>> solution is actually good for a few reasons.
>>>
>>> The simple solution: provide a mechanism to copy the rows of a grouped
>>> reduce function to a new database.
>>>
>>> Good because it is most like Hadoop/Google style map reduce. In that
>>> paradigm, the output of a map/reduce job is not incremental, and it is
>>> persisted in a way that allows for multiple later reduce stages to be run
>>> on
>>> it. It's common in Hadoop to chain many m/r stages, and to try a few
>>> iterations of each stage while developing code.
>>>
>>> I like this also because it provides the needed functionality without
>>> adding
>>> any new primitives to CouchDB.
>>>
>>> The only downside of this approach is that it is not incremental. I'm not
>>> sure that incremental chainability has much promise, as the index
>>> management
>>> could be a pain, especially if you have branching chains.
>>>
>>> Another upside is that by reducing to a db, you give the user power to do
>>> things like use replication to merge multiple data sets before applying
>>> more
>>> views.
>>>
>>> I don't want to discourage anyone from experimenting with code, just want
>>> to
>>> point out this simple solution which would be Very Easy to implement.
>>>
>>>
>>>>
>>>> (*)local: I'm assuming that views are not replicated and need to be
>>>> recalculated for each CouchDB node. If they are replicated somehow, I
>>>> think
>>>> it would still work but we'd have to look at it a little more.
>>>>
>>>>
>>>>>
>>>>> With that said, I'd prefer to spend my time extending the view engine
>>>>> to
>>>>> handle chainable MR workflows in a single shot.  Especially in the
>>>>> simple
>>>>> sort_by_value case it just seems like a cleaner way to go about things.
>>>>>
>>>>
>>>> Yes, that seems to be the gist of all repliers and I agree :-)
>>>>
>>>> In a nutshell, I'm hoping that:
>>>> * A review is a new sort of view that has an "inputs" array in its
>>>> definition.
>>>> * Only MR views are allowed as inputs, no KV duplication allowed.
>>>> * It builds a persistent index of the incoming views when those get
>>>> updated.
>>>> * That index is then used to build the view index for the review when
>>>> the
>>>> review gets updated.
>>>> * I think I covered the most important algorithms needed to implement
>>>> this
>>>> in my original proposal.
>>>>
>>>> Does this sound feasible? If so I'll update my proposal accordingly.
>>>>
>>>> Wout.
>>>>
>
>
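
[Editor's sketch] Adam's quoted description of the per-design-doc index keyed on docid can be illustrated with a small in-memory model. This is a toy under stated assumptions, not CouchDB's actual btree code: the `ViewIndex` class, its method names, and the sample map function are all hypothetical, and plain dicts stand in for the btrees. It shows only the bookkeeping he describes: on each document update, look up the keys the old revision emitted, purge those KVs from the view, then insert the new ones.

```python
class ViewIndex:
    """Toy model of one view plus its docid -> emitted-keys back-index."""

    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows = {}      # (key, docid) -> list of emitted values
        self.by_docid = {}  # docid -> set of keys emitted by that doc

    def update(self, docid, doc):
        """Re-index one document; pass doc=None to model a deletion."""
        # Purge every KV the previous version of this doc emitted.
        for key in self.by_docid.pop(docid, set()):
            del self.rows[(key, docid)]
        if doc is None:
            return
        # Insert the KVs emitted by the new version and remember their keys.
        keys = set()
        for key, value in self.map_fn(doc):
            self.rows.setdefault((key, docid), []).append(value)
            keys.add(key)
        if keys:
            self.by_docid[docid] = keys

    def query(self):
        """Return (key, docid, value) rows ordered by key, then docid."""
        return [
            (key, docid, value)
            for (key, docid), values in sorted(self.rows.items())
            for value in values
        ]


def tags_map(doc):
    # Hypothetical map function: emit one row per tag.
    for tag in doc.get("tags", []):
        yield tag, 1


index = ViewIndex(tags_map)
index.update("a", {"tags": ["x", "y"]})
index.update("b", {"tags": ["y"]})
index.update("a", {"tags": ["z"]})  # old rows for "a" are purged first
print(index.query())
```

The `by_docid` mapping is the "tracking information correlating docids and view keys" mentioned in the quoted text; without it, an update would have to scan the whole view to find the stale rows.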
