couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: multi-level views
Date Wed, 03 Jun 2009 20:53:02 GMT
Sorry, was gonna get back to this tonight, but I might as well chime
in now that I've been called out. :)

Firstly, with Cascade I was looking at it last weekend again planning
on how I was gonna make it awesome and do lots of neato things. Then I
realized that the level of complexity I was going to need to add was
going to kill it. Not just kill it, but pulverize the whole thing into
a stream of oozing bytes.

So after that realization I'm planning on making it more like Chris
Anderson's idea of just exposing the primitives for "copy this view
output to db X." while making updates to the derived database not
painfully slow (which should be doable). Then people can use these
primitives to do all of their mapping and chaining and reducing.
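
As an illustration only, here is a minimal in-memory sketch of a "copy this view output to db X" primitive. It does not use any CouchDB or Cascade API: `copy_view_to_db`, the dict standing in for the derived database, and the `repr(key)` id scheme are all hypothetical. It rewrites only the derived documents whose row changed, which is one way to keep updates to the derived database from being painfully slow.

```python
def copy_view_to_db(view_rows, target_db):
    """Mirror (key, value) view rows into target_db, a dict of derived docs.

    Only documents whose content changed are rewritten, and documents whose
    key no longer appears in the view are removed. Returns the ids touched.
    """
    changed = []
    live_ids = set()
    for key, value in view_rows:
        doc_id = repr(key)  # hypothetical id scheme: one derived doc per key
        live_ids.add(doc_id)
        new_doc = {"_id": doc_id, "key": key, "value": value}
        if target_db.get(doc_id) != new_doc:
            target_db[doc_id] = new_doc
            changed.append(doc_id)
    # Drop derived docs for keys that disappeared from the view output.
    for doc_id in list(target_db):
        if doc_id not in live_ids:
            del target_db[doc_id]
            changed.append(doc_id)
    return changed
```

Copying the same rows twice touches nothing the second time, so further views built on the derived database would only see real changes.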

For reference, I've spent a lot of time looking into the problem of
making chainable map/reduce and adding in things like merge phases and
so on. All of this stuff is possible, but I think that the proper plan
of attack is to look forward and make this stuff easier than it
currently is. There are two things coming up that should be important
on this front:

Firstly, there was a ticket opened today on starting to reorganize the
src directories more meaningfully. I posted a note that says that if
we're going to shift how things are, then we should sit down and do it
right and finally split CouchDB up into smaller OTP style applications
that respect the Erlangisms of smaller independent apps. This might
sound unrelated, but as part of this (and point 2) we're given a
perfect chance to make sure that we can start separating the indexing
code so that we can foster a whole bunch of different indexers. That's
something we've always thought would be awesome, but haven't really
had any movement on.

Secondly, talking with Chris Chandler on #couchdb, it's starting to look
like we might want to take the current Map/Reduce indexer and abstract
out the mechanics for how updates are done. The end goal would be to
allow people to write new indexing methods without having to duplicate
the large amount of code (relatively speaking) that's required for
updates.
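
To make the idea concrete, here is a rough sketch of what separating update mechanics from indexing logic could look like. This is not CouchDB's actual design: the `Indexer` base class, the `(seq, doc_id, doc)` change format, and `WordIndexer` are assumptions made up for this example. The base class owns the bookkeeping (which changes have been seen, replacing stale entries), and a new indexer only says how one document is indexed.

```python
class Indexer:
    """Hypothetical base class: shared update mechanics for any indexer."""

    def __init__(self):
        self.last_seq = 0   # highest update sequence already applied
        self.entries = {}   # doc_id -> index entries produced for that doc

    def index_doc(self, doc):
        """Return the index entries for one document (subclasses override)."""
        raise NotImplementedError

    def update(self, changes):
        """Apply (seq, doc_id, doc) changes; doc is None for a deletion."""
        for seq, doc_id, doc in changes:
            if seq <= self.last_seq:
                continue  # already applied, skip
            self.entries.pop(doc_id, None)  # discard stale entries
            if doc is not None:
                self.entries[doc_id] = list(self.index_doc(doc))
            self.last_seq = seq


class WordIndexer(Indexer):
    """Example indexer: the distinct lowercase words in a 'text' field."""

    def index_doc(self, doc):
        return sorted(set(doc.get("text", "").lower().split()))
```

Under this split, a different indexing method would only need its own `index_doc`, without duplicating the update loop.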

So in short, make it easy for people to add more awesomeness to CouchDB.

Paul Davis

On Wed, Jun 3, 2009 at 4:09 PM, Zachary Zolton <zachary.zolton@gmail.com> wrote:
> Yeah... I was gonna recommend Cascade, but I haven't seen any movement
> on Github for quite a while!
>
> Perhaps Paul Davis would like to chime in...? :^q
>
> I've been using an Update Notifier script for this kinda thing so
> far—also, not incrementally—but it's worked well enough for my needs.
> My primary desire would be to do this in a manner such that the
> application code doesn't need to know about the second database.
>
> On Wed, Jun 3, 2009 at 2:29 PM, Chris Anderson <jchris@apache.org> wrote:
>> On Wed, Jun 3, 2009 at 12:03 PM, Justin Balthrop <justin@geni.com> wrote:
>>> Nice! That sounds like exactly what I'm looking for. I don't think it will
>>> address the performance issues with reduce, but it's definitely a start.
>>>
>>> Do you mind sending a diff of your changes to couch_view_updater.erl? I
>>> diffed your file with trunk and there are a bunch of unrelated changes, of
>>> course.
>>
>> There's also Paul Davis's Cascade:
>> http://github.com/davisp/cascade/tree/master
>>
>> I'm planning on writing something with Hovercraft that takes a group
>> reduce query and copies it to another database on demand. It wouldn't
>> try to be incremental, just provide for easy chaining.
>>
>> I think chaining by copying to a db is a good way to work, because it
>> lets you experiment with other views on top of your reduce rows,
>> without regenerating the whole thing.
>>
>> Chris
>>
>>>
>>> Thanks
>>>
>>>
>>> On Jun 3, 2009, at 1:42 AM, Viacheslav Seledkin wrote:
>>>
>>>> Justin Balthrop wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I've been reading the dev and user mailing lists for the past month or
>>>>> so, but haven't posted yet. I've fallen in love with couchdb, its
>>>>> power and simplicity, and I tell everyone who will listen why it is so
>>>>> much better than a relational db for most applications. I now have
>>>>> most of the engineering team at our company on board, and I'm in the
>>>>> process of converting our rails site from postgres to couchdb.
>>>>>
>>>>> So, after spending a few weeks converting models over to using
>>>>> couchdb, there is one feature that we are desperately missing:
>>>>>
>>>>> Multi-level map-reduce in views.
>>>>>
>>>>> We need a way to take the output of reduce and pass it back through
>>>>> another map-reduce step (multiple times in some cases). This way, we
>>>>> could build map-reduce flows that compute (and cache) any complex data
>>>>> computation we need.
>>>>>
>>>>> Our specific use case isn't incredibly important, because multi-level
>>>>> map-reduce could be useful in countless ways, but I'll include it
>>>>> anyway just as illustration. The specific need for us arose from the
>>>>> desire to slice up certain very large documents to make concurrent
>>>>> editing by a huge number of users feasible. Then we started to use a
>>>>> view step to combine the data back into whole documents. This worked
>>>>> really well at first, but we soon found that we needed to run
>>>>> additional queries on those documents. So we were stuck with either:
>>>>>
>>>>> 1) do the queries in the client - meaning we lose all the power and
>>>>> caching of couchdb views; or
>>>>> 2) reinsert the combined documents into another database - meaning we
>>>>> are storing the data twice, and we still have to deal with contention
>>>>> when modifying the compound documents in that database.
>>>>>
>>>>> Multi-level map-reduce would solve this problem perfectly!
>>>>>
>>>>> Multi-level views could also simplify and improve performance for
>>>>> reduce grouping. The reduce itself would work just like Google's
>>>>> map-reduce by only reducing values that have the exact same map key. Then
>>>>> if you want to reduce further, you can just use another map-reduce
>>>>> step on top of that with the map emitting a different key so the
>>>>> reduce data will be grouped differently. For example, if you wanted a
>>>>> count of posts per user and total posts, you would implement it as a
>>>>> two-level map-reduce with the key=user_id for map1 and the key=null
>>>>> for map2.
>>>>>
>>>>> This way, you only calculate reduce values for groupings you care
>>>>> about, and any particular reduce value is immediately available from
>>>>> the cached B+tree values without further computation. There is more
>>>>> burden on the user to specify ahead of time which groupings they need,
>>>>> but the performance and flexibility would be well worth it. This
>>>>> eliminates the need to store reduce values internally in the map
>>>>> B+tree. But it does mean that you would need a B+tree for each reduce
>>>>> grouping to keep incremental reduce updates fast. The improved
>>>>> performance comes from the fact that view queries would never need to
>>>>> aggregate reduce values across multiple nodes or do any re-reducing.
>>>>>
>>>>> Does this make sense? What do you guys think? Have you discussed the
>>>>> possibility of such a feature?
>>>>>
>>>>> I'd be happy to discuss it further and even help with the
>>>>> implementation, though I've only done a little bit of coding in
>>>>> Erlang. I'm pretty sure this would mean big changes to the couchdb
>>>>> internals, so I want to get your opinions and criticisms before I get
>>>>> my hopes up or dive into any coding.
>>>>>
>>>>> Cheers,
>>>>> Justin Balthrop
>>>>>
>>>>> .
>>>>>
>>>>>
>>>> Possible solution, I use it in my production ...
>>>> https://issues.apache.org/jira/browse/COUCHDB-249
>>>
>>>
>>
>>
>>
>> --
>> Chris Anderson
>> http://jchrisa.net
>> http://couch.io
>>
>
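
For readers following the thread, Justin's posts-per-user example can be sketched as two chained map-reduce passes. This is a plain in-memory illustration, not CouchDB view code: `map_reduce` and the sample `posts` data are hypothetical, and the reduce here only combines values that share the exact same map key, as the original message proposes.

```python
from collections import defaultdict


def map_reduce(rows, map_fn, reduce_fn):
    """Group the (key, value) pairs emitted by map_fn, then reduce per key."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            groups[key].append(value)
    return [(key, reduce_fn(values)) for key, values in groups.items()]


# Made-up sample data for the illustration.
posts = [{"user_id": "alice"}, {"user_id": "bob"}, {"user_id": "alice"}]

# Level 1: map1 emits key=user_id, so the reduce counts posts per user.
per_user = map_reduce(posts, lambda post: [(post["user_id"], 1)], sum)

# Level 2: map2 re-emits each level-1 row under key=None, so the reduce
# collapses the per-user counts into a single total.
total = map_reduce(per_user, lambda row: [(None, row[1])], sum)
```

Each level keeps its own grouping, so a per-user count or the grand total can be read directly from the corresponding level's output.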
