From: Justin Balthrop
To: dev@couchdb.apache.org
Subject: Re: multi-level views
Date: Wed, 3 Jun 2009 13:24:12 -0700

Hmm... I'm not sure how performance would be with copying to a new
database (option 2 below). We were seriously considering this option for
our application, but I'm afraid of the contention from simultaneous
writes. I guess I won't know for sure unless I try it. If I were going
to go down this road, though, I would almost rather just store the full
document than slice it up in the first place. But then will the lack of
partial updates kill us? These documents can get pretty big, and we
would have to transfer the entire document for every modification. Or am
I overestimating the cost of this?

One reason I like chainable map/reduce is that it eliminates the need
for partial updates entirely. Just slice your documents up into whatever
atomic pieces give you the best concurrent performance, and then
assemble them back into whole documents in the first map/reduce phase.

I'm also still worried about reduce performance. If I have a map key
that has thousands of values, then all of those need to be reduced every
time a document adds or removes a value for that key. If the values on
the leaf nodes were each stored in another btree (this could be based on
a threshold, so you don't create btrees for very small sets of values),
then only a small number of reduces would be necessary when adding or
removing map values for a given key. And whenever a reduce value
changes, you just cascade that modification to the next map step, just
as you would if a document value changed.
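[The slice-and-reassemble idea can be illustrated with a small
client-side simulation. CouchDB has no built-in view chaining (that is
the point of this thread), so this only sketches the data flow, not
CouchDB's API; the document shapes, field names, and the `run` helper
are all invented for illustration.]

```javascript
// Slices of one large document, stored separately so many users can
// edit concurrently without fighting over a single big document.
const slices = [
  { _id: "p1:name", parent: "p1", field: "name", value: "Ada" },
  { _id: "p1:bio",  parent: "p1", field: "bio",  value: "likes btrees" },
  { _id: "p2:name", parent: "p2", field: "name", value: "Bob" },
];

// Phase-1 map: key every slice by its parent document id.
// (Real CouchDB map functions use a global emit(); here it is a
// parameter so the sketch stays self-contained.)
function map1(doc, emit) {
  emit(doc.parent, { [doc.field]: doc.value });
}

// Phase-1 reduce: merge all slices sharing a key back into one whole
// document. Rereduce would work the same way, since partial merges
// are themselves plain objects.
function reduce1(keys, values) {
  return Object.assign({}, ...values);
}

// Tiny driver: collect emitted rows, group by key, reduce each group.
function run(docs, map, reduce) {
  const rows = [];
  for (const doc of docs) map(doc, (k, v) => rows.push([k, v]));
  const groups = new Map();
  for (const [k, v] of rows) {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  }
  return [...groups].map(([k, vs]) => [k, reduce([k], vs)]);
}

const whole = run(slices, map1, reduce1);
console.log(whole);
// e.g. [["p1", { name: "Ada", bio: "likes btrees" }], ["p2", { name: "Bob" }]]
```

[A second chained phase would then take `whole` as its input, which is
exactly the chaining being asked for here.]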
This should allow you to experiment with subsequent steps all you want
without ever regenerating the results for the previous map/reduce steps.

Justin

On Jun 3, 2009, at 12:29 PM, Chris Anderson wrote:

> On Wed, Jun 3, 2009 at 12:03 PM, Justin Balthrop wrote:
>> Nice! That sounds like exactly what I'm looking for. I don't think it
>> will address the performance issues with reduce, but it's definitely
>> a start.
>>
>> Do you mind sending a diff of your changes to couch_view_updater.erl?
>> I diffed your file with trunk and there are a bunch of unrelated
>> changes, of course.
>
> There's also Paul Davis's Cascade:
> http://github.com/davisp/cascade/tree/master
>
> I'm planning on writing something with Hovercraft that takes a group
> reduce query and copies it to another database on demand. It wouldn't
> try to be incremental, just provide for easy chaining.
>
> I think chaining by copying to a db is a good way to work, because it
> lets you experiment with other views on top of your reduce rows,
> without regenerating the whole thing.
>
> Chris
>
>> Thanks
>>
>> On Jun 3, 2009, at 1:42 AM, Viacheslav Seledkin wrote:
>>
>>> Justin Balthrop wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> I've been reading the dev and user mailing lists for the past month
>>>> or so, but haven't posted yet. I've fallen in love with CouchDB,
>>>> its power and simplicity, and I tell everyone who will listen why
>>>> it is so much better than a relational db for most applications. I
>>>> now have most of the engineering team at our company on board, and
>>>> I'm in the process of converting our Rails site from Postgres to
>>>> CouchDB.
>>>>
>>>> So, after spending a few weeks converting models over to using
>>>> CouchDB, there is one feature that we are desperately missing:
>>>>
>>>> Multi-level map-reduce in views.
>>>>
>>>> We need a way to take the output of reduce and pass it back through
>>>> another map-reduce step (multiple times in some cases).
>>>> This way, we could build map-reduce flows that compute (and cache)
>>>> any complex data computation we need.
>>>>
>>>> Our specific use case isn't incredibly important, because
>>>> multi-level map-reduce could be useful in countless ways, but I'll
>>>> include it anyway just as illustration. The specific need for us
>>>> arose from the desire to slice up certain very large documents to
>>>> make concurrent editing by a huge number of users feasible. Then we
>>>> started to use a view step to combine the data back into whole
>>>> documents. This worked really well at first, but we soon found that
>>>> we needed to run additional queries on those documents. So we were
>>>> stuck with either:
>>>>
>>>> 1) do the queries in the client - meaning we lose all the power and
>>>> caching of CouchDB views; or
>>>> 2) reinsert the combined documents into another database - meaning
>>>> we are storing the data twice, and we still have to deal with
>>>> contention when modifying the compound documents in that database.
>>>>
>>>> Multi-level map-reduce would solve this problem perfectly!
>>>>
>>>> Multi-level views could also simplify and improve performance for
>>>> reduce grouping. The reduce itself would work just like Google's
>>>> map-reduce by only reducing values that have the exact same map
>>>> key. Then if you want to reduce further, you can just use another
>>>> map-reduce step on top of that with the map emitting a different
>>>> key so the reduce data will be grouped differently. For example, if
>>>> you wanted a count of posts per user and total posts, you would
>>>> implement it as a two-level map-reduce with key=user_id for map1
>>>> and key=null for map2.
>>>>
>>>> This way, you only calculate reduce values for groupings you care
>>>> about, and any particular reduce value is immediately available
>>>> from the cached B+tree values without further computation.
>>>> There is more burden on the user to specify ahead of time which
>>>> groupings they need, but the performance and flexibility would be
>>>> well worth it. This eliminates the need to store reduce values
>>>> internally in the map B+tree. But it does mean that you would need
>>>> a B+tree for each reduce grouping to keep incremental reduce
>>>> updates fast. The improved performance comes from the fact that
>>>> view queries would never need to aggregate reduce values across
>>>> multiple nodes or do any re-reducing.
>>>>
>>>> Does this make sense? What do you guys think? Have you discussed
>>>> the possibility of such a feature?
>>>>
>>>> I'd be happy to discuss it further and even help with the
>>>> implementation, though I've only done a little bit of coding in
>>>> Erlang. I'm pretty sure this would mean big changes to the CouchDB
>>>> internals, so I want to get your opinions and criticisms before I
>>>> get my hopes up or dive into any coding.
>>>>
>>>> Cheers,
>>>> Justin Balthrop
>>>>
>>> Possible solution, I use it in production ...
>>> https://issues.apache.org/jira/browse/COUCHDB-249
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io
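[The two-level grouping example quoted above (posts per user with
key=user_id, then total posts with key=null) can be made concrete with
a minimal client-side simulation. The doc shapes and the `run` helper
are invented, and real CouchDB views don't chain like this — that is
the feature being requested.]

```javascript
const posts = [
  { _id: "a", user_id: "u1" },
  { _id: "b", user_id: "u1" },
  { _id: "c", user_id: "u2" },
];

// Tiny driver: collect emitted rows, group by key, reduce each group.
function run(input, map, reduce) {
  const rows = [];
  for (const item of input) map(item, (k, v) => rows.push([k, v]));
  const groups = new Map();
  for (const [k, v] of rows) {
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(v);
  }
  return [...groups].map(([k, vs]) => [k, reduce([k], vs)]);
}

// Level 1: key = user_id, reduce = sum -> one row per user.
const perUser = run(posts,
  (doc, emit) => emit(doc.user_id, 1),
  (keys, values) => values.reduce((a, b) => a + b, 0));

// Level 2: feed the level-1 rows through another map with key = null,
// so the same sum reduce collapses everything into a single total.
const total = run(perUser,
  (row, emit) => emit(null, row[1]),
  (keys, values) => values.reduce((a, b) => a + b, 0));

console.log(perUser); // [["u1", 2], ["u2", 1]]
console.log(total);   // [[null, 3]]
```

[Each grouping gets its own reduce pass, so a query for either level
reads a precomputed row instead of re-reducing across the map tree.]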