From Sho Fukamachi
Subject Re: when to use another document and when not to?
Date Tue, 05 Aug 2008 18:43:14 GMT

On 05/08/2008, at 3:49 PM, Chris Anderson wrote:

> I think the missing link here is the ability to "remap" map and
> map/reduce results. In Hadoop-style map/reduce, the output of a single
> map will often be remapped in different ways for different purposes.
> Being able to share the intermediate results among further
> reprocessing is helpful, and often people will chain long stretches of
> map reduce processing.

Well, that would be absolutely fantastic if it came to pass. I didn't  
really think it was on the roadmap anytime soon though.

> The challenge for the CouchDB programming model for supporting chained
> map/reduces is the cache-expiry issue. How can we tell which index
> entries to sweep when a document is changed or deleted, when that
> index is itself generated by running map/reduce over another index? I
> tell myself that the bookkeeping is possible, but it sure sounds like
> a big job.

Hm. At the risk of exposing myself (accurately) as someone who has no
grasp whatsoever of the complexities of such things - could an
approach similar to the current _rev system be used?

Perhaps you could have two levels of revisions - one for the view as a
whole, which would change whenever anything in the view changed. That
would signal to the re-reducing view that it needs to go and look at
the source view again.

And then a second level of revisions could be on the individual key
"row" output. The re-reduce could then just look at the rows that
changed - it would simply drop any revs it had that didn't appear in
the new listing, and import any revs in the new listing that it didn't
already have - that would handle additions/removals as well.
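
To make that a bit more concrete, here is a rough sketch of the
two-level idea in plain Python. None of this is CouchDB's actual API or
how _rev is really computed - SourceView, DerivedIndex, row_rev and
view_rev are all names I made up for illustration, and the "revs" are
assumed to be simple content hashes.

```python
import hashlib
import json


def row_rev(key, value):
    # Hypothetical per-row revision: a hash of the row's key and value.
    payload = json.dumps([key, value], sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def view_rev(row_revs):
    # Hypothetical view-level revision: a hash over every row revision,
    # so it changes whenever any row is added, changed or removed.
    joined = "".join(rev for _, rev in sorted(row_revs.items()))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


class SourceView:
    """Stands in for the output of a first map/reduce pass."""

    def __init__(self, rows):
        self.rows = dict(rows)  # keys are assumed to be strings

    def listing(self):
        # Return (view-level rev, {key: row-level rev}).
        revs = {k: row_rev(k, v) for k, v in self.rows.items()}
        return view_rev(revs), revs


class DerivedIndex:
    """Stands in for a second pass that consumes the source view."""

    def __init__(self):
        self.seen_view_rev = None
        self.seen_row_revs = {}
        self.rows = {}

    def sync(self, source):
        # Level one: if the view-level rev is unchanged, do nothing.
        rev, revs = source.listing()
        if rev == self.seen_view_rev:
            return set()
        # Level two: compare row revs to find what to drop and import.
        removed = set(self.seen_row_revs) - set(revs)
        changed = {k for k, r in revs.items()
                   if self.seen_row_revs.get(k) != r}
        for k in removed:
            del self.rows[k]
        for k in changed:
            self.rows[k] = source.rows[k]
        self.seen_view_rev, self.seen_row_revs = rev, revs
        return removed | changed  # keys the re-reduce must revisit


source = SourceView({"alice": 3, "bob": 5})
index = DerivedIndex()
index.sync(source)            # first sync imports both rows
source.rows["bob"] = 6        # a changed row
del source.rows["alice"]      # a removed row
source.rows["carol"] = 1      # an added row
touched = index.sync(source)  # only these keys need reprocessing
```

The set that sync() returns is the list of keys a re-reduce would have
to revisit; everything else could be left alone. Whether that
bookkeeping is cheap enough over a real index is exactly the part I
can't judge.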

I'm probably oversimplifying things? Basically just trying to think of  
"the simplest thing that could possibly work" ...

> I have a prototype of remapping (with no cache-awareness) in
> CouchRest's git repo

Thanks - I'd been reading that anyway after noticing the bump to  
0.9.0. Looks great, will try it out!

> You're making sense, but I also wouldn't mind code examples :)

Sure, if you can stomach my awful code ...

In that (unedited, confused, messy) example I create and utilise two  
methods of getting at the membership data. The first is the one I
discussed, i.e. caching it all in the Membership class. The second is to
then place those caches in the "remote" record itself. Well, basically
I just try lots of things, and it's all there if you can stand huge  
messes of experimentation : )

If you want to run it, you'll need edge DataMapper - as in, from the  
last 24 hours. This is probably the easiest way to get it:

Hope that's useful for someone.


> -- 
> Chris Anderson

