couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon Metson <simonmet...@googlemail.com>
Subject Re: Parallel view generation (was Re: replicator options)
Date Wed, 27 Jan 2010 11:18:57 GMT
Hey Chris,

>> The main issue here is how good your view server is; can it take  
>> getting
>> 1000's of responses at once? An HTTP view response would be nice...  
>> I'm
>> pretty sure that CouchDB could handle getting all the requests from  
>> workers.
>> I think this could also allow for view of view processing, without  
>> going
>> through/maintaining an intermediate database.
>
> The reason we haven't implemented something like this yet is that it
> assumes that your bottleneck is CPU-time, and that it's worth it to
> move docs across a cluster to be processed, then return the rows to
> the original Couch for indexing.
>
> This might help a little bit in cases where your map function is very
> CPU intensive, but you aren't going to get 8x faster by using 8 boxes,
> because the bottleneck will quickly become updating the view index
> file on the original Couch.
>
> Partitioning (a cluster of say 8 Couches where each Couch has 1/8th
> the data) will speed your view generation up by roughly 8x (at the
> expense of slightly higher http-query overhead.) In this approach, the
> map reduce model on an individual of one of those Couches isn't any
> different than it is today. The functions are run close to the data
> (no map reduce network overhead), and the rows are stored in 1 index
> per Couch, which is what allows the 8x speedup.
>
> Does this make sense?

Sure does. However, say I have a cluster of a few thousand machines  
(which I do) I'm not sure that will work. The map/reduce would be very  
CPU intensive (a map would be a simulation, lasting minutes per  
document, seeded from values in a couch document). The machines are in  
a traditional batch system, so setting up long-lived services on them  
(e.g. an instance of CouchDB) isn't going to fly. The documents and  
results are small but numerous (millions). I agree I won't get a  
linear scaling with the number of machines, but it might make  
something not possible in a standard couch cluster (because the map  
will take millions of minutes) possible.

Maybe Couch isn't a good fit for this and I should use Hadoop for  
everything, but it's fun to find out :)
Cheers
Simon

Mime
View raw message