couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Anderson <jch...@apache.org>
Subject Re: Parallel view generation (was Re: replicator options)
Date Mon, 25 Jan 2010 18:10:55 GMT
On Mon, Jan 25, 2010 at 3:10 AM, Simon Metson
<simonmetson@googlemail.com> wrote:
> Hi,
>        This is OT for the original discussion imho
>
> On 25 Jan 2010, at 01:16, Glenn Rempe wrote:
>
>>> I'd be interested if anyone with partitioned CouchDB query experience
>>> (Lounger or otherwise) can comment on view generation time when
>>> parallelized across multiple machines.
>>>
>>>
>> I would also be interested in seeing any architectures that make use of
>> this
>> to parallelize view generation.  I'm not sure your example of Hadoop or
>> Google M/R are really valid because they provide file system abstractions
>> (e.g. Hadoop FS) for automatically streaming a single copy of the data to
>> where it is needed to be Mapped/Reduced and CouchDB has nothing similar.
>
> IMHO something like HDFS isn't needed, since there's already a simple,
> scalable way of getting at the data. What I'd like (to have time to work
> on...) is the following:
>
> 1. be able to configure a pipeline of documents that are sent to the view
> server
>        1a. be able to set the size of that pipeline to 0, which just sends a
> sane header (there are N documents in the database)
> 2. view server spawns off child processes (I'm thinking Disco, but Hadoop
> would be able to do the same) on the various worker nodes
> 3. each worker is given a range of documents to process, pulls these in from
> _all_docs
> 4. worker processes its portion of the database
> 5. worker returns its results to the view server which aggregates them up
> into the final view
>
> The main issue here is how good your view server is; can it take getting
> 1000's of responses at once? An HTTP view response would be nice... I'm
> pretty sure that CouchDB could handle getting all the requests from workers.
> I think this could also allow for view of view processing, without going
> through/maintaining an intermediate database.

The reason we haven't implemented something like this yet is that it
assumes that your bottleneck is CPU-time, and that it's worth it to
move docs across a cluster to be processed, then return the rows to
the original Couch for indexing.

This might help a little bit in cases where your map function is very
CPU intensive, but you aren't going to get 8x faster by using 8 boxes,
because the bottleneck will quickly become updating the view index
file on the original Couch.

Partitioning (a cluster of say 8 Couches where each Couch has 1/8th
the data) will speed your view generation up by roughly 8x (at the
expense of slightly higher http-query overhead.) In this approach, the
map reduce model on an individual of one of those Couches isn't any
different than it is today. The functions are run close to the data
(no map reduce network overhead), and the rows are stored in 1 index
per Couch, which is what allows the 8x speedup.

Does this make sense?

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Mime
View raw message