couchdb-dev mailing list archives

From: Chris Anderson <jch...@apache.org>
Subject: Re: replicator options
Date: Sun, 24 Jan 2010 22:11:23 GMT

On Sun, Jan 24, 2010 at 2:04 PM, Glenn Rempe <glenn@rempe.us> wrote:
> On Sun, Jan 24, 2010 at 12:09 AM, Chris Anderson <jchris@apache.org> wrote:
>
>> Devs,
>>
>> I've been thinking there are a few simple options that would magnify
>> the power of the replicator a lot.
>>
>> ...
>> The fun one is chained map reduce. It occurred to me the other night
>> that the simplest way to present a chainable map reduce abstraction to
>> users is through the replicator. The action "copy these view rows to a
>> new db" is a natural fit for the replicator. I imagine this would be
>> super useful to people doing big messy data munging, and it wouldn't
>> be too hard for the replicator to handle.
>>
>>
> I like this idea as well, as chainable map/reduce has been something I think
> a lot of people would like to use.  The thing I am concerned about, and
> which is related to another ongoing thread, is the size of views on disk and
> the slowness of generating them.  I fear that we would end up ballooning
> views on disk to a size that is unmanageable if we chained them.  I have an
> app in production with 50m rows, whose DB has grown to >100GB, and the views
> take up approx 800GB (!). I don't think I could afford the disk space to
> even consider using this especially when you consider that in order to
> compact a DB or view you need roughly 2x the disk space of the files on
> disk.
>
> I also worry about the time to generate chained views, when the time needed
> for generating views currently is already a major weak point of CouchDB
> (Generating my views took more than a week).
>
> In practice, I think only those with relatively small DBs would be able to
> take advantage of this feature.
>

For large data, you'll want a cluster. The same holds true for other
Map Reduce frameworks like Hadoop or Google's stuff.

I'd be interested if anyone with partitioned CouchDB query experience
(Lounger or otherwise) can comment on view generation time when
parallelized across multiple machines.
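
For anyone who wants something concrete to poke at, here is a minimal
sketch of the "copy these view rows to a new db" step quoted above. It is
not how the replicator does or would do it; it only shows one possible
shape for the data. The function names (rows_to_docs, bulk_docs_payload)
and the choice of a SHA-1 of the row key as the document _id are
assumptions made for illustration. The only CouchDB behavior relied on is
that view rows carry "key" and "value" fields and that _bulk_docs accepts
a JSON body of the form {"docs": [...]}.

```python
import hashlib
import json


def rows_to_docs(rows):
    """Turn view rows ({"key": ..., "value": ...}) into docs for a target db.

    Hypothetical helper: the _id scheme below is an assumption, not
    anything the replicator defines.
    """
    docs = []
    for row in rows:
        # Derive a deterministic _id from the row key, so copying the same
        # row twice targets the same document instead of creating a new one.
        key_json = json.dumps(row["key"], sort_keys=True)
        doc_id = hashlib.sha1(key_json.encode("utf-8")).hexdigest()
        docs.append({"_id": doc_id, "key": row["key"], "value": row["value"]})
    return docs


def bulk_docs_payload(rows):
    """Build the JSON body for a POST to the target db's _bulk_docs."""
    # Updating documents that already exist would also require their
    # current _rev; this sketch only covers the initial copy.
    return json.dumps({"docs": rows_to_docs(rows)})


if __name__ == "__main__":
    # Rows shaped like the output of a grouped reduce query.
    sample_rows = [
        {"key": ["2010", "01"], "value": 3},
        {"key": ["2010", "02"], "value": 5},
    ]
    print(bulk_docs_payload(sample_rows))
```

A second set of views defined on the target db would then be the next
stage of the chain. The sketch does nothing about the disk space and
build time concerns raised above.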

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io
