couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: replicator options
Date Mon, 25 Jan 2010 01:27:13 GMT
On Sun, Jan 24, 2010 at 8:16 PM, Glenn Rempe <glenn@rempe.us> wrote:
> On Sun, Jan 24, 2010 at 2:11 PM, Chris Anderson <jchris@apache.org> wrote:
>
>> On Sun, Jan 24, 2010 at 2:04 PM, Glenn Rempe <glenn@rempe.us> wrote:
>> > On Sun, Jan 24, 2010 at 12:09 AM, Chris Anderson <jchris@apache.org>
>> wrote:
>> >
>> >> Devs,
>> >>
>> >> I've been thinking there are a few simple options that would magnify
>> >> the power of the replicator a lot.
>> >>
>> >> ...
>> >> The fun one is chained map reduce. It occurred to me the other night
>> >> that simplest way to present a chainable map reduce abstraction to
>> >> users is through the replicator. The action "copy these view rows to a
>> >> new db" is a natural fit for the replicator. I imagine this would be
>> >> super useful to people doing big messy data munging, and it wouldn't
>> >> be too hard for the replicator to handle.
>> >>
>> >>
>> > I like this idea as well, as chainable map/reduce has been something I
>> think
>> > a lot of people would like to use.  The thing I am concerned about, and
>> > which is related to another ongoing thread, is the size of views on disk
>> and
>> > the slowness of generating them.  I fear that we would end up ballooning
>> > views on disk to a size that is unmanageable if we chained them.  I have
>> an
>> > app in production with 50m rows, whose DB has grown to >100GB, and the
>> views
>> > take up approx 800GB (!). I don't think I could afford the disk space to
>> > even consider using this especially when you consider that in order to
>> > compact a DB or view you need roughly 2x the disk space of the files on
>> > disk.
>> >
>> > I also worry about the time to generate chained views, when the time
>> needed
>> > for generating views currently is already a major weak point of CouchDB
>> > (Generating my views took more than a week).
>> >
>> > In practice, I think only those with relatively small DB's would be able
>> to
>> > take advantage of this feature.
>> >
>>
>> For large data, you'll want a cluster. The same holds true for other
>> Map Reduce frameworks like Hadoop or Google's stuff.
>>
>>
>
> That would not resolve the issue I mentioned where views can be a multiple
> in size of the original data DB.  I have about 9 views in a design doc, and
> my resultant view files on disk are about 9x the size of the original DB
> data.
>
> How would sharding this across multiple DBs in a cluster resolve this?  You
> would still end up with views that are some multiple in size of their
> original sharded DB. Compounded by how many replicas you have of that view
> data for chained M/R.
>
>
>> I'd be interested if anyone with partitioned CouchDB query experience
>> (Lounger or otherwise) can comment on view generation time when
>> parallelized across multiple machines.
>>
>>
> I would also be interested in seeing any architectures that make use of this
> to parallelize view generation.  I'm not sure your example of Hadoop or
> Google M/R are really valid because they provide file system abstractions
> (e.g. Hadoop FS) for automatically streaming a single copy of the data to
> where it is needed to be Mapped/Reduced and CouchDB has nothing similar.
>
> http://hadoop.apache.org/common/docs/current/hdfs_design.html
>
> Don't get me wrong, I would love to see these things happen, I just wonder
> if there are other issues that need to be resolved first before this is
> practical for anything but a small dataset.
>

Theoretically the bigger your view is, the more inflated it'll become.
Those would be some interesting numbers. Seems like they'd be helpful
in choosing shard sizes and so on.

Mime
View raw message