couchdb-dev mailing list archives

From Zachary Zolton <zachary.zol...@gmail.com>
Subject Re: replicator options
Date Mon, 25 Jan 2010 16:28:35 GMT
Having the replicator handle chaining views would really help people
who are already hacking this together with scripts. So, I'd definitely
+1 the idea. Isn't view size and indexing time a separate problem from
designing this replicator API?
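For reference, the script-based chaining I mean usually boils down to: query a view, turn its rows into docs, and bulk-write them into a second db so another design doc can map/reduce over them. A minimal sketch below — the URL layout, function names, and the keyed-`_id` scheme are my own assumptions for illustration, not any official API:

```python
# Sketch of chained map/reduce done by hand with a script:
# read rows from a source db's view, copy them into a target db.
import json
import urllib.request

def rows_to_docs(view_rows):
    """Convert view rows into docs for the target db.

    Using the (JSON-serialized) view key as the _id makes repeated
    copies idempotent: re-running the script updates existing docs
    instead of piling up duplicates.
    """
    return [
        {"_id": json.dumps(row["key"]), "key": row["key"], "value": row["value"]}
        for row in view_rows
    ]

def copy_view(source_db_url, view_path, target_db_url):
    """Fetch all rows of a view and POST them to the target db's _bulk_docs."""
    with urllib.request.urlopen("%s/%s" % (source_db_url, view_path)) as resp:
        rows = json.load(resp)["rows"]
    req = urllib.request.Request(
        "%s/_bulk_docs" % target_db_url,
        data=json.dumps({"docs": rows_to_docs(rows)}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

(Conflict handling on re-copy is omitted here; a real script would fetch existing `_rev`s first.) Having the replicator do this natively would replace exactly this kind of glue.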

On Sun, Jan 24, 2010 at 9:47 PM, Chris Anderson <jchris@apache.org> wrote:
> On Sun, Jan 24, 2010 at 5:16 PM, Glenn Rempe <glenn@rempe.us> wrote:
>> On Sun, Jan 24, 2010 at 2:11 PM, Chris Anderson <jchris@apache.org> wrote:
>>
>>> On Sun, Jan 24, 2010 at 2:04 PM, Glenn Rempe <glenn@rempe.us> wrote:
>>> > On Sun, Jan 24, 2010 at 12:09 AM, Chris Anderson <jchris@apache.org>
>>> wrote:
>>> >
>>> >> Devs,
>>> >>
>>> >> I've been thinking there are a few simple options that would magnify
>>> >> the power of the replicator a lot.
>>> >>
>>> >> ...
>>> >> The fun one is chained map reduce. It occurred to me the other night
>>> >> that the simplest way to present a chainable map/reduce abstraction
>>> >> to users is through the replicator. The action "copy these view rows
>>> >> to a new db" is a natural fit for the replicator. I imagine this
>>> >> would be super useful to people doing big, messy data munging, and
>>> >> it wouldn't be too hard for the replicator to handle.
>>> >>
>>> >>
>>> > I like this idea as well, as chainable map/reduce has been something
>>> > I think a lot of people would like to use. The thing I am concerned
>>> > about, and which is related to another ongoing thread, is the size
>>> > of views on disk and the slowness of generating them. I fear that
>>> > chaining views would balloon them to an unmanageable size on disk. I
>>> > have an app in production with 50m rows, whose DB has grown to
>>> > >100GB, and the views take up approx 800GB (!). I don't think I
>>> > could afford the disk space to even consider using this, especially
>>> > when you consider that in order to compact a DB or view you need
>>> > roughly 2x the disk space of the files on disk.
>>> >
>>> > I also worry about the time to generate chained views, when view
>>> > generation is already a major weak point of CouchDB (generating my
>>> > views took more than a week).
>>> >
>>> > In practice, I think only those with relatively small DBs would be
>>> > able to take advantage of this feature.
>>> >
>>>
>>> For large data, you'll want a cluster. The same holds true for other
>>> Map Reduce frameworks like Hadoop or Google's stuff.
>>>
>>>
>>
>> That would not resolve the issue I mentioned where views can be a multiple
>> in size of the original data DB.  I have about 9 views in a design doc, and
>> my resultant view files on disk are about 9x the size of the original DB
>> data.
>>
>> How would sharding this across multiple DBs in a cluster resolve this?  You
>> would still end up with views that are some multiple in size of their
>> original sharded DB. Compounded by how many replicas you have of that view
>> data for chained M/R.
>>
>>
>>> I'd be interested if anyone with partitioned CouchDB query experience
>>> (Lounger or otherwise) can comment on view generation time when
>>> parallelized across multiple machines.
>>>
>>>
>> I would also be interested in seeing any architectures that make use of
>> this to parallelize view generation.  I'm not sure your examples of Hadoop
>> or Google M/R are really valid, because they provide file system
>> abstractions (e.g. HDFS) for automatically streaming a single copy of the
>> data to where it needs to be mapped/reduced, and CouchDB has nothing
>> similar.
>>
>> http://hadoop.apache.org/common/docs/current/hdfs_design.html
>>
>> Don't get me wrong, I would love to see these things happen, I just wonder
>> if there are other issues that need to be resolved first before this is
>> practical for anything but a small dataset.
>>
>
> I know Hadoop and Couch are dissimilar, but the way to parallelize
> CouchDB view generation is with a partitioned cluster like
> CouchDB-Lounge or the Cloudant stuff.
>
> It doesn't help much with the size inefficiencies but will help with
> generation time.
>
> Chris
>
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io
>
