couchdb-user mailing list archives

From "Chris Anderson" <jch...@grabb.it>
Subject Re: couchdb and a large, large database
Date Mon, 08 Sep 2008 18:42:16 GMT
On Mon, Sep 8, 2008 at 11:25 AM, william kinney
<william.kinney@gmail.com> wrote:
> Hi,
>
> I was just wondering if anyone has had any experience with a couchdb
> database of at least 100GB, maybe 2 TB? We are thinking of using it to store
> some crawled data, but are unsure about the scalability of reading it after
> it's been populated. After indexing it at 20MB/1901 records (about 2
> minutes), it takes a good 12 seconds to start returning the data on a
> generic server.

CouchDB should be able to handle that much data in terms of raw documents.

Are you making a view request that is taking that much time? Document
requests should be very fast even with vast databases.
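
For example, fetching a single document is just an HTTP GET by ID. A
rough sketch of that (assuming a local CouchDB; the "crawl" database
and "page-12345" ID are made-up names for illustration):

    import json, urllib.request

    # Fetch one document by its ID. This is a direct lookup in the
    # database's by-ID index, so it stays fast regardless of how big
    # the database gets. Names here are hypothetical.
    with urllib.request.urlopen("http://127.0.0.1:5984/crawl/page-12345") as resp:
        doc = json.load(resp)
    print(doc)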

Views (defined in design docs, not temp views) are built by running
the map function over each document in turn, so generation time is
linear in the number of documents. Once the views are generated,
though, query time should be wicked quick.
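
As a rough sketch of what that looks like: a permanent view lives in a
design document whose map function is JavaScript source stored as a
string, and queries after the index is built read the precomputed
index rather than re-running the map function. The database and view
names below are hypothetical, and the URL layout follows current
CouchDB releases:

    import json, urllib.request

    # A design document holding one permanent view. CouchDB runs the
    # JavaScript map function once per document to build the index.
    # "crawl", "pages", and "by_domain" are made-up names.
    design = {
        "_id": "_design/pages",
        "views": {
            "by_domain": {
                "map": "function(doc) { if (doc.domain) emit(doc.domain, 1); }"
            }
        }
    }
    req = urllib.request.Request(
        "http://127.0.0.1:5984/crawl/_design/pages",
        data=json.dumps(design).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))

    # Querying the view reads the stored index; it does not re-run the
    # map function over every document.
    url = ("http://127.0.0.1:5984/crawl/_design/pages"
           "/_view/by_domain?key=%22example.com%22")
    with urllib.request.urlopen(url) as resp:
        print(json.load(resp))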

There is not yet a facility for parallelizing view generation across
nodes, but it is on the roadmap.

I'm currently working with databases of a few hundred thousand
documents (from a directed web crawl) and view generation is on the
order of hours. However, I only have to pay that cost when I redefine
the map functions. Once the views are built, adding new data and
incrementally updating them is linear in the amount of new data
you've added.

Maybe we can talk about web-spidering issues when I'm in NY. We use
Nutch/Hadoop to gather data, and have a Hadoop Streaming job that
uses Ruby to convert web pages to JSON for storage in CouchDB. It
works well for our use case.
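
The loading end of a pipeline like that can just POST batches of JSON
documents to CouchDB's _bulk_docs endpoint. Our actual job is Ruby on
Hadoop Streaming; the Python sketch below only shows the CouchDB side,
and the database name and document fields are made up:

    import json, urllib.request

    # Load a batch of crawled pages (already converted to JSON) into
    # CouchDB in a single request via _bulk_docs. Hypothetical names.
    docs = [
        {"url": "http://example.com/", "domain": "example.com", "body": "..."},
        {"url": "http://example.org/", "domain": "example.org", "body": "..."},
    ]
    req = urllib.request.Request(
        "http://127.0.0.1:5984/crawl/_bulk_docs",
        data=json.dumps({"docs": docs}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))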

Chris


-- 
Chris Anderson
http://jchris.mfdz.com
