From: "Chris Anderson"
Date: Mon, 8 Sep 2008 11:42:16 -0700
To: couchdb-user@incubator.apache.org
Subject: Re: couchdb and a large, large database

On Mon, Sep 8, 2008 at 11:25 AM, william kinney wrote:
> Hi,
>
> I was just wondering if anyone has had any experience with a couchdb
> database of at least 100GB, maybe 2 TB? We are thinking of using it to
> store some crawled data, but are unsure about the scalability of reading
> it after it's been populated. After indexing it at 20MB/1901 records
> (about 2 minutes), it takes a good 12 seconds to start returning the
> data on a generic server.
CouchDB should be able to handle that much data in terms of raw documents. Are you making a view request that is taking that much time? Document requests should be very fast even with vast databases.

View requests (using design-docs, not temp-views) must run the view function on each document in turn, so generation time will be linear in the number of documents. However, once the views are generated, query time should be wicked quick. There is no facility yet for parallelizing view generation across nodes, but it is on the roadmap. (I've put a rough sketch of a design-doc view below my sig.)

I'm currently working with databases of a few hundred thousand documents (from a directed web-crawl), and view generation is on the order of hours. However, I only have to do that when I redefine the map functions. Once the views are generated, adding new data and incrementally updating them is also linear, in proportion to the amount of new data you've added.

Maybe we can talk about web-spidering issues when I'm in NY. We use Nutch/Hadoop to gather data, and have a Hadoop Streaming job that uses Ruby to convert web pages to JSON for storage in CouchDB. It works well for our use case. (There's a rough sketch of that below too.)

Chris

--
Chris Anderson
http://jchris.mfdz.com
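
P.S. In case a concrete example helps, here is a minimal sketch of defining and querying a design-doc view over HTTP, using only Python's stdlib. Everything specific in it is made up for illustration: the database name ("crawl"), the design doc ("_design/pages"), the view name ("by_url"), and the document field ("url"). The view query URL shown matches recent CouchDB releases; older versions used a different layout, so check the docs for your version.

    import json
    import urllib.request

    COUCH = "http://127.0.0.1:5984"  # assumes a local CouchDB

    def put_json(path, doc):
        # PUT a JSON body and return the parsed response.
        req = urllib.request.Request(
            COUCH + path,
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # A design doc; the map function itself is JavaScript, shipped as a
    # string. It runs once per document, which is why initial view
    # generation is linear in the number of documents.
    design = {
        "views": {
            "by_url": {
                "map": "function(doc) { if (doc.url) emit(doc.url, null); }"
            }
        }
    }
    print(put_json("/crawl/_design/pages", design))

    # The first query pays the full generation cost; later queries only
    # process documents added since the last run.
    url = COUCH + "/crawl/_design/pages/_view/by_url?limit=10"
    with urllib.request.urlopen(url) as resp:
        print(json.load(resp))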
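
And since you mentioned crawling: our converter is Ruby, but the same idea as a Python sketch (a Hadoop Streaming mapper that turns fetched pages into JSON docs and bulk-loads them into CouchDB) might look like this. The input format (one "url <TAB> raw-html" record per line), the target database, and the batch size are all assumptions for the example, not what we actually run.

    import json
    import sys
    import urllib.request

    COUCH = "http://127.0.0.1:5984/crawl"  # assumed target database
    BATCH = 100  # arbitrary flush size

    def flush(docs):
        # _bulk_docs writes a whole batch in one request, which is much
        # cheaper than one POST per document.
        req = urllib.request.Request(
            COUCH + "/_bulk_docs",
            data=json.dumps({"docs": docs}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()

    docs = []
    # Hadoop Streaming hands each input record to the mapper on stdin;
    # here we assume one tab-separated url/html pair per line.
    for line in sys.stdin:
        url, _, html = line.rstrip("\n").partition("\t")
        if not url:
            continue
        docs.append({"url": url, "body": html})
        if len(docs) >= BATCH:
            flush(docs)
            docs = []
    if docs:
        flush(docs)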