From: "Chris Anderson"
Date: Mon, 8 Sep 2008 11:42:16 -0700
To: couchdb-user@incubator.apache.org
Subject: Re: couchdb and a large, large database

On Mon, Sep 8, 2008 at 11:25 AM, william kinney wrote:
> Hi,
>
> I was just wondering if anyone has had any experience with a couchdb
> database of at least 100GB, maybe 2 TB? We are thinking of using it to
> store some crawled data, but are unsure about the scalability of reading
> it after it's been populated. After indexing it at 20MB/1901 records
> (about 2 minutes), it takes a good 12 seconds to start returning the
> data on a generic server.
CouchDB should be able to handle that much data in terms of raw documents. Are you making a view request that is taking that much time? Document requests should be very fast even with vast databases.

View requests (using design-docs, not temp-views) must run the view function on each document in turn, so generation time will be linear in the number of documents. However, once the views are generated, query time should be wicked quick. There is no facility yet for parallelizing view generation across nodes, but it is on the roadmap. (I've put a rough sketch of a design-doc view below my sig.)

I'm currently working with databases of a few hundred thousand documents (from a directed web-crawl), and view generation is on the order of hours. However, I only have to do that when I redefine the map functions. Once the views are generated, adding new data and incrementally updating them is also linear, in proportion to the amount of new data you've added.

Maybe we can talk about web-spidering issues when I'm in NY. We use Nutch/Hadoop to gather data, and have a Hadoop Streaming job that uses Ruby to convert web pages to JSON for storage in CouchDB. It works well for our use case. (There's a rough sketch of that below too.)

Chris

--
Chris Anderson
http://jchris.mfdz.com
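
P.S. In case a concrete example helps, here is a minimal sketch of defining and querying a design-doc view over HTTP, using only Python's stdlib. Everything specific in it is made up for illustration: the database name ("crawl"), the design doc ("_design/pages"), the view name ("by_url"), and the document field ("url"). The view query URL shown matches recent CouchDB releases; older versions used a different layout, so check the docs for your version.

    import json
    import urllib.request

    COUCH = "http://127.0.0.1:5984"  # assumes a local CouchDB

    def put_json(path, doc):
        # PUT a JSON body and return the parsed response.
        req = urllib.request.Request(
            COUCH + path,
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # A design doc; the map function itself is JavaScript, shipped as a
    # string. It runs once per document, which is why initial view
    # generation is linear in the number of documents.
    design = {
        "views": {
            "by_url": {
                "map": "function(doc) { if (doc.url) emit(doc.url, null); }"
            }
        }
    }
    print(put_json("/crawl/_design/pages", design))

    # The first query pays the full generation cost; later queries only
    # process documents added since the last run.
    url = COUCH + "/crawl/_design/pages/_view/by_url?limit=10"
    with urllib.request.urlopen(url) as resp:
        print(json.load(resp))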
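
And since you mentioned crawling: our converter is Ruby, but the same idea as a Python sketch (a Hadoop Streaming mapper that turns fetched pages into JSON docs and bulk-loads them into CouchDB) might look like this. The input format (one "url <TAB> raw-html" record per line), the target database, and the batch size are all assumptions for the example, not what we actually run.

    import json
    import sys
    import urllib.request

    COUCH = "http://127.0.0.1:5984/crawl"  # assumed target database
    BATCH = 100  # arbitrary flush size

    def flush(docs):
        # _bulk_docs writes a whole batch in one request, which is much
        # cheaper than one POST per document.
        req = urllib.request.Request(
            COUCH + "/_bulk_docs",
            data=json.dumps({"docs": docs}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).read()

    docs = []
    # Hadoop Streaming hands each input record to the mapper on stdin;
    # here we assume one tab-separated url/html pair per line.
    for line in sys.stdin:
        url, _, html = line.rstrip("\n").partition("\t")
        if not url:
            continue
        docs.append({"url": url, "body": html})
        if len(docs) >= BATCH:
            flush(docs)
            docs = []
    if docs:
        flush(docs)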