Date: Wed, 2 Jul 2008 14:18:58 -0400
From: "Paul Davis"
To: couchdb-user@incubator.apache.org
Subject: Re: view index build time

One thing that got me a while back was the version of Erlang I was
using. If you're not on one of the most recent Erlang releases (R12B or
some such), you might try upgrading that bit to see if it fixes things.

Paul
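(A minimal sketch of the id-only view Brad describes below, written out
as a full map function. The doc.entityobject.sku path comes from his
documents, and the existence check is just a defensive assumption, not
something CouchDB requires.)

  function(doc) {
    // Emit only the key, with a null value, so no document
    // bodies get copied into the view's btree.
    if (doc.entityobject && doc.entityobject.sku) {
      emit(doc.entityobject.sku, null);
    }
  }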
On Wed, Jul 2, 2008 at 1:58 PM, Brad King wrote:
> I created a view with emit(doc.entityobject.sku, null) to only emit
> the doc ids. After trying attachments, I nuked the DB and started
> over, going back to having the documents inline. This is ok, but
> again, the index build time of about 25 minutes for this view against
> 300K or so docs seems long. What are you seeing as typical for
> creating your views against a much larger set? What do your docs look
> like? Thanks.
>
> On Wed, Jul 2, 2008 at 10:50 AM, Jan Lehnardt wrote:
>>
>> On Jul 2, 2008, at 16:17, Brad King wrote:
>>
>>> Just to post some results here of working with around 300K docs. I
>>> changed the view to emit only the doc ID, and index time went down
>>> to about 25 minutes vs. an hour for the same dataset.
>>>
>>> I then converted the largest text field to an attachment, and
>>> things went downhill from there. I deleted the db and started the
>>> upload, but repeatedly got random 500 server errors with no real
>>> way to know what is happening or why. Also, the DB size as reported
>>> by Futon seemed to fluctuate wildly as I was adding documents. And
>>> I mean wildly, like anywhere from 1.2G then back down to 144M.
>>> Weird. I don't get a very warm fuzzy feeling about the stability of
>>> using attachments right now. Ideally, I don't want to use them
>>> anyway; I'd prefer to have the fields all inline and have the
>>> database handle these docs as-is. I don't see these as huge
>>> documents (2 to 5K) compared to what I would store in something
>>> like Berkeley DB XML, just for comparison's sake, so I'm hoping
>>> it's a goal of the project to handle these effectively, even when
>>> several million documents are added.
>>
>> This doesn't sound right at all. Can you make sure you use the very
>> latest SVN version or the 0.8 release, and completely new databases?
>> Also, just to clarify: do you emit the doc into the view payload, as
>> in emit(doc._id, doc), or are you just doing emit(null, null) to
>> only get the doc ids that matter to you and then fetch the documents
>> later? I have had the latter setup running without any problems
>> across ~2 million documents in a database.
>>
>>> As always, thanks for the help.
>>
>> Thanks for the problem report.
>>
>> Cheers
>> Jan
>> --
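(To make Jan's contrast concrete, here are the two emit styles side by
side; an illustrative sketch, not Brad's actual code. Every view row
carries the emitting document's id in its "id" field, which is what
makes the fetch-the-documents-later approach work.)

  // Heavy: copies each document body into the view index.
  function(doc) {
    emit(doc._id, doc);
  }

  // Light: stores no document data in the index at all. Each result
  // row still includes the source doc's id, so the full documents
  // can be fetched from the database afterwards, on demand.
  function(doc) {
    emit(null, null);
  }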
>>> On Tue, Jul 1, 2008 at 9:26 AM, Brad King wrote:
>>>>
>>>> Thanks for the tips. I'll start scaling back the data I'm
>>>> returning and see if it improves. The largest field is an HTML
>>>> description of an inventory item, which seems like a good
>>>> candidate for a binary attachment, but I need to be able to do
>>>> full-text searches on this data eventually (hopefully with the
>>>> Lucene integration), so I'll probably try just not including the
>>>> document data in the views first. We've had some success with
>>>> Lucene independent of CouchDB, so I'm pleased you guys are
>>>> integrating this.
>>>>
>>>> On Sat, Jun 21, 2008 at 8:39 AM, Damien Katz wrote:
>>>>>
>>>>> Part of the problem is you are storing copies of the documents in
>>>>> the btree. If the documents are big, it takes longer to compute
>>>>> on them, and if the results (emit(...)) are big or numerous, then
>>>>> you'll be spending most of your time in I/O.
>>>>>
>>>>> My advice is to not emit the document into the view and, if you
>>>>> can, to get the documents smaller in general. If the data can be
>>>>> stored as a binary attachment, then that too will give you a
>>>>> performance improvement.
>>>>>
>>>>> -Damien
>>>>>
>>>>> On Jun 20, 2008, at 4:51 PM, Brad King wrote:
>>>>>
>>>>>> Thanks, yes, it's currently at 357M and growing!
>>>>>>
>>>>>> On Fri, Jun 20, 2008 at 4:49 PM, Chris Anderson wrote:
>>>>>>>
>>>>>>> Brad,
>>>>>>>
>>>>>>> You can look at
>>>>>>>
>>>>>>>   ls -lha /usr/local/var/lib/couchdb/.my-dbname_design/
>>>>>>>
>>>>>>> to see the view size growing. It won't tell you when it's done,
>>>>>>> but it will give you hope that progress is happening.
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> On Fri, Jun 20, 2008 at 1:45 PM, Brad King wrote:
>>>>>>>>
>>>>>>>> I have about 350K documents in a database, typically around 5K
>>>>>>>> each. I created and saved a view which simply looks at one
>>>>>>>> field in the document. I called the view for the first time
>>>>>>>> with a key that should only match one document, and it's been
>>>>>>>> awaiting a response for about 45 minutes now.
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "sku": {
>>>>>>>>     "map": "function(doc) { emit(doc.entityobject.SKU, doc); }"
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Is this typical, or is there some optimizing to be done on
>>>>>>>> either my view or the server? I'm also running on a VM, so
>>>>>>>> this may have some effect, but smaller databases seem to be
>>>>>>>> performing pretty well. Insert times to set this up were
>>>>>>>> actually really good, I thought, at 4000 to 5000 documents per
>>>>>>>> minute running from my laptop.
>>>>>>>
>>>>>>> --
>>>>>>> Chris Anderson
>>>>>>> http://jchris.mfdz.com
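(A note on the view definition quoted above: it is the "views" section
of a design document. A complete design doc would look roughly like the
sketch below; the _design/products id is invented for illustration.
Emitting doc as the value, rather than null, is exactly what makes the
index store a copy of every ~5K document, per Damien's point above.)

  {
    "_id": "_design/products",
    "views": {
      "sku": {
        "map": "function(doc) { emit(doc.entityobject.SKU, doc); }"
      }
    }
  }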