Date: Sun, 12 Apr 2009 13:19:07 -0400
Subject: Re: Some guidance with extremely slow indexing
From: Paul Davis
To: user@couchdb.apache.org

On Sun, Apr 12, 2009 at 12:51 PM, Kenneth Kalmer wrote:
> On Sun, Apr 12, 2009 at 1:11 AM, Chris Anderson wrote:
>
>> On Sat, Apr 11, 2009 at 12:06 PM, Paul Davis wrote:
>> > On Sat, Apr 11, 2009 at 2:58 PM, Kenneth Kalmer wrote:
>> >> On Thu, Apr 9, 2009 at 5:17 PM, Paul Davis wrote:
>> >>
>> >>> Kenneth,
>> >>>
>> >>> I'm pretty sure your issue is in the reduce steps for the daily and
>> >>> monthly views. The general rule of thumb is that you shouldn't be
>> >>> returning data that grows faster than log(#keys processed), whereas I
>> >>> believe your data is growing linearly with input.
>> >>>
>> >>> This particular limitation is a result of the implementation of
>> >>> incremental reductions. Basically, each key/pointer pair stores the
>> >>> re-reduced value for all [re-]reduce values in its children nodes. So
>> >>> as your reduction moves up the tree the data starts exploding, which
>> >>> kills btree performance, not to mention the extra file I/O.
>> >>>
>> >>> The basic moral of the story is that if you want reduce views like
>> >>> this per user you should emit a [user_id, date] pair as the key and
>> >>> then call your reduce views with group=true.
>> >>>
>> >>> HTH,
>> >>> Paul Davis
>> >>>
>> >>
>> >> Hi Paul
>> >>
>> >> Thanks for taking the trouble of investigating for me. I'll dive into
>> >> the views and clean them up a bit according to your advice, as well as
>> >> brush up on the caveat you explained. I saw other threads in the
>> >> archives where you gave similar advice; sorry for not realizing I had
>> >> stepped into the same trap. When I've got the issue resolved I'll
>> >> update the gist and we can leave it as a point of reference for others.
>> >>
>> >> Thanks again!
>> >>
>> >
>> > It's kind of a hard one to notice right away as it's not an error, it
>> > just kills performance. Perhaps Damien was right in that we should
>> > think about adding log vomiting when we detect that there's a crap
>> > load of data accumulating in the reductions.
>> >
>>
>> I agree -- maybe another config setting
>> max_intermediate_reduction_size or something. So that you can raise it
>> if you really know what you are doing. Unless there are hard limits,
>> in which case we should just error properly when we reach them.
>>
>
> Hi Paul & Chris
>
> This would help, I'm sure a lot of people would be caught in this trap
> initially.
>
> I've cleaned up my views a bit and they are much more performant now. On
> our "production" couch, where there are currently 6.6 million docs, the
> indexing has been running for close to 18 hours and is 80% done. I
> killed the previous indexing task, since after 5 days it was only
> 50-something percent done with the 3.1 million docs present at the time
> it started.
>

Yeah, that sounds much closer to the expected performance.

> After going through the docs carefully again and clearly thinking through
> my problem, as well as taking the "emit([key, doc.user])" advice from Paul
> more seriously, I got it working. The docs give the warning without any
> real references, making it sound like a "yeah whatever" kinda thing.

I've updated the wiki with, hopefully, a sterner warning about the
expected data characteristics of reduce functions.

> This is dangerous. However, the real gem lies in a line I picked up
> somewhere in the wiki: it stresses that reduce views should build a
> summary, not aggregate data, which was my mistake. I now aggregate the
> data in my own app with two extra lines of code, and the views become
> very powerful using group_level. So my old 'days' and 'daily' views are
> now combined in a single, more useful, 'daily' view.
>
> I'll update the gist as soon as my DSL is fixed at home and blog on my
> learning curve as well, as soon as I can conjure up a nice example for
> rereduce, which I also only figured out through this exercise.
>
> Thanks again for helping the newbies, the willingness of everyone here to
> assist definitely helps drive couch adoption.
>
> Best
>
> --
> Kenneth Kalmer
> kenneth.kalmer@gmail.com
> http://opensourcery.co.za
>

HTH,
Paul Davis
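
For readers finding this thread in the archives, the advice above (put the breakdown in the key, keep the reduce output a small summary, then query with group or group_level) can be sketched roughly as follows. This is a plain JavaScript simulation, not CouchDB itself: the document fields (user_id, date, bytes) and the query() helper are hypothetical stand-ins, emit is passed in as an argument whereas CouchDB provides it as a global, and the helper neither sorts rows by key nor builds a btree.

```javascript
// Hypothetical documents, one per logged event (field names are assumptions).
const docs = [
  { user_id: "alice", date: "2009-04-11", bytes: 100 },
  { user_id: "alice", date: "2009-04-11", bytes: 250 },
  { user_id: "alice", date: "2009-04-12", bytes: 50 },
  { user_id: "bob", date: "2009-04-11", bytes: 400 },
];

// Map in the CouchDB style: the key carries the per-user, per-day
// breakdown, so the reduce never has to carry it.
function map(doc, emit) {
  const [year, month, day] = doc.date.split("-").map(Number);
  emit([doc.user_id, year, month, day], doc.bytes);
}

// Summary reduce: returns a single number however many values it sees,
// so it works unchanged for rereduce and its output stays constant-size.
function reduceSum(keys, values, rereduce) {
  return values.reduce((a, b) => a + b, 0);
}

// Anti-pattern for comparison: the output grows linearly with the input.
function reduceCollect(keys, values, rereduce) {
  return rereduce ? [].concat(...values) : values;
}

// Hypothetical stand-in for a grouped view query: groups emitted rows by
// the first `groupLevel` key elements and reduces each group once.
// Simplification: keys is passed as null and rereduce is always false.
function query(allDocs, mapFn, reduceFn, groupLevel) {
  const rows = [];
  allDocs.forEach((doc) => mapFn(doc, (key, value) => rows.push({ key, value })));
  const groups = new Map();
  for (const { key, value } of rows) {
    const groupKey = JSON.stringify(key.slice(0, groupLevel));
    if (!groups.has(groupKey)) groups.set(groupKey, []);
    groups.get(groupKey).push(value);
  }
  return [...groups].map(([groupKey, values]) => ({
    key: JSON.parse(groupKey),
    value: reduceFn(null, values, false),
  }));
}

// Full key (similar in spirit to group=true): one row per user per day.
console.log(JSON.stringify(query(docs, map, reduceSum, 4)));
// First key element only (similar in spirit to group_level=1): per-user totals.
console.log(JSON.stringify(query(docs, map, reduceSum, 1)));
// → [{"key":["alice"],"value":400},{"key":["bob"],"value":400}]
```

One map function serves the daily, monthly and per-user questions just by changing the group level, which is the same consolidation Kenneth describes for his 'days' and 'daily' views. With reduceCollect the value for "alice" would be a three-element array that keeps growing with every new document, which is the linear growth the thread warns about.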