From: Adam Kocoloski
To: dev@couchdb.apache.org
Subject: Re: Possible bug in indexer... (really)
Date: Fri, 3 Jul 2009 20:35:45 -0400

On Jul 3, 2009, at 6:37 PM, Chris Anderson wrote:

> 2009/7/3 Göran Krampe:
>> Hi folks!
>>
>> We are writing an app using CouchDB where we tried to do some
>> map/reduce to calculate "period sums" for about 1000 different
>> "accounts". This is fiscal data, btw; the system is meant to store
>> detailed fiscal data for about 50000 companies, for starters. :)
>>
>> The map function is trivial, it just emits a bunch of "accountNo,
>> amount" pairs with "month" as key.
>>
>> The reduce/rereduce take these and build a dictionary (JSON object)
>> with "month-accountNo" as key (like "2009/10-2335") and the sum as
>> the value. This works fine, yes, it builds up a bit, but there is a
>> maximum of account numbers and months so it doesn't grow out of
>> control, so that is NOT the issue.
>
> There is *no reason ever* to build up a dictionary with more than a
> small handful of items in it. E.g. it's OK if your dictionary has
> this fixed set of keys: count, total, stddev, avg.
>
> It's not OK to do what you are doing. This is what group_level is
> for. Rewrite your map/reduce to be correct and then we can start
> talking about performance.
>
> I don't mean to be harsh, but suggesting you have a performance
> problem here is like me complaining that my Ferrari makes a bad boat.
>
> Cheers,
> Chris

Wow, that was unusually harsh coming from you, Chris. Taking a closer
look at Göran's map and reduce functions, I agree that they should be
reworked to make use of group=true, but nevertheless I wonder if we do
have something to work on here.
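For what it's worth, here's a minimal sketch of the kind of rework
Chris is suggesting (field names are assumptions, not Göran's actual
schema): move the account number out of the reduce-side dictionary and
into the emitted key, and let the grouping machinery do the rest.

    // Map: emit the grouping dimensions as a composite key.
    function (doc) {
      // assumes each doc carries month, accountNo and amount fields
      emit([doc.month, doc.accountNo], doc.amount);
    }

    // Reduce: a plain sum. On a rereduce, values holds partial sums,
    // so the same code is correct for both passes.
    function (keys, values, rereduce) {
      return sum(values);
    }

Querying with group=true then yields one row per [month, accountNo]
pair, and group_level=1 collapses those rows into per-month totals,
e.g. GET /db/_design/fiscal/_view/period_sums?group=true (design doc
and view names made up here).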
I believe Göran's problem was that the second pass was causing the
view updater process to use a significant amount of memory and trigger
should_flush() immediately. As a result, view KVs were being written
to disk after every document (triggering the reduce/rereduce step).
This is fantastically inefficient. If the penalty for flushing results
to disk during indexing is so severe, perhaps we want to be a little
more flexible in imposing it. There could be very legitimate cases
where users with large documents and/or sophisticated workflows are
hung out to dry during indexing because the view updater wants a
measly 11MB of memory to do its job.

Adam

>> Ok, here comes the punchline. When we dump the first 1000 docs
>> using bulk, which typically will amount to say 5000 emits - and we
>> "touch" the view to trigger it - it will be rather fast and behaves
>> like this:
>>
>> - a single Erlang process runs and emits all values, then it does a
>> bunch of reduces on those values, and finally it switches into
>> rereduce mode and does those, and then you can see the dictionary
>> "growing" a bit but never too much. It is pretty fast, a second or
>> two all in all.
>>
>> Fine. Then we dump the *next* 1000 docs into Couch and trigger the
>> view again. This time it behaves like this (believe it or not):
>>
>> - two Erlang processes get into play. It seems the same process as
>> above continues with emits (IIRC) but a second one starts doing
>> reduce/rereduce *while the first one is emitting*.

This is actually by design.

>> Ouch. And to make it worse - the second one seems to gradually
>> "take over" until we only see 2-3 emits followed by tons of
>> rereduces (all the way up, I guess, for each emit).

This is not.

>> Sooo... evidently Couch decides to do stuff in parallel and starts
>> doing reduce/rereduce while emitting here. AFAIK this is not the
>> behavior described.

Not sure if it's described, but it is by design. The reduce function
executes when the btree is modified. We can't afford to cache KVs
from an index update in memory regardless of size; we have to set
some threshold at which we flush them to disk.
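For reference, the dictionary-building reduce Göran describes would
look something like this (a reconstruction from his description, not
his actual code; it assumes the map does
emit(doc.month, {accountNo: ..., amount: ...})):

    function (keys, values, rereduce) {
      var sums = {};  // one entry per "month-accountNo" seen so far
      if (rereduce) {
        // values is a list of partial dictionaries; merge them
        for (var i = 0; i < values.length; i++) {
          for (var k in values[i]) {
            sums[k] = (sums[k] || 0) + values[i][k];
          }
        }
      } else {
        // keys is a list of [key, docid] pairs; keys[i][0] is the month
        for (var i = 0; i < values.length; i++) {
          var k = keys[i][0] + "-" + values[i].accountNo;
          sums[k] = (sums[k] || 0) + values[i].amount;
        }
      }
      return sums;
    }

Every time KVs are flushed, rereduces have to merge these dictionaries
all the way up the modified btree path, so the cost of each flush
scales with the size of the hash rather than with the handful of new
emits.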
I think the fundamental question is why the flush operations were
occurring so frequently the second time around. Is it because you
were building up a largish hash for the reduce value? Probably.
Nevertheless, I'd like to have a better handle on that.

Adam

>> The net effect is that the view update that took 1-2 seconds
>> suddenly takes 400 seconds, or goes to a total crawl and never
>> seems to end.
>>
>> By looking at the log, it obviously processes ONE doc at a time -
>> giving us 2-5 emits typically - and then tries to reduce that all
>> the way up to the root before processing the next doc. So the
>> rereduces for the internal nodes will typically be run 1000x more
>> often than needed in this case.
>>
>> Phew. :) Ok, so we are basically hosed with this behavior in this
>> situation. I can only presume this has gone unnoticed because:
>>
>> a) Updates most of us do are small. But we dump thousands of new
>> docs using bulk (a full new fiscal year of data for a given
>> company) so we definitely notice it.
>>
>> b) Most reduce/rereduce functions are very, very fast, so it goes
>> unnoticed. Our functions are NOT that fast - but if they were only
>> run as they should be (well, presuming they *should* only be run
>> after all the emits for all doc changes in a given view update) it
>> would indeed be fast anyway. We can see that, since the first 1000
>> docs work fine.
>>
>> ...and thanks to the people on #couchdb for discussing this with me
>> earlier today and looking at the Erlang code to try to figure it
>> out. I think Adam Kocoloski and Robert Newson had some ideas about
>> it.
>>
>> regards, Göran
>>
>> PS. I am on vacation now for 4 weeks, so I will not be answering
>> much email. I wanted to get this posted though, since it is in some
>> sense a rather ... serious performance bottleneck.
>
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io