From: Adam Kocoloski
To: dev@couchdb.apache.org
Subject: Re: Possible bug in indexer... (really)
Date: Fri, 3 Jul 2009 20:35:45 -0400

On Jul 3, 2009, at 6:37 PM, Chris Anderson wrote:

> 2009/7/3 Göran Krampe:
>> Hi folks!
>>
>> We are writing an app using CouchDB where we tried to do some
>> map/reduce to calculate "period sums" for about 1000 different
>> "accounts". This is fiscal data, btw; the system is meant to store
>> detailed fiscal data for about 50000 companies, for starters. :)
>>
>> The map function is trivial, it just emits a bunch of "accountNo,
>> amount" pairs with "month" as key.
>>
>> The reduce/rereduce take these and build a dictionary (JSON object)
>> with "month-accountNo" as key (like "2009/10-2335") and the sum as
>> the value. This works fine, yes, it builds up a bit, but there is a
>> maximum of account numbers and months so it doesn't grow out of
>> control, so that is NOT the issue.
>
> There is *no reason ever* to build up a dictionary with more than a
> small handful of items in it. E.g. it's OK if your dictionary has
> this fixed set of keys: count, total, stddev, avg.
>
> It's not OK to do what you are doing. This is what group_level is
> for. Rewrite your map/reduce to be correct and then we can start
> talking about performance.
>
> I don't mean to be harsh, but suggesting you have a performance
> problem here is like me complaining that my Ferrari makes a bad boat.
>
> Cheers,
> Chris

Wow, that was unusually harsh coming from you, Chris. Taking a closer
look at Göran's map and reduce functions, I agree that they should be
reworked to make use of group=true, but nevertheless I wonder if we do
have something to work on here.
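For what it's worth, here's a minimal sketch of the kind of rework
Chris is suggesting (field names are assumptions, not Göran's actual
schema): move the account number out of the reduce-side dictionary and
into the emitted key, and let the grouping machinery do the rest.

    // Map: emit the grouping dimensions as a composite key.
    function (doc) {
      // assumes each doc carries month, accountNo and amount fields
      emit([doc.month, doc.accountNo], doc.amount);
    }

    // Reduce: a plain sum. On a rereduce, values holds partial sums,
    // so the same code is correct for both passes.
    function (keys, values, rereduce) {
      return sum(values);
    }

Querying with group=true then yields one row per [month, accountNo]
pair, and group_level=1 collapses those rows into per-month totals,
e.g. GET /db/_design/fiscal/_view/period_sums?group=true (design doc
and view names made up here).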
I believe Göran's problem was that the second pass was causing the
view updater process to use a significant amount of memory and trigger
should_flush() immediately. As a result, view KVs were being written
to disk after every document (triggering the reduce/rereduce step).
This is fantastically inefficient. If the penalty for flushing results
to disk during indexing is so severe, perhaps we want to be a little
more flexible in imposing it. There could be very legitimate cases
where users with large documents and/or sophisticated workflows are
hung out to dry during indexing because the view updater wants a
measly 11MB of memory to do its job.

Adam

>> Ok, here comes the punchline. When we dump the first 1000 docs
>> using bulk, which typically will amount to say 5000 emits - and we
>> "touch" the view to trigger it - it will be rather fast and behaves
>> like this:
>>
>> - a single Erlang process runs and emits all values, then it does a
>> bunch of reduces on those values, and finally it switches into
>> rereduce mode and does those, and then you can see the dictionary
>> "growing" a bit but never too much. It is pretty fast, a second or
>> two all in all.
>>
>> Fine. Then we dump the *next* 1000 docs into Couch and trigger the
>> view again. This time it behaves like this (believe it or not):
>>
>> - two Erlang processes get into play. It seems the same process as
>> above continues with emits (IIRC) but a second one starts doing
>> reduce/rereduce *while the first one is emitting*.

This is actually by design.

>> Ouch. And to make it worse - the second one seems to gradually
>> "take over" until we only see 2-3 emits followed by tons of
>> rereduces (all the way up, I guess, for each emit).

This is not.

>> Sooo... evidently Couch decides to do stuff in parallel and starts
>> doing reduce/rereduce while emitting here. AFAIK this is not the
>> behavior described.

Not sure if it's described, but it is by design. The reduce function
executes when the btree is modified. We can't afford to cache KVs
from an index update in memory regardless of size; we have to set
some threshold at which we flush them to disk.
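For reference, the dictionary-building reduce Göran describes would
look something like this (a reconstruction from his description, not
his actual code; it assumes the map does
emit(doc.month, {accountNo: ..., amount: ...})):

    function (keys, values, rereduce) {
      var sums = {};  // one entry per "month-accountNo" seen so far
      if (rereduce) {
        // values is a list of partial dictionaries; merge them
        for (var i = 0; i < values.length; i++) {
          for (var k in values[i]) {
            sums[k] = (sums[k] || 0) + values[i][k];
          }
        }
      } else {
        // keys is a list of [key, docid] pairs; keys[i][0] is the month
        for (var i = 0; i < values.length; i++) {
          var k = keys[i][0] + "-" + values[i].accountNo;
          sums[k] = (sums[k] || 0) + values[i].amount;
        }
      }
      return sums;
    }

Every time KVs are flushed, rereduces have to merge these dictionaries
all the way up the modified btree path, so the cost of each flush
scales with the size of the hash rather than with the handful of new
emits.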
I think the fundamental question is why the flush operations were
occurring so frequently the second time around. Is it because you
were building up a largish hash for the reduce value? Probably.
Nevertheless, I'd like to have a better handle on that.

Adam

>> The net effect is that the view update that took 1-2 seconds
>> suddenly takes 400 seconds, or goes to a total crawl and never
>> seems to end.
>>
>> By looking at the log, it obviously processes ONE doc at a time -
>> giving us 2-5 emits typically - and then tries to reduce that all
>> the way up to the root before processing the next doc. So the
>> rereduces for the internal nodes will typically be run 1000x more
>> often than needed in this case.
>>
>> Phew. :) Ok, so we are basically hosed with this behavior in this
>> situation. I can only presume this has gone unnoticed because:
>>
>> a) Updates most of us do are small. But we dump thousands of new
>> docs using bulk (a full new fiscal year of data for a given
>> company) so we definitely notice it.
>>
>> b) Most reduce/rereduce functions are very, very fast, so it goes
>> unnoticed. Our functions are NOT that fast - but if they were only
>> run as they should be (well, presuming they *should* only be run
>> after all the emits for all doc changes in a given view update) it
>> would indeed be fast anyway. We can see that, since the first 1000
>> docs work fine.
>>
>> ...and thanks to the people on #couchdb for discussing this with me
>> earlier today and looking at the Erlang code to try to figure it
>> out. I think Adam Kocoloski and Robert Newson had some ideas about
>> it.
>>
>> regards, Göran
>>
>> PS. I am on vacation now for 4 weeks, so I will not be answering
>> much email. I wanted to get this posted though, since it is in some
>> sense a rather ... serious performance bottleneck.
>
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io