Reply-To: user@couchdb.apache.org
MIME-Version: 1.0
Sender: jchris@gmail.com
In-Reply-To: <20090528193729.GA26968@uk.tiscali.com>
References: <140eba4e0905260421i5b8690d2w57aba4d88f64d69b@mail.gmail.com>
	 <20090528193729.GA26968@uk.tiscali.com>
Date: Thu, 28 May 2009 12:54:16 -0700
Message-ID: <e282921e0905281254h355727cfve12748e3661e4759@mail.gmail.com>
Subject: Re: Reduce limitations
From: Chris Anderson <jchris@apache.org>
To: user@couchdb.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

On Thu, May 28, 2009 at 12:37 PM, Brian Candler <B.Candler@pobox.com> wrote:
> On Tue, May 26, 2009 at 12:21:10PM +0100, Michael Stillwell wrote:
>> On Sat, Apr 25, 2009 at 10:39 PM, Brian Candler <B.Candler@pobox.com> wrote:
>> > Has anyone found a clear definition of the limitations of reduce functions,
>> > in terms of how large the reduced data may be?
>>
>> In http://mail-archives.apache.org/mod_mbox/couchdb-user/200808.mbox/%3cEC4EAF59-78BA-4238-9827-B3561E3DC183@apache.org%3e,
>> Damian says:
>>
>> "the size of the reduce value can be logarithmic with respect to the rows"
>
> Which doesn't give any guidance as to the absolute maximum size of the
> reduce value, only how big it can get in relation to the number of rows,
> e.g.
>
>   max_reduce_object_size = k . ln(number of rows reduced)
>
> for some unknown k. Or is it ln(total size of rows reduced)?
>

The deal is that if your reduce function's output is the same size as
its input, the final reduce value will end up being as large as all
the map rows put together.

If your reduce function's output is 1/2 the size of its input, you'll
also end up with quite a large amount of data in the final reduce. In
these cases each reduction stage actually accumulates more data, as it
is based on ever increasing numbers of map rows.
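
To put rough numbers on those two cases, here is a back-of-the-envelope
sketch. It is a toy model, not CouchDB's actual btree: the function name
reducedSizePerStage, the fixed fan-out of 10 values per reduce call, and
the row sizes are all assumptions made for illustration. A fixed-size
reduce is included for comparison.

```javascript
// Toy model, not CouchDB's actual btree: every reduce call combines `fanOut`
// values, and `outputSize` maps the combined input size (in bytes) to the
// size of the value the reduce function returns.
function reducedSizePerStage(rowCount, rowSize, fanOut, outputSize) {
  var sizes = [];
  var nodes = rowCount;     // number of values feeding the current stage
  var valueSize = rowSize;  // size of each of those values
  while (nodes > 1) {
    var inputSize = Math.min(fanOut, nodes) * valueSize;
    valueSize = outputSize(inputSize);
    nodes = Math.ceil(nodes / fanOut);
    sizes.push(valueSize);
  }
  return sizes;
}

// 10,000 map rows of 100 bytes each, 10 values per reduce call.
var identity = reducedSizePerStage(10000, 100, 10, function (s) { return s; });
var half = reducedSizePerStage(10000, 100, 10, function (s) { return s / 2; });
var fixed = reducedSizePerStage(10000, 100, 10, function () { return 100; });

console.log(identity); // [ 1000, 10000, 100000, 1000000 ]
console.log(half);     // [ 500, 2500, 12500, 62500 ]
console.log(fixed);    // [ 100, 100, 100, 100 ]
```

In this model the identity case ends at 1,000,000 bytes, which is exactly
the 10,000 rows times 100 bytes that went in, and halving the data at
each call still grows fivefold per stage.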

If the function reduces data fast enough, the intermediate reduction
values will stay relatively constant in size, even as each reduce stage
reflects many times more map rows than the one before it. This is the
kind of reduce function you want.
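
As an illustration, here is a minimal sketch of such a function, next to
one that does not shrink its input at all. It assumes the map function
emits plain numbers as values. The names statsReduce and collectAll are
made up so the code can be run on its own; in a design document you
would store just the anonymous function.

```javascript
// Output is always four numbers, no matter how many rows went in.
// Assumes `values` is never empty (Infinity would not survive JSON encoding).
var statsReduce = function (keys, values, rereduce) {
  var out = { sum: 0, count: 0, min: Infinity, max: -Infinity };
  for (var i = 0; i < values.length; i++) {
    var v = values[i];
    if (rereduce) {
      // On rereduce, each value is the output of an earlier statsReduce call.
      out.sum += v.sum;
      out.count += v.count;
      out.min = Math.min(out.min, v.min);
      out.max = Math.max(out.max, v.max);
    } else {
      // On the first pass, each value is a number emitted by the map function.
      out.sum += v;
      out.count += 1;
      out.min = Math.min(out.min, v);
      out.max = Math.max(out.max, v);
    }
  }
  return out;
};

// For contrast: this one hands back everything it was given, so its output
// is as large as all of its input rows put together.
var collectAll = function (keys, values, rereduce) {
  return rereduce ? [].concat.apply([], values) : values;
};

var a = statsReduce(null, [3, 1, 4], false);
var b = statsReduce(null, [1, 5], false);
var total = statsReduce(null, [a, b], true);
console.log(total); // { sum: 14, count: 5, min: 1, max: 5 }
console.log(collectAll(null, [[3, 1, 4], [1, 5]], true)); // [ 3, 1, 4, 1, 5 ]
```

The rereduce branch is what keeps the first function well behaved: it
merges earlier summaries into another summary of the same shape instead
of carrying the underlying rows along.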

Theoretically, there are no hard limits, and theoretically, even the
first kind of function (identity on values, where the intermediate
values grow along with the number of map rows) could eventually
complete even on a large data set.

Practically, the first limit you'll hit is that all the input values
for the function will not fit in the JavaScript interpreter's memory
space. Even if that were not an issue, the function computation time
will likely go up logarithmically; similarly there will be slowdowns
in index processing as the reduction values are stored in the btree
inner-nodes. Shuffling around multi-gigabyte inner nodes is not
optimal and should be avoided.

I hope that's clear, let me know if I can make it clearer.

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io