From: Chris Anderson
To: user@couchdb.apache.org
Subject: Re: Reduce limitations
Date: Thu, 28 May 2009 12:54:16 -0700

On Thu, May 28, 2009 at 12:37 PM, Brian Candler wrote:
> On Tue, May 26, 2009 at 12:21:10PM +0100, Michael Stillwell wrote:
>> On Sat, Apr 25, 2009 at 10:39 PM, Brian Candler wrote:
>> > Has anyone found a clear definition of the limitations of reduce functions,
>> > in terms of how large the reduced data may be?
>>
>> In http://mail-archives.apache.org/mod_mbox/couchdb-user/200808.mbox/%3cEC4EAF59-78BA-4238-9827-B3561E3DC183@apache.org%3e,
>> Damian says:
>>
>> "the size of the reduce value can be logarithmic with respect to the rows"
>
> Which doesn't give any guidance as to the absolute maximum size of the
> reduce value, only how big it can get in relation to the number of rows,
> e.g.
>
>   max_reduce_object_size = k . ln(number of rows reduced)
>
> for some unknown k. Or is it ln(total size of rows reduced)?
>

The deal is that if your reduce function's output is the same size as its input, the final reduce value will end up being as large as all the map rows put together.
If your reduce function's output is 1/2 the size of its input, you'll also end up with quite a large amount of data in the final reduce. In these cases each reduction stage actually accumulates more data, as it is based on ever-increasing numbers of map rows.

If the function reduces data fast enough, the intermediate reduction values will stay relatively constant in size, even as each reduce stage reflects logarithmically more map rows. This is the kind of reduce function you want.

Theoretically, there are no hard limits, and theoretically, even the first kind of function (identity on values, which leads to logarithmic growth of intermediate values) could eventually complete even on a large data set.

Practically, the first limit you'll hit is that all the input values for the function will not fit in the JavaScript interpreter's memory space. Even if that were not an issue, the function computation time will likely go up logarithmically; similarly there will be slowdowns in index processing as the reduction values are stored in the btree inner-nodes. Shuffling around multi-gigabyte inner nodes is not optimal and should be avoided.

I hope that's clear; let me know if I can make it clearer.

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io
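
A minimal sketch of the two kinds of reduce function contrasted above, assuming a simplified model in which rows are reduced in stages with a fixed fan-out. This is a toy simulation in Python, not CouchDB code: the names `tree_reduce`, `identity_reduce` and `sum_reduce`, the fan-out of 4, and the 256 sample rows are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple


def identity_reduce(values: List[list]) -> list:
    """Concatenate the inputs: the output is as large as the input."""
    out: list = []
    for v in values:
        out.extend(v)
    return out


def sum_reduce(values: List[list]) -> list:
    """Collapse the inputs to a single number: the output size is constant."""
    return [sum(x for v in values for x in v)]


def tree_reduce(
    rows: list,
    reduce_fn: Callable[[List[list]], list],
    fanout: int = 4,  # hypothetical fan-out, not a CouchDB setting
) -> Tuple[list, List[int]]:
    """Reduce rows in stages of `fanout` inputs each.

    Returns the final value and the largest intermediate value size
    (number of elements) seen at each stage.
    """
    level = [[r] for r in rows]
    sizes: List[int] = []
    while len(level) > 1:
        level = [
            reduce_fn(level[i:i + fanout])
            for i in range(0, len(level), fanout)
        ]
        sizes.append(max(len(v) for v in level))
    return level[0], sizes


rows = list(range(256))
final_identity, identity_sizes = tree_reduce(rows, identity_reduce)
final_sum, sum_sizes = tree_reduce(rows, sum_reduce)

print(identity_sizes)  # [4, 16, 64, 256]
print(sum_sizes)       # [1, 1, 1, 1]
```

In this toy model the identity-style function carries every row up through each stage, so the final value holds all 256 rows, while the summing function keeps each intermediate value at a single element regardless of how many rows it covers.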