On Thu, May 28, 2009 at 12:37 PM, Brian Candler wrote:
> On Tue, May 26, 2009 at 12:21:10PM +0100, Michael Stillwell wrote:
>> On Sat, Apr 25, 2009 at 10:39 PM, Brian Candler wrote:
>> > Has anyone found a clear definition of the limitations of reduce functions,
>> > in terms of how large the reduced data may be?
>>
>> In http://mail-archives.apache.org/mod_mbox/couchdb-user/200808.mbox/%3cEC4EAF59-78BA-4238-9827-B3561E3DC183@apache.org%3e,
>> Damian says:
>>
>> "the size of the reduce value can be logarithmic with respect to the rows"
>
> Which doesn't give any guidance as to the absolute maximum size of the
> reduce value, only how big it can get in relation to the number of rows,
> e.g.
>
> max_reduce_object_size = k . ln(number of rows reduced)
>
> for some unknown k. Or is it ln(total size of rows reduced)?
>
The deal is that if your reduce function's output is the same size as
its input, the final reduce value will end up being as large as all
the map rows put together.
If your reduce function's output is 1/2 the size of it's input, you'll
also end up with quite a large amount of data in the final reduce. In
these cases each reduction stage actually accumulates more data, as it
is based on ever increasing numbers of map rows.
If the function reduces data fast enough, the intermediate reduction
values will stay relatively constant, even as each reduce stage
reflects logarithmically more map rows. This is the kind of reduce
function you want.
Theoretically, there are no hard limits, and theoretically, even the
first kind of function (identity on values, which leads to logarithmic
growth of intermediate values) could eventually complete even on a
large data set.
Practically, the first limit you'll hit is that all the input values
for the function will not fit in the JavaScript interpreter's memory
space. Even if that were not an issue, the function computation time
will likely go up logarithmically; similarly there will be slowdowns
in index processing as the reduction values are stored in the btree
inner-nodes. Shuffling around multi-gigabyte inner nodes is not
optimal and should be avoided.
I hope that's clear, let me know if I can make it clearer.
Chris
--
Chris Anderson
http://jchrisa.net
http://couch.io