incubator-couchdb-user mailing list archives

From "Nicholas Retallack" <>
Subject Re: Reduce is Really Slow!
Date Wed, 20 Aug 2008 21:45:33 GMT
Oh clever.  I was considering a solution like this, but I was worried
I wouldn't know where to stop, and might end up chopping it between
some documents that should be grouped together.
"some_value_that_sorts_after_last_possible_date" solves that problem.
There's another problem though, for when I want to do pagination.  Say
I want to display exactly 100 of these on a page.  How do I know I've
fetched 100 of them, if any number of documents could be in a group?
Also, how would I know what document name appears 100 documents ahead
of this one?  This gets messy...

Essentially I figured this should be a task the database is capable of
doing on its own.  I don't want every action in my web application to
have to solve the caching problem on its own, after doing serious
data-munging on all this ugly stuff I got back from the database.  How
do I know when the cache should be invalidated anyway, without insider
knowledge from the database?

Hm, cleverness.  I guess I could figure out what every hundredth name
is by making a view for just the names and querying that.  Any
efficient way to reduce that list for uniqueness?  Perhaps group=true
and reduce = function(){return true}.  There should be a wiki page
devoted to these silly tricks, like this hackish way to put together
pagination.  And tag clouds.
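
Something like this, maybe (untested, and the view names are made up):

// map: one row per document, keyed by name
function(doc) {
  emit(doc.name, null);
}

// reduce: collapse each group of same-named rows to a single row
function(keys, values) {
  return true;
}

Querying that with group=true and count=101 would give me 101 unique
names, and the last key in the response is the startkey for the next
page:

curl 'http://localhost:5984/dbname/_view/names/index?group=true&count=101'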

On Wed, Aug 20, 2008 at 1:56 PM, Paul Davis <> wrote:
> If I'm not mistaken, you have a number of documents that all have a
> given 'name'. And you want the list of elements for each value of
> 'name'. To accomplish this in db-land, you could use a design
> document like [1].
> Then to get the data for any given doc name, you'd query your map like
> [2]. This gets you everything emitted with a given doc name. The
> underlying idea to remember in getting data out of couch is that your
> maps should emit things that sort together. Then you can use 'slice'
> operations to pull at the documents you need.
> Your values aren't magically in one array, but merging the arrays in
> app-land is easy enough.
> If I've completely screwed up what you were going after, let me know.
> [1]
> [2] http://localhost:5984/dbname/_view/design_docid/index?startkey=["docname"]&endkey=["docname",
> some_value_that_sorts_after_last_possible_date]
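> The map in [1] would be roughly this (a sketch -- your actual field
> names will differ):
>
> function(doc) {
>   // Compound key: every row for a given name sorts together,
>   // ordered by date, so a startkey/endkey range pulls out
>   // exactly one name's rows.
>   emit([doc.name, doc.date], doc.value);
> }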
> Paul
> On Wed, Aug 20, 2008 at 4:32 PM, Nicholas Retallack
> <> wrote:
>> Replacing 'return values' with 'return values.length' shows you're
>> right.  4 minutes for the first query, milliseconds afterward, as
>> opposed to forever.
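>> Strictly speaking I suppose the count needs to handle the third
>> rereduce parameter too -- something like this (untested):
>>
>> function(keys, values, rereduce) {
>>   if (rereduce) {
>>     // values here are counts from earlier passes, so sum them
>>     var total = 0;
>>     for (var i = 0; i < values.length; i++) total += values[i];
>>     return total;
>>   }
>>   // first pass: count the raw emitted values
>>   return values.length;
>> }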
>> I guess I was expecting reduce to do things it wasn't designed to do.
>> I notice ?group=true&group_level=1 is ignored unless a reduce function
>> of some sort exists though.  Is there any way to get this grouping
>> behavior without such extreme reductions in result size / performance?
>> The view I was using here ( was
>> designed to simply take each document with the same name and merge
>> them into one document, turning same-named fields into lists (here's a
>> more general version).  This
>> reduces the document size, but only by whatever overhead the repeated
>> field names would add.  The fields I was reducing only contained
>> integers, so reduction did shrink documents by quite a bit.  It was
>> pretty handy, but the query took 25 seconds to return one result even
>> when called repeatedly.
>> Is there some technical reason for this limitation?
>> I had assumed reduce was just an ordinary post-processing step that I
>> could run once and have something akin to a brand new generated table
>> to query on, so I wrote my views to transform my data to fit the
>> various ways I wanted to view it.  It worked fine for small amounts of
>> data in little experiments, but as soon as I used it on my real
>> database, I hit this wall.
>> Are there plans to make reduce work for these more general
>> data-mangling tasks?  Or should I be approaching the problem a
>> different way?  Perhaps write my map calls differently so they produce
>> more rows for reduce to compact?  Or do something special if the third
>> parameter to reduce is true?
>> On Tue, Aug 19, 2008 at 5:41 PM, Damien Katz <> wrote:
>>> You can return arrays and objects, whatever json allows. But if the object
>>> keeps getting bigger the more rows it reduces, then it simply won't work.
>>> The exception is that the size of the reduce value can be logarithmic with
>>> respect to the rows. The simplest example of logarithmic growth is the
>>> summing of a row value. With Erlang's bignums, the size on disk is
>>> Log2(Sum(Rows)), which is perfectly acceptable growth.
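>>> For example, a summing reduce stays tiny no matter how many rows feed
>>> it:
>>>
>>> function(keys, values, rereduce) {
>>>   // values are numbers on the first pass and on rereduce alike,
>>>   // so one loop covers both cases
>>>   var total = 0;
>>>   for (var i = 0; i < values.length; i++) total += values[i];
>>>   return total;
>>> }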
>>> -Damien
>>> On Aug 19, 2008, at 8:14 PM, Nicholas Retallack wrote:
>>>> Oh!  I didn't realize that was a rule.  I had used 'return values' in an
>>>> attempt to run the simplest test possible on my data.  But hey, values is
>>>> an array.  Does that mean you're not allowed to return objects like
>>>> arrays from reduce at all?  Because I was kind of hoping I could.  I was
>>>> able to do it with smaller amounts of data, after all.  Perhaps this is
>>>> due to re-reduce kicking in?
>>>> For the record, couchdb is still working on this query I started hours
>>>> ago, and chewing up all my cpu.  I am going to have to kill it so I can
>>>> get some work done.
>>>> On Tue, Aug 19, 2008 at 4:21 PM, Damien Katz <> wrote:
>>>>> I think the problem with your reduce is that it looks like it's not
>>>>> actually reducing to a single value, but instead using reduce for
>>>>> grouping data.  That will cause severe performance problems.
>>>>> For reduce to work properly, you should end up with a fixed size data
>>>>> structure regardless of the number of values being reduced (not
>>>>> strictly true, but that's the general rule).
>>>>> On Aug 19, 2008, at 6:55 PM, Nicholas Retallack wrote:
>>>>>> Okay, I got it built on gentoo instead, but I'm still having
>>>>>> performance issues with reduce.
>>>>>> Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [async-threads:0]
>>>>>> couchdb - Apache CouchDB 0.8.1-incubating
>>>>>> Here's a query I tried to do:
>>>>>> I freshly imported about 191MB of data in 155399 documents.  29090 were
>>>>>> not discarded by map.  Map produces one row with 5 fields for each of
>>>>>> these documents.  After grouping, each group should have four rows.
>>>>>> Reduce is a simple function(keys,values){return values}.
>>>>>> Here's the query call:
>>>>>> time curl -X GET 'http://localhost:5984/clickfund/_view/offers/index?count=1&group=true&group_level=1'
>>>>>> This is running on a 512MB slicehost account.
>>>>>> I'd love to give you this command's execution time, since I ran it the
>>>>>> night before I went to bed, but it must have taken over an hour because
>>>>>> my laptop went to sleep and severed the connection.  Trying it again.
>>>>>> Considering it's blazing fast without the reduce function, I can only
>>>>>> assume what's taking all this time is overhead setting up and tearing
>>>>>> down this simple function(keys,values){return values}.
>>>>>> I can give you guys the python source to set up this database so you
>>>>>> can try it yourself if you like.
