From: "Nicholas Retallack" <nickretallack@gmail.com>
To: couchdb-user@incubator.apache.org
Subject: Re: Reduce is Really Slow!
Date: Wed, 20 Aug 2008 14:45:33 -0700

Oh, clever. I was considering a solution like this, but I was worried I
wouldn't know where to stop, and might end up chopping it between some
documents that should be grouped together.
"some_value_that_sorts_after_last_possible_date" solves that problem.
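Here's roughly what I take the suggestion to be. I'm guessing at the
field names (doc.name, doc.date), so my sketch may not match what's
actually in the paste:

  // map only, no reduce: a compound key makes all the rows for one
  // name sort next to each other in the view
  function(doc) {
    emit([doc.name, doc.date], doc);
  }

Then the slice for one name is startkey=["docname"]&endkey=["docname",
{}]. An empty object should work as the value that sorts after any
possible date, since objects collate after strings in view keys.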
There's another problem though, for when I want to do pagination. Say I
want to display exactly 100 of these on a page. How do I know I've
fetched 100 of them, if any number of documents could be in a group?
Also, how would I know what document name appears 100 documents ahead
of this one? This gets messy...

Essentially I figured this should be a task the database is capable of
doing on its own. I don't want every action in my web application to
have to solve the caching problem itself, after doing serious
data-munging on all this ugly stuff I got back from the database. How
do I know when the cache should be invalidated anyway, without insider
knowledge from the database?

Hm, cleverness. I guess I could figure out what every hundredth name is
by making a view for just the names and querying that. Any efficient
way to reduce that list for uniqueness? Perhaps group=true and a reduce
of function(){return true}, something like the sketch below.

There should be a wiki page devoted to these silly tricks, like this
hackish way to put together pagination. And tag clouds.
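Untested, and just my guess at how to write it, but I imagine the view
would look like this:

  // map: one row per document, keyed by name
  function(doc) {
    emit(doc.name, null);
  }

  // reduce: the value is a throwaway; ?group=true does the real work
  // by collapsing the rows to one per distinct key
  function(keys, values) {
    return true;
  }

Querying with ?group=true should then give one row per unique name, and
count/skip would page through names rather than raw documents.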
On Wed, Aug 20, 2008 at 1:56 PM, Paul Davis wrote:
> If I'm not mistaken, you have a number of documents that all have a
> given 'name', and you want the list of elements for each value of
> 'name'. To accomplish this in db-land, you could use a design
> document like [1].
>
> Then to get the data for any given doc name, you'd query your map like
> [2]. This gets you everything emitted with a given doc name. The
> underlying idea to remember in getting data out of couch is that your
> maps should emit things that sort together. Then you can use 'slice'
> operations to pull out the documents you need.
>
> Your values aren't magically in one array, but merging the arrays in
> app-land is easy enough.
>
> If I've completely screwed up what you were going after, let me know.
>
> [1] http://www.friendpaste.com/2AHz3ahr
> [2] http://localhost:5984/dbname/_view/design_docid/index?startkey=["docname"]&endkey=["docname",
> some_value_that_sorts_after_last_possible_date]
>
> Paul
>
> On Wed, Aug 20, 2008 at 4:32 PM, Nicholas Retallack wrote:
>> Replacing 'return values' with 'return values.length' shows you're
>> right: 4 minutes for the first query, milliseconds afterward, as
>> opposed to forever.
>>
>> I guess I was expecting reduce to do things it wasn't designed to do.
>> I notice ?group=true&group_level=1 is ignored unless a reduce
>> function of some sort exists, though. Is there any way to get this
>> grouping behavior without such extreme reductions in result size and
>> performance?
>>
>> The view I was using here (http://www.friendpaste.com/2AHz3ahr) was
>> designed to simply take each document with the same name and merge
>> them into one document, turning same-named fields into lists (here's
>> a more general version: http://www.friendpaste.com/Ud6ELaXC). This
>> reduces the document size, but only by whatever overhead the repeated
>> field names would add. The fields I was reducing only contained
>> integers, so reduction did shrink documents by quite a bit. It was
>> pretty handy, but the query took 25 seconds to return one result even
>> when called repeatedly.
>>
>> Is there some technical reason for this limitation?
>>
>> I had assumed reduce was just an ordinary post-processing step that I
>> could run once and have something akin to a brand new generated table
>> to query on, so I wrote my views to transform my data to fit the
>> various ways I wanted to view it. It worked fine for small amounts of
>> data in little experiments, but as soon as I used it on my real
>> database, I hit this wall.
>>
>> Are there plans to make reduce work for these more general
>> data-mangling tasks? Or should I be approaching the problem a
>> different way? Perhaps write my map calls differently so they produce
>> more rows for reduce to compact? Or do something special if the third
>> parameter to reduce is true?
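For reference, a sketch (not code from the thread) of a counting reduce
that stays fixed-size and uses that third parameter. It is true on
re-reduce, when the values are the outputs of earlier reduce calls
rather than rows from the map:

  function(keys, values, rereduce) {
    if (rereduce) {
      // values are partial counts from earlier reduce calls
      var total = 0;
      for (var i = 0; i < values.length; i++) total += values[i];
      return total;
    }
    // values are raw mapped rows; count them
    return values.length;
  }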
>>
>> On Tue, Aug 19, 2008 at 5:41 PM, Damien Katz wrote:
>>> You can return arrays and objects, whatever JSON allows. But if the
>>> object keeps getting bigger the more rows it reduces, then it simply
>>> won't work.
>>>
>>> The exception is that the size of the reduce value can be
>>> logarithmic with respect to the rows. The simplest example of
>>> logarithmic growth is the summing of a row value. With Erlang's
>>> bignums, the size on disk is Log2(Sum(Rows)), which is perfectly
>>> acceptable growth.
>>>
>>> -Damien
>>>
>>> On Aug 19, 2008, at 8:14 PM, Nicholas Retallack wrote:
>>>
>>>> Oh! I didn't realize that was a rule. I had used 'return values' in
>>>> an attempt to run the simplest test possible on my data. But hey,
>>>> values is an array. Does that mean you're not allowed to return
>>>> objects like arrays from reduce at all? Because I was kind of
>>>> hoping I could. I was able to do it with smaller amounts of data,
>>>> after all. Perhaps this is due to re-reduce kicking in?
>>>>
>>>> For the record, couchdb is still working on this query I started
>>>> hours ago, and chewing up all my cpu. I am going to have to kill it
>>>> so I can get some work done.
>>>>
>>>> On Tue, Aug 19, 2008 at 4:21 PM, Damien Katz wrote:
>>>>
>>>>> I think the problem with your reduce is that it looks like it's
>>>>> not actually reducing to a single value, but instead using reduce
>>>>> for grouping data. That will cause severe performance problems.
>>>>>
>>>>> For reduce to work properly, you should end up with a fixed-size
>>>>> data structure regardless of the number of values being reduced
>>>>> (not strictly true, but that's the general rule).
>>>>>
>>>>> -Damien
>>>>>
>>>>> On Aug 19, 2008, at 6:55 PM, Nicholas Retallack wrote:
>>>>>
>>>>>> Okay, I got it built on gentoo instead, but I'm still having
>>>>>> performance issues with reduce.
>>>>>>
>>>>>> Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [async-threads:0]
>>>>>> couchdb - Apache CouchDB 0.8.1-incubating
>>>>>>
>>>>>> Here's a query I tried to do:
>>>>>>
>>>>>> I freshly imported about 191MB of data in 155399 documents. 29090
>>>>>> are not discarded by map. Map produces one row with 5 fields for
>>>>>> each of these documents. After grouping, each group should have
>>>>>> four rows. Reduce is a simple function(keys,values){return values}.
>>>>>>
>>>>>> Here's the query call:
>>>>>>
>>>>>> time curl -X GET 'http://localhost:5984/clickfund/_view/offers/index?count=1&group=true&group_level=1'
>>>>>>
>>>>>> This is running on a 512MB slicehost account. http://www.slicehost.com/
>>>>>>
>>>>>> I'd love to give you this command's execution time, since I ran
>>>>>> it last night before I went to bed, but it must have taken over
>>>>>> an hour, because my laptop went to sleep and severed the
>>>>>> connection. Trying it again.
>>>>>>
>>>>>> Considering it's blazing fast without the reduce function, I can
>>>>>> only assume what's taking all this time is overhead setting up
>>>>>> and tearing down the simple function(keys,values){return values}.
>>>>>>
>>>>>> I can give you guys the python source to set up this database so
>>>>>> you can try it yourself if you like.
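The canonical fixed-size reduce Damien describes above, summing a row
value, might look like this. A sketch only, assuming the map emits a
number as each row's value:

  function(keys, values, rereduce) {
    // the same code covers both cases: on re-reduce the values are
    // partial sums, and a sum of partial sums is still the sum
    var total = 0;
    for (var i = 0; i < values.length; i++) total += values[i];
    return total;
  }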