incubator-couchdb-user mailing list archives

From "Nicholas Retallack" <nickretall...@gmail.com>
Subject Re: Reduce is Really Slow!
Date Wed, 20 Aug 2008 21:45:33 GMT
Oh clever.  I was considering a solution like this, but I was worried
I wouldn't know where to stop, and might end up chopping it between
some documents that should be grouped together.
"some_value_that_sorts_after_last_possible_date" solves that problem.
There's another problem though, for when I want to do pagination.  Say
I want to display exactly 100 of these on a page.  How do I know I've
fetched 100 of them, if any number of documents could be in a group?
Also, how would I know what document name appears 100 documents ahead
of this one?  This gets messy...
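
One way to handle the "where do I stop" question in app-land is to walk the sorted rows and stop when the first row of the next group appears. This is a minimal sketch, not anything CouchDB provides: the rows, the `clicks` field, and the `takeGroups` helper are all hypothetical, and it assumes the view emits `[name, date]` keys so that rows for one name arrive together.

```javascript
// Hypothetical rows, shaped like view output sorted by a [name, date] key.
const rows = [
  { key: ["alpha", "2008-08-01"], value: { clicks: 1 } },
  { key: ["alpha", "2008-08-02"], value: { clicks: 4 } },
  { key: ["beta", "2008-08-01"], value: { clicks: 2 } },
  { key: ["gamma", "2008-08-01"], value: { clicks: 7 } },
  { key: ["gamma", "2008-08-03"], value: { clicks: 3 } },
];

// Collect consecutive rows sharing key[0] into groups. Stop once `limit`
// groups are complete, i.e. when the first row of group limit+1 shows up,
// so a group is never chopped in the middle.
function takeGroups(sortedRows, limit) {
  const groups = [];
  for (const row of sortedRows) {
    const name = row.key[0];
    const last = groups[groups.length - 1];
    if (last && last.name === name) {
      last.values.push(row.value); // same name: extend the current group
    } else {
      if (groups.length === limit) break; // a new group starts: page is full
      groups.push({ name, values: [row.value] });
    }
  }
  return groups;
}

const page = takeGroups(rows, 2);
console.log(JSON.stringify(page));
```

The catch is the one raised above: the number of rows to fetch for a page of 100 groups is not known in advance, so the caller has to over-fetch or fetch in batches.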

Essentially I figured this should be a task the database is capable of
doing on its own.  I don't want every action in my web application to
have to solve the caching problem on its own, after doing serious
data-munging on all this ugly stuff I got back from the database.  How
do I know when the cache should be invalidated anyway, without insider
knowledge from the database?

Hm, cleverness.  I guess I could figure out what every hundredth name
is by making a view for just the names and querying that.  Any
efficient way to reduce that list for uniqueness?  Perhaps group=true
and reduce = function(){return true}.  There should be a wiki page
devoted to these silly tricks, like this hackish way to put together
pagination.  And tag clouds.
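
The names-only view with a constant reduce could be sketched as below. It runs outside CouchDB as a simulation, so several things are assumptions: the sample documents, the `groupedView` and `pageStartKeys` helpers, `emit` being passed as a parameter rather than being a global, and plain string comparison standing in for CouchDB's collation.

```javascript
// Hypothetical documents; only the `name` field matters here.
const docs = [
  { _id: "1", name: "b" }, { _id: "2", name: "a" }, { _id: "3", name: "b" },
  { _id: "4", name: "c" }, { _id: "5", name: "a" }, { _id: "6", name: "d" },
];

// In a design document the map would call a global emit(); here emit is
// passed in so the function can run outside CouchDB.
const mapNames = (doc, emit) => { if (doc.name) emit(doc.name, null); };
// Constant-size result: the reduce exists only so grouping can happen.
const reduceNames = (keys, values, rereduce) => true;

// Rough stand-in for querying the view with group=true:
// one output row per distinct key, with the reduce applied per group.
function groupedView(allDocs, mapFn, reduceFn) {
  const emitted = [];
  for (const doc of allDocs) {
    mapFn(doc, (key, value) => emitted.push({ key, value, id: doc._id }));
  }
  emitted.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  const rows = [];
  let i = 0;
  while (i < emitted.length) {
    let j = i;
    while (j < emitted.length && emitted[j].key === emitted[i].key) j++;
    const group = emitted.slice(i, j);
    rows.push({
      key: group[0].key,
      value: reduceFn(group.map((r) => [r.key, r.id]), group.map((r) => r.value), false),
    });
    i = j;
  }
  return rows;
}

// Every pageSize-th unique name is the first name on a page.
const pageStartKeys = (rows, pageSize) =>
  rows.filter((_, index) => index % pageSize === 0).map((row) => row.key);

const uniqueNames = groupedView(docs, mapNames, reduceNames);
const starts = pageStartKeys(uniqueNames, 2);
console.log(uniqueNames.map((r) => r.key), starts);
```

Each page-start name could then serve as the startkey for the detailed query, with the next page's start name bounding the end of the range.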

On Wed, Aug 20, 2008 at 1:56 PM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
> If I'm not mistaken, you have a number of documents that all have a
> given 'name'. And you want the list of elements for each value of
> 'name'. To accomplish this in db-land, you could use a design
> document like [1].
>
> Then to get the data for any given doc name, you'd query your map like
> [2]. This gets you everything emitted with a given doc name. The
> underlying idea to remember in getting data out of couch is that your
> maps should emit things that sort together. Then you can use 'slice'
> operations to pull at the documents you need.
>
> Your values aren't magically in one array, but merging the arrays in
> app-land is easy enough.
>
> If I've completely screwed up what you were going after, let me know.
>
> [1] http://www.friendpaste.com/2AHz3ahr
> [2] http://localhost:5984/dbname/_view/design_docid/index?startkey=["docname"]&endkey=["docname",
> some_value_that_sorts_after_last_possible_date]
>
> Paul
>
> On Wed, Aug 20, 2008 at 4:32 PM, Nicholas Retallack
> <nickretallack@gmail.com> wrote:
>> Replacing 'return values' with 'return values.length' shows you're
>> right.  4 minutes for the first query, milliseconds afterward, as
>> opposed to forever.
>>
>> I guess I was expecting reduce to do things it wasn't designed to do.
>> I notice ?group=true&group_level=1 is ignored unless a reduce function
>> of some sort exists though.  Is there any way to get this grouping
>> behavior without such extreme reductions in result size / performance?
>>
>> The view I was using here (http://www.friendpaste.com/2AHz3ahr) was
>> designed to simply take each document with the same name and merge
>> them into one document, turning same-named fields into lists (here's a
>> more general version http://www.friendpaste.com/Ud6ELaXC).  This
>> reduces the document size, but only by whatever overhead the repeated
>> field names would add.  The fields I was reducing only contained
>> integers, so reduction did shrink documents by quite a bit.  It was
>> pretty handy, but the query took 25 seconds to return one result even
>> when called repeatedly.
>>
>> Is there some technical reason for this limitation?
>>
>> I had assumed reduce was just an ordinary post-processing step that I
>> could run once and have something akin to a brand new generated table
>> to query on, so I wrote my views to transform my data to fit the
>> various ways I wanted to view it.  It worked fine for small amounts of
>> data in little experiments, but as soon as I used it on my real
>> database, I hit this wall.
>>
>> Are there plans to make reduce work for these more general
>> data-mangling tasks?  Or should I be approaching the problem a
>> different way?  Perhaps write my map calls differently so they produce
>> more rows for reduce to compact?  Or do something special if the third
>> parameter to reduce is true?
>>
>> On Tue, Aug 19, 2008 at 5:41 PM, Damien Katz <damien@apache.org> wrote:
>>> You can return arrays and objects, whatever json allows. But if the object
>>> keeps getting bigger the more rows it reduces, then it simply won't work.
>>>
>>> The exception is that the size of the reduce value can be logarithmic with
>>> respect to the rows. The simplest example of logarithmic growth is the
>>> summing of a row value. With Erlang's bignums, the size on disk is
>>> Log2(Sum(Rows)), which is perfectly acceptable growth.
>>>
>>> -Damien
>>>
>>> On Aug 19, 2008, at 8:14 PM, Nicholas Retallack wrote:
>>>
>>>> Oh!  I didn't realize that was a rule.  I had used 'return values' in an
>>>> attempt to run the simplest test possible on my data.  But hey, values
>>>> is an array.  Does that mean you're not allowed to return objects like
>>>> arrays from reduce at all?  Because I was kind of hoping I could.  I was
>>>> able to do it with smaller amounts of data, after all.  Perhaps this is
>>>> due to re-reduce kicking in?
>>>>
>>>> For the record, couchdb is still working on this query I started hours
>>>> ago, and chewing up all my cpu.  I am going to have to kill it so I can
>>>> get some work done.
>>>>
>>>> On Tue, Aug 19, 2008 at 4:21 PM, Damien Katz <damien@apache.org> wrote:
>>>>
>>>>> I think the problem with your reduce is that it looks like it's not
>>>>> actually reducing to a single value, but instead using reduce for
>>>>> grouping data. That will cause severe performance problems.
>>>>>
>>>>> For reduce to work properly, you should end up with a fixed size data
>>>>> structure regardless of the number of values being reduced (not strictly
>>>>> true, but that's the general rule).
>>>>>
>>>>> -Damien
>>>>>
>>>>>
>>>>> On Aug 19, 2008, at 6:55 PM, Nicholas Retallack wrote:
>>>>>
>>>>>> Okay, I got it built on gentoo instead, but I'm still having
>>>>>> performance issues with reduce.
>>>>>>
>>>>>> Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [async-threads:0]
>>>>>> couchdb - Apache CouchDB 0.8.1-incubating
>>>>>>
>>>>>> Here's a query I tried to do:
>>>>>>
>>>>>> I freshly imported about 191MB of data in 155399 documents.  29090 are
>>>>>> not discarded by map.  Map produces one row with 5 fields for each of
>>>>>> these documents.  After grouping, each group should have four rows.
>>>>>> Reduce is a simple function(keys,values){return values}.
>>>>>>
>>>>>> Here's the query call:
>>>>>> time curl -X GET 'http://localhost:5984/clickfund/_view/offers/index?count=1&group=true&group_level=1'
>>>>>>
>>>>>> This is running on a 512MB slicehost account.  http://www.slicehost.com/
>>>>>>
>>>>>> I'd love to give you this command's execution time, since I ran it last
>>>>>> night before I went to bed, but it must have taken over an hour because
>>>>>> my laptop went to sleep and severed the connection.  Trying it again.
>>>>>>
>>>>>> Considering it's blazing fast without the reduce function, I can only
>>>>>> assume what's taking all this time is overhead setting up and tearing
>>>>>> down the simple function(keys,values){return values}.
>>>>>>
>>>>>> I can give you guys the python source to set up this database so you
>>>>>> can try it yourself if you like.
>>>>>>
>>>>>
>>>>>
>>>
>>>
>>
>
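
Damien's rule from the quoted messages (a reduce should return a value of roughly fixed size) can be illustrated with a small simulation. This is only a sketch under stated assumptions: `treeReduce` is a hypothetical stand-in for reducing in chunks and then re-reducing the partial results, the chunk size of 10 is arbitrary, and `keys` is passed as null because neither function uses it. It does not reproduce CouchDB's actual storage or timings.

```javascript
// The reduce from the thread: output grows with the number of rows.
const growingReduce = (keys, values, rereduce) => values;
// A bounded alternative: a sum, which also works unchanged on rereduce.
const sumReduce = (keys, values, rereduce) => values.reduce((a, b) => a + b, 0);

// Hypothetical model of reduce + rereduce: reduce leaf chunks first,
// then repeatedly reduce the partial results until one value remains.
function treeReduce(reduceFn, values, chunkSize) {
  if (chunkSize < 2) throw new Error("chunkSize must be at least 2");
  let level = [];
  for (let i = 0; i < values.length; i += chunkSize) {
    level.push(reduceFn(null, values.slice(i, i + chunkSize), false));
  }
  while (level.length > 1) {
    const next = [];
    for (let i = 0; i < level.length; i += chunkSize) {
      next.push(reduceFn(null, level.slice(i, i + chunkSize), true));
    }
    level = next;
  }
  return level[0];
}

// Size of the final reduce value, measured as JSON text length.
const resultSize = (reduceFn, n) =>
  JSON.stringify(treeReduce(reduceFn, Array(n).fill(1), 10)).length;

for (const n of [10, 100, 1000]) {
  console.log(n, "rows:", "growing =", resultSize(growingReduce, n),
    "bytes, sum =", resultSize(sumReduce, n), "bytes");
}
```

In this model the `return values` result carries every input value at every level, while the sum only gains a digit for each tenfold increase in rows, which matches the logarithmic growth described in the quoted reply.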
