couchdb-dev mailing list archives

From Damien Katz <dam...@apache.org>
Subject Re: Some guidance with extremely slow indexing
Date Thu, 09 Apr 2009 19:45:19 GMT

On Apr 9, 2009, at 11:17 AM, Paul Davis wrote:

> Kenneth,
>
> I'm pretty sure your issue is in the reduce steps for the daily and
> monthly views. The general rule of thumb is that you shouldn't be
> returning data that grows faster than log(#keys processed), whereas I
> believe your data is growing linearly with input.
>
> This particular limitation is a result of the implementation of
> incremental reductions. Basically, each key/pointer pair stores the
> re-reduced value for all [re-]reduce values in its child nodes. So
> as your reduction moves up the tree the data starts exploding, which
> kills btree performance, not to mention the extra file I/O.
>
> The basic moral of the story is that if you want reduce views like
> this per user you should emit a [user_id, date] pair as the key and
> then call your reduce views with group=true.
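
Paul's suggestion can be sketched roughly as follows. This is a plain-JavaScript imitation that runs outside CouchDB, not the code from the Gist: the `emit` collector and the `groupReduce` helper are hypothetical stand-ins for what the view server does with group=true, and summing the `retr` field is an assumption about which metric is wanted.

```javascript
// Sketch only: imitates a CouchDB map/reduce pair in plain JavaScript.
// groupReduce and the emit callback are hypothetical helpers, not CouchDB APIs.

// Sample documents shaped like the one in the original question.
const docs = [
  { user: "a@example.com", time: "2009/03/13 05:47:08 +0000", retr: "2" },
  { user: "a@example.com", time: "2009/03/13 09:12:44 +0000", retr: "3" },
  { user: "b@example.com", time: "2009/03/13 10:00:00 +0000", retr: "1" },
  { user: "a@example.com", time: "2009/03/14 07:30:00 +0000", retr: "4" },
];

// Map: one row per document, keyed by [user, day]; the value is one number.
function map(doc, emit) {
  const day = doc.time.slice(0, 10); // e.g. "2009/03/13"
  emit([doc.user, day], parseInt(doc.retr, 10));
}

// Reduce: a plain sum. The output is a single number no matter how many
// rows are combined, so the same function also serves for rereduce.
function reduce(keys, values, rereduce) {
  return values.reduce((total, v) => total + v, 0);
}

// Imitates group=true: reduce the rows of each distinct key separately.
// Unlike CouchDB, this helper keeps first-seen order and does not sort keys.
function groupReduce(allDocs) {
  const rows = [];
  allDocs.forEach((doc) => map(doc, (key, value) => rows.push({ key, value })));
  const groups = new Map();
  for (const row of rows) {
    const k = JSON.stringify(row.key);
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(row.value);
  }
  return [...groups.entries()].map(([k, values]) => ({
    key: JSON.parse(k),
    value: reduce(null, values, false),
  }));
}

const result = groupReduce(docs);
console.log(result);
```

The per-user breakdown lives in the keys rather than in the reduce value, so the value stored at each btree node stays the same size as the reduction moves up the tree.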

+1 Paul.

New users hit this problem a lot, and since it manifests as a
performance problem, users spend more time than necessary trying to
figure out what's wrong. I wonder if there is something we can do to
make it more obvious when reduce is used incorrectly? Perhaps a limit
(say 1k) on the size of the reduce value, and when it's exceeded a
"reduce value too large" error is generated. In the process of
investigating the error they'll be more likely to find the documentation
that explains what they're doing wrong.
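
To make the idea concrete, here is a minimal sketch of such a check, written as a wrapper around a reduce function in plain JavaScript. The names `MAX_REDUCE_BYTES` and `withSizeLimit` are hypothetical, not CouchDB settings or APIs, and measuring the length of the JSON string is only a rough proxy for the stored size.

```javascript
// Sketch only: MAX_REDUCE_BYTES and withSizeLimit are hypothetical names.
const MAX_REDUCE_BYTES = 1024; // the "say 1k" figure from the proposal

// Wrap a reduce function so an oversized result fails loudly instead of
// quietly slowing the index down.
function withSizeLimit(reduceFn, limit = MAX_REDUCE_BYTES) {
  return function (keys, values, rereduce) {
    const reduced = reduceFn(keys, values, rereduce);
    // Characters of JSON, used here as an approximation of bytes.
    const size = JSON.stringify(reduced).length;
    if (size > limit) {
      throw new Error("reduce value too large: " + size + " > " + limit);
    }
    return reduced;
  };
}

// A reduce whose output grows with its input (the problematic pattern).
const collectAll = (keys, values) => values;
// A reduce whose output stays small.
const sum = (keys, values) => values.reduce((a, b) => a + b, 0);

const manyValues = Array.from({ length: 1000 }, (_, i) => i);

const checkedSum = withSizeLimit(sum);
const checkedCollect = withSizeLimit(collectAll);

const total = checkedSum(null, manyValues, false); // passes the check
console.log(total);

let overflowMessage = null;
try {
  checkedCollect(null, manyValues, false); // exceeds the limit and throws
} catch (err) {
  overflowMessage = err.message;
}
console.log(overflowMessage);
```

A fixed threshold like this would also reject legitimate reduces that return a large but bounded value, so the limit would probably need to be configurable.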

Moving this discussion to dev@. Anyone else have any thoughts or ideas?

-Damien

>
> HTH,
> Paul Davis
>
> On Thu, Apr 9, 2009 at 10:25 AM, Kenneth Kalmer
> <kenneth.kalmer@gmail.com> wrote:
>> Hi everyone
>>
>> After months of lurking and reading up on couch I finally got the time to
>> start using it for an internal mail log analyzer. I parse the logs from our
>> Courier-IMAP installation and convert the different lines into documents and
>> this has proven to work quite well.
>>
>> My first task is to extract some metrics from these docs regarding how
>> often people "pop" their mail, and the returned sizes of each "pop".
>> Documents in question look like this:
>>
>> {
>>   "_id": "0000f68e73f3521f3ee8b3b51e0101d7",
>>   "_rev": "1-3732031452",
>>   "user": "user@example.com",
>>   "host": "pop-5",
>>   "time": "2009/03/13 05:47:08 +0000",
>>   "action": "LOGOUT",
>>   "service": "pop3d",
>>   "ip": "[10.0.0.1]",
>>   "top": "0",
>>   "retr": "0"
>> }
>>
>> I've got one design document, with 4 views in it. All of them have reduce
>> steps as well. I've placed all the code in a Gist to keep the mail clean:
>> http://gist.github.com/92476
>>
>> Basically I get the following from the different views:
>>
>> * days - Days and number of activities, used as a key lookup for...
>> * daily - Total aggregate usage for each user on the day
>> * months & monthly work the same as the above, except over months
>>
>> Updating the indexes is incredibly slow, and I have no idea where to begin
>> looking. I suspect my maps are "expensive", but since this is my first shot
>> I'll keep quiet and listen to any advice. With "slow" I mean that on my
>> local development VM (gentoo, couch 0.9, erlang R12B-5, js 1.7) processing
>> 150,000 docs is closing in on 24 hours... On a production site I have
>> 3,300,000 docs and over about 18 hours it has only indexed 264,091 documents
>> (7%). I built the views using only a couple of hundred docs, probably less
>> than 1,000, and didn't expect this to happen...
>>
>> From reading other posts in the archives I know the initial index can take a
>> while, but somehow this just seems a bit ridiculous.
>>
>> Any advice would be greatly appreciated.
>>
>> Thanks in advance, and thanks for the awesome tool you guys have built.
>>
>> Best
>>
>> --
>> Kenneth Kalmer
>> kenneth.kalmer@gmail.com
>> http://opensourcery.co.za
>>

