couchdb-user mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: Some guidance with extremely slow indexing
Date Thu, 09 Apr 2009 15:17:23 GMT
Kenneth,

I'm pretty sure your issue is in the reduce steps for the daily and
monthly views. The general rule of thumb is that a reduce shouldn't
return data that grows faster than log(#keys processed), whereas I
believe your data is growing linearly with input.

This particular limitation is a result of how incremental reductions
are implemented. Basically, each key/pointer pair stores the
re-reduced value for all [re-]reduce values in its child nodes. So
as your reduction moves up the tree the data starts exploding, which
kills btree performance, not to mention the extra file I/O.
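
To make the growth argument concrete, here is a rough sketch comparing the
serialized size of two reduce outputs. The code from the gist is not
reproduced in this message, so collectingReduce below is a hypothetical
stand-in for a reduce that builds one entry per user; countingReduce and
the reducedSize helper are likewise illustrative names, and the user
names are generated test data.

```javascript
// Hypothetical "collecting" reduce: merges per-user objects into one object.
// Its output gains an entry for every distinct user it sees.
function collectingReduce(keys, values, rereduce) {
  var out = {};
  for (var i = 0; i < values.length; i++) {
    for (var user in values[i]) {
      out[user] = (out[user] || 0) + values[i][user];
    }
  }
  return out;
}

// Counting reduce: always returns a single number, whatever the input size.
function countingReduce(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Helper (illustrative): reduce n generated values and measure the JSON
// length of the result, a rough proxy for what each btree node must store.
function reducedSize(reduceFn, makeValue, n) {
  var values = [];
  for (var i = 0; i < n; i++) {
    values.push(makeValue(i));
  }
  return JSON.stringify(reduceFn(null, values, false)).length;
}

function perUserValue(i) {
  var v = {};
  v["user" + i + "@example.com"] = 1; // generated test data, one user per row
  return v;
}

function countValue(i) {
  return 1;
}

var collectSmall = reducedSize(collectingReduce, perUserValue, 10);
var collectLarge = reducedSize(collectingReduce, perUserValue, 1000);
var countSmall = reducedSize(countingReduce, countValue, 10);
var countLarge = reducedSize(countingReduce, countValue, 1000);

console.log("collecting reduce:", collectSmall, "->", collectLarge, "chars");
console.log("counting reduce:  ", countSmall, "->", countLarge, "chars");
```

The collecting output grows roughly in step with the number of distinct
users, while the counting output only gains a digit per factor of ten,
which is the log-like growth the rule of thumb asks for.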

The basic moral of the story is that if you want reduce views like
this per user you should emit a [user_id, date] pair as the key and
then call your reduce views with group=true.
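
A minimal sketch of that suggestion, under some assumptions: the field
names come from the sample document in the quoted message, treating
"retr" as a number worth summing is a guess, and the emit stand-in, the
queryGrouped helper and the second and third sample documents are made up
so the example runs outside CouchDB. Inside a design document only the
map and reduce functions would be needed.

```javascript
// Harness only: CouchDB's view server supplies emit(); this stand-in
// collects the emitted rows so the example can run on its own.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map: one row per POP3 logout, keyed by [user, day].
// The value is a fixed-size pair: [session count, retr total].
function map(doc) {
  if (doc.service === "pop3d" && doc.action === "LOGOUT") {
    var day = doc.time.substring(0, 10); // e.g. "2009/03/13"
    emit([doc.user, day], [1, parseInt(doc.retr, 10) || 0]);
  }
}

// Reduce: sums the pairs. The output is always two numbers, and because
// input and output have the same shape the same code handles rereduce.
function reduce(keys, values, rereduce) {
  var total = [0, 0];
  for (var i = 0; i < values.length; i++) {
    total[0] += values[i][0];
    total[1] += values[i][1];
  }
  return total;
}

// Rough imitation of querying the view with group=true: rows that share
// an identical key are reduced together. Key ordering here is a plain
// string sort, not CouchDB's collation.
function queryGrouped(docs) {
  rows = [];
  docs.forEach(function (doc) { map(doc); });
  var groups = {};
  rows.forEach(function (row) {
    var k = JSON.stringify(row.key);
    (groups[k] = groups[k] || []).push(row.value);
  });
  return Object.keys(groups).sort().map(function (k) {
    return { key: JSON.parse(k), value: reduce(null, groups[k], false) };
  });
}

// First document mirrors the sample in the quoted message; the other two
// are invented for illustration.
var sampleDocs = [
  { user: "user@example.com", time: "2009/03/13 05:47:08 +0000",
    action: "LOGOUT", service: "pop3d", retr: "0" },
  { user: "user@example.com", time: "2009/03/13 09:12:40 +0000",
    action: "LOGOUT", service: "pop3d", retr: "2048" },
  { user: "other@example.com", time: "2009/03/14 10:00:00 +0000",
    action: "LOGOUT", service: "pop3d", retr: "512" }
];

var result = queryGrouped(sampleDocs);
console.log(JSON.stringify(result));
```

Each grouped row carries a two-number value no matter how many documents
fed into it, so the values stored in the btree stay small.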

HTH,
Paul Davis

On Thu, Apr 9, 2009 at 10:25 AM, Kenneth Kalmer
<kenneth.kalmer@gmail.com> wrote:
> Hi everyone
>
> After months of lurking and reading up on couch I finally got the time to
> start using it for an internal mail log analyzer. I parse the logs from our
> Courier-IMAP installation and convert the different lines into documents and
> this has proven to work quite well.
>
> My first task is to extract some metrics from these docs regarding how
> often people "pop" their mail, and the returned sizes of each "pop".
> Documents in question look like this:
>
> {
>   "_id": "0000f68e73f3521f3ee8b3b51e0101d7",
>   "_rev": "1-3732031452",
>   "user": "user@example.com",
>   "host": "pop-5",
>   "time": "2009/03/13 05:47:08 +0000",
>   "action": "LOGOUT",
>   "service": "pop3d",
>   "ip": "[10.0.0.1]",
>   "top": "0",
>   "retr": "0"
> }
>
> I've got one design document, with 4 views in. All of them have reduce steps
> as well. I've placed all the code in a Gist to keep the mail clean:
> http://gist.github.com/92476
>
> Basically I get the following from the different views:
>
> * days - Days and number of activities, used as a key lookup for...
> * daily - Total aggregate usage for each user on the day
> * months & monthly work the same as the above, except over months
>
> Updating the indexes is incredibly slow, and I have no idea where to begin
> looking. I suspect my maps are "expensive", but since this is my first shot
> I'll keep quiet and listen to any advice. With "slow" I mean that on my
> local development VM (gentoo, couch 0.9, erlang R12B-5, js 1.7) processing
> 150,000 docs is closing in on 24 hours... On a production site I have
> 3,300,000 docs and over about 18 hours it has only indexed 264,091 documents
> (7%). I built the views using only a couple of hundred docs, probably less
> than 1,000, and didn't expect this to happen...
>
> From reading other posts in the archives I know the initial index can take a
> while, but somehow this just seems a bit ridiculous.
>
> Any advice would be greatly appreciated.
>
> Thanks in advance, and thanks for the awesome tool you guys have built.
>
> Best
>
> --
> Kenneth Kalmer
> kenneth.kalmer@gmail.com
> http://opensourcery.co.za
>
