lucy-dev mailing list archives

From: Nick Wellnhofer
Subject: Re: [lucy-dev] Some quick benchmarks
Date: Thu, 08 Dec 2011 19:43:36 GMT

On 08/12/11 20:04, Nathan Kurz wrote:
> I'm mostly listening in on this conversation because I haven't thought
> much about indexing, but the magnitude of improvement here surprises
> me:  I wouldn't have thought that there would be that much time to
> shave off!    My presumption was that everything would be dominated by
> Disk IO, and that the actual tokenizing time would be tiny.   Are
> these numbers both working within memory with a pre-warmed cache so no
> disk reads are involved?  Also, have you controlled for whether the
> data is sync'ed to disk after the indexing?

These numbers are with a pre-warmed cache. Also, the data isn't synced
to disk, AFAIU. But I think the analysis chain is CPU-bound in the
general case. All that tokenizing, normalizing, and stemming uses a lot
of CPU cycles.
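
One rough way to check whether a step is CPU-bound is to compare the CPU
time it consumes with the wall-clock time it takes. The sketch below is
not Lucy code: `analyze` is a hypothetical toy chain (regex tokenizer,
case folding, and a crude plural stripper standing in for a stemmer),
and `cpu_fraction` is a made-up helper. A ratio close to 1 would suggest
the work is CPU-bound rather than waiting on IO.

```python
import re
import time


def analyze(text):
    """Toy analysis chain (hypothetical, not Lucy's): tokenize, fold case, 'stem'."""
    tokens = re.findall(r"\w+", text)
    folded = [t.casefold() for t in tokens]
    # Crude stand-in for a stemmer: strip a trailing "s" from longer tokens.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in folded]


def cpu_fraction(func, *args):
    """Run func once and return (result, CPU time / wall-clock time)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, (cpu / wall if wall > 0 else 0.0)


# Synthetic in-memory input, so no disk reads are involved.
doc = "Indexing Benchmarks measure Tokenizers and Stemmers " * 20000
tokens, fraction = cpu_fraction(analyze, doc)
print(f"{len(tokens)} tokens, CPU/wall = {fraction:.2f}")
```

This only measures a single in-memory run, so it says nothing about the
cost of writing or syncing the index afterwards.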

> I'm not in a position to do it, but it might be insightful to do a
> quick profile of where these two are spending their time.  Are we
> gaining because the algorithm is faster, or because we have less
> function call overhead, or because of something confounding?

It's mainly that the algorithms are faster. The CaseFolder seems to be
especially slow, but I have no idea why.

> Oprofile
> on Linux is very easy to use once you have it set up.  In case you
> aren't familiar with it, this is a good intro:

I have used it once and found it hard to set up on a virtual machine. But
it's very useful if you want to profile long-running processes.

