lucy-dev mailing list archives

From Joe Schaefer <joe_schae...@yahoo.com>
Subject Re: [lucy-dev] Some quick benchmarks
Date Thu, 08 Dec 2011 22:38:17 GMT
When is all this nifty code going to land in trunk?  Don't
wait for anyone to give you permission, Nick; that decision
is all yours.



----- Original Message -----
> From: Nick Wellnhofer <wellnhofer@aevum.de>
> To: lucy-dev@incubator.apache.org
> Cc: 
> Sent: Thursday, December 8, 2011 2:43 PM
> Subject: Re: [lucy-dev] Some quick benchmarks
> 
> On 08/12/11 20:04, Nathan Kurz wrote:
>>  I'm mostly listening in on this conversation because I haven't thought
>>  much about indexing, but the magnitude of improvement here surprises
>>  me:  I wouldn't have thought that there would be that much time to
>>  shave off!    My presumption was that everything would be dominated by
>>  Disk IO, and that the actual tokenizing time would be tiny.   Are
>>  these numbers both working within memory with a pre-warmed cache so no
>>  disk reads are involved?  Also, have you controlled for whether the
>>  data is sync'ed to disk after the indexing?
> 
> These numbers are with pre-warmed cache. Also, the data isn't synced AFAIU. 
> But I think the analysis chain is CPU bound in the general case. All that 
> tokenizing, normalizing and stemming uses a lot of CPU cycles.
> 
>>  I'm not in a position to do it, but it might be insightful to do a
>>  quick profile of where these two are spending their time.  Are we
>>  gaining because the algorithm is faster, or because we have less
>>  function call overhead, or because of something confounding?
> 
> It's mainly that the algorithms are faster. The CaseFolder seems to be 
> especially slow but I have no idea why.
> 
>>  Oprofile
>>  on Linux is very easy to use once you have it set up.  In case you
>>  aren't familiar with it, this is a good intro:
>>  http://lbrandy.com/blog/2008/11/oprofile-profiling-in-linux-for-fun-and-profit/.
> 
> I have used it once and found it hard to set up on a virtual machine. But 
> it's very useful if you want to profile long-running processes.
> 
> Nick
> 
