lucy-dev mailing list archives

From Nick Wellnhofer <>
Subject Re: [lucy-dev] Some quick benchmarks
Date Thu, 08 Dec 2011 18:02:00 GMT
On 08/12/2011 01:41, Marvin Humphrey wrote:
> These numbers are great, and in line with some benchmarks I was also running
> today (raw data below).  StandardTokenizer and Normalizer are considerably
> faster than RegexTokenizer and the current implementation of CaseFolder, and
> thus the proposed EasyAnalyzer (StandardTokenizer, Normalizer,
> SnowballStemmer) outperforms PolyAnalyzer (CaseFolder, RegexTokenizer,
> SnowballStemmer) by a wide margin:
>      Time to index 1000 docs (10 reps, truncated mean)
>      =================================================
>      PolyAnalyzer   .576 secs
>      EasyAnalyzer   .436 secs

Here is more data from a real-world indexing run:

RT+CF (RegexTokenizer + CaseFolder):     139 secs
ST+N  (StandardTokenizer + Normalizer):  112 secs

> Can't wait for StandardTokenizer to land in trunk!

I don't have any further work planned, so the branch is ready to be merged.

>> It's also interesting that moving the tokenizer in front of the case
>> folder or normalizer always gave me faster results.
> Yes, I get the same results.  When I first saw the effect, I thought it might
> be stack-memory-vs-malloc'd-buffer in Normalizer, but I was taken by surprise
> that CaseFolder behaved that way.  I have no explanation, but the results
> certainly argue for starting off analysis with tokenization.

In Normalizer it's probably because we have to scan the whole document 
twice to find the required buffer size, and that second scan rarely if 
ever happens when working with tokenized words.

Also, the benefit of running the normalizer or case folder before the 
tokenizer isn't that great, because tokens and most of the text buffers 
are reused. So we don't really save on allocations.

