lucy-dev mailing list archives

From: Marvin Humphrey <mar...@rectangular.com>
Subject: Re: [lucy-dev] Some quick benchmarks
Date: Thu, 08 Dec 2011 00:41:45 GMT

On Wed, Dec 07, 2011 at 10:42:57PM +0100, Nick Wellnhofer wrote:
> Some quick and completely unscientific benchmarks, indexing 1000 times  
> the same 10K ASCII document:
>
> RT = RegexTokenizer
> ST = StandardTokenizer
> CF = CaseFolder
> N  = Normalizer
>
> RT:    2.177s
> RT+CF: 3.964s
> RT+N:  2.556s
> ST:    1.551s
> ST+CF: 3.357s
> ST+N:  1.931s
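
The per-component cost implied by the quoted timings can be estimated by simple subtraction. The following is a rough sketch, assuming the cost of each stage is roughly additive (which the ordering effect discussed later suggests is only approximately true); the names `timings` and `added_cost` are illustrative and not part of Lucy.

```python
# Wall-clock timings in seconds, copied from the quoted message above.
timings = {
    "RT": 2.177, "RT+CF": 3.964, "RT+N": 2.556,
    "ST": 1.551, "ST+CF": 3.357, "ST+N": 1.931,
}


def added_cost(tokenizer, filter_name):
    """Seconds added when filter_name is combined with tokenizer.

    Assumes stage costs are additive, which is only an approximation.
    """
    return timings[f"{tokenizer}+{filter_name}"] - timings[tokenizer]


for filter_name in ("CF", "N"):
    for tokenizer in ("RT", "ST"):
        cost = added_cost(tokenizer, filter_name)
        print(f"{filter_name} with {tokenizer}: +{cost:.3f}s")

# Difference between the two tokenizers when run on their own.
tokenizer_gap = timings["RT"] - timings["ST"]
print(f"RT minus ST: {tokenizer_gap:.3f}s")
```

By this arithmetic CaseFolder adds about 1.8 seconds with either tokenizer, while Normalizer adds about 0.38 seconds.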

These numbers are great, and in line with some benchmarks I was also running
today (raw data below).  StandardTokenizer is considerably faster than
RegexTokenizer, and Normalizer is faster than the current implementation of
CaseFolder.  As a result, the proposed EasyAnalyzer (StandardTokenizer,
Normalizer, SnowballStemmer) outperforms PolyAnalyzer (CaseFolder,
RegexTokenizer, SnowballStemmer) by a wide margin, taking about 24% less time
to index the same documents:

    Time to index 1000 docs (10 reps, truncated mean)
    =================================================
    PolyAnalyzer   0.576 secs
    EasyAnalyzer   0.436 secs
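
For readers unfamiliar with the truncated mean reported by the benchmark script, here is a minimal sketch of one way to compute it. It assumes that the "6 kept, 4 discarded" in the raw data means the two fastest and two slowest of the ten reps are dropped; that matches the reported figures but is an assumption about the script, and `truncated_mean` is an illustrative name, not the script's own function.

```python
def truncated_mean(times, discard_each_end):
    """Mean of times after dropping the lowest and highest values.

    discard_each_end values are removed from each end of the sorted list.
    """
    if 2 * discard_each_end >= len(times):
        raise ValueError("nothing left to average after discarding")
    kept = sorted(times)[discard_each_end:len(times) - discard_each_end]
    return sum(kept) / len(kept)


# Per-rep timings in seconds, copied from the raw data in this message.
poly = [0.577, 0.577, 0.579, 0.576, 0.576, 0.575, 0.576, 0.575, 0.586, 0.575]
easy = [0.437, 0.434, 0.436, 0.437, 0.436, 0.436, 0.441, 0.436, 0.435, 0.435]

poly_tm = truncated_mean(poly, 2)
easy_tm = truncated_mean(easy, 2)
print(f"PolyAnalyzer {poly_tm:.3f} secs")
print(f"EasyAnalyzer {easy_tm:.3f} secs")

# Fraction of indexing time saved by EasyAnalyzer relative to PolyAnalyzer.
saving = (poly_tm - easy_tm) / poly_tm
print(f"EasyAnalyzer saves {saving:.1%}")
```

Dropping the extremes keeps a single slow rep, such as the 0.586 in the PolyAnalyzer run, from pulling the average up.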

Can't wait for StandardTokenizer to land in trunk!

> It's also interesting that moving the tokenizer in front of the case  
> folder or normalizer always gave me faster results.

Yes, I get the same results.  When I first saw the effect, I thought it might
come down to stack memory versus a malloc'd buffer in Normalizer, but I was
surprised to see CaseFolder behave the same way.  I have no explanation, but
the results certainly argue for starting off analysis with tokenization.

Marvin Humphrey

===========================================================================

~/projects/lucy_196/perl $ # RegexTokenizer, pattern => \S+
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.300  Docs: 1000
2    Secs: 0.299  Docs: 1000
3    Secs: 0.297  Docs: 1000
4    Secs: 0.300  Docs: 1000
5    Secs: 0.298  Docs: 1000
6    Secs: 0.299  Docs: 1000
7    Secs: 0.297  Docs: 1000
8    Secs: 0.296  Docs: 1000
9    Secs: 0.300  Docs: 1000
10   Secs: 0.298  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.298 secs 
Truncated mean (6 kept, 4 discarded): 0.298 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # StandardTokenizer
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.254  Docs: 1000
2    Secs: 0.251  Docs: 1000
3    Secs: 0.253  Docs: 1000
4    Secs: 0.251  Docs: 1000
5    Secs: 0.253  Docs: 1000
6    Secs: 0.252  Docs: 1000
7    Secs: 0.253  Docs: 1000
8    Secs: 0.253  Docs: 1000
9    Secs: 0.251  Docs: 1000
10   Secs: 0.254  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.253 secs 
Truncated mean (6 kept, 4 discarded): 0.253 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # CaseFolder
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.160  Docs: 1000
2    Secs: 0.159  Docs: 1000
3    Secs: 0.160  Docs: 1000
4    Secs: 0.159  Docs: 1000
5    Secs: 0.160  Docs: 1000
6    Secs: 0.158  Docs: 1000
7    Secs: 0.161  Docs: 1000
8    Secs: 0.158  Docs: 1000
9    Secs: 0.160  Docs: 1000
10   Secs: 0.158  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.159 secs 
Truncated mean (6 kept, 4 discarded): 0.159 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # Normalizer
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.150  Docs: 1000
2    Secs: 0.148  Docs: 1000
3    Secs: 0.150  Docs: 1000
4    Secs: 0.149  Docs: 1000
5    Secs: 0.150  Docs: 1000
6    Secs: 0.148  Docs: 1000
7    Secs: 0.150  Docs: 1000
8    Secs: 0.148  Docs: 1000
9    Secs: 0.151  Docs: 1000
10   Secs: 0.148  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.149 secs 
Truncated mean (6 kept, 4 discarded): 0.149 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # PolyAnalyzer, language => 'en'
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.577  Docs: 1000
2    Secs: 0.577  Docs: 1000
3    Secs: 0.579  Docs: 1000
4    Secs: 0.576  Docs: 1000
5    Secs: 0.576  Docs: 1000
6    Secs: 0.575  Docs: 1000
7    Secs: 0.576  Docs: 1000
8    Secs: 0.575  Docs: 1000
9    Secs: 0.586  Docs: 1000
10   Secs: 0.575  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.577 secs 
Truncated mean (6 kept, 4 discarded): 0.576 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # EasyAnalyzer, language => 'en'
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.437  Docs: 1000
2    Secs: 0.434  Docs: 1000
3    Secs: 0.436  Docs: 1000
4    Secs: 0.437  Docs: 1000
5    Secs: 0.436  Docs: 1000
6    Secs: 0.436  Docs: 1000
7    Secs: 0.441  Docs: 1000
8    Secs: 0.436  Docs: 1000
9    Secs: 0.435  Docs: 1000
10   Secs: 0.435  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.436 secs 
Truncated mean (6 kept, 4 discarded): 0.436 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # [ Normalizer, StandardTokenizer, SnowballStemmer(en) ]
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.470  Docs: 1000
2    Secs: 0.471  Docs: 1000
3    Secs: 0.472  Docs: 1000
4    Secs: 0.472  Docs: 1000
5    Secs: 0.477  Docs: 1000
6    Secs: 0.470  Docs: 1000
7    Secs: 0.468  Docs: 1000
8    Secs: 0.470  Docs: 1000
9    Secs: 0.471  Docs: 1000
10   Secs: 0.470  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.471 secs 
Truncated mean (6 kept, 4 discarded): 0.471 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ # [ RegexTokenizer, CaseFolder, SnowballStemmer(en) ]
~/projects/lucy_196/perl $ vim ../devel/benchmarks/indexers/BenchmarkingIndexer.pm 
~/projects/lucy_196/perl $ perl -Mblib ../devel/benchmarks/indexers/lucy_indexer.plx --docs=1000 --reps=10
------------------------------------------------------------
1    Secs: 0.555  Docs: 1000
2    Secs: 0.558  Docs: 1000
3    Secs: 0.557  Docs: 1000
4    Secs: 0.555  Docs: 1000
5    Secs: 0.565  Docs: 1000
6    Secs: 0.556  Docs: 1000
7    Secs: 0.555  Docs: 1000
8    Secs: 0.558  Docs: 1000
9    Secs: 0.555  Docs: 1000
10   Secs: 0.553  Docs: 1000
------------------------------------------------------------
Lucy 0.002 
Perl 5.10.0
Thread support: no
Linux 2.6.18-194.el5PAE i686
Mean: 0.557 secs 
Truncated mean (6 kept, 4 discarded): 0.556 secs
------------------------------------------------------------
~/projects/lucy_196/perl $ 


