lucene-dev mailing list archives

From Otis Gospodnetic <>
Subject Re: Benchmarking on GOV2
Date Mon, 29 May 2006 16:05:33 GMT

----- Original Message ----
From: Andrzej Bialecki <>

Dave Kor wrote:
> Hi,
> On 5/29/06, Sebastiano Vigna <> wrote:
>> Dear Lucene developers,
>> I'd be interested in doing some benchmarking on (at least) Lucene,
>> Egothor and MG4J. There is no actual data around on publicly available
>> collections, and it would be nice to have some more objective data on
>> efficiency for a significantly large collection.
> I was wondering if you have seen the TREC 2004 paper by Giuseppe
> Attardi, Andrea Esuli and Chirag Pate from the University of Pisa,
> Italy, titled "Using Clustering and Blade Clusters in the TeraByte
> task"?
> In the paper, three search engines (including Lucene) were benchmarked
> on the GOV2 corpus.

I briefly looked at this document, but the testing environment is not 
described clearly enough. E.g., for Lucene there is no information about 
the JDK version, the heap size, or whether it was run with -server or 
-client. Also, the authors mention that "times were obtained after 
repeating the query twice, in order to allow for the effects of memory 
caching", which instantly makes me suspicious ... HotSpot usually 
requires several minutes of warm-up. In short, I don't think the numbers 
for Lucene are to be trusted.
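To make the warm-up point concrete, here is a minimal sketch of a timing harness that discards warm-up iterations before measuring (the class and method names are illustrative, not from any of the tools discussed here):

```java
// Sketch of a warm-up-aware micro-benchmark loop: run the task untimed
// until HotSpot has had a chance to compile the hot paths, then measure.
public class WarmupTimer {
    interface Task { void run(); }

    // Run `task` warmupRuns times untimed, then return the average
    // wall-clock time in milliseconds over measuredRuns timed runs.
    static double time(Task task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) task.run();
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) task.run();
        return (System.nanoTime() - start) / 1e6 / measuredRuns;
    }

    public static void main(String[] args) {
        double ms = time(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        }, 100, 10);
        System.out.println("avg ms per run: " + ms);
    }
}
```

Repeating a query only twice, as the paper describes, measures mostly interpreter and cache-fill time rather than steady-state performance.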

OG: there are also command-line options that tell HotSpot how quickly to optimize frequently
executed paths, for instance.
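OG: Since the complaint is that the paper reports none of these settings, a benchmark could simply print them itself. A small sketch using the standard java.lang.management API (this is not code from the paper or from Lucene):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

// Print the JVM details a benchmark report should include: VM name and
// version, max heap, and command-line flags such as -server or
// -XX:CompileThreshold.
public class JvmReport {
    public static void main(String[] args) {
        RuntimeMXBean rt = ManagementFactory.getRuntimeMXBean();
        System.out.println("JVM: " + rt.getVmName() + " " + rt.getVmVersion());
        System.out.println("Max heap (MB): "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024));
        System.out.println("Flags: " + rt.getInputArguments());
    }
}
```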

The indexing times seem strange, too - a couple of minutes for the other 
engines, and > 4 hours for Lucene? Something's wrong here ...

OG: But Andrzej, you already wrote that indexing benchmark tool (which we never put anywhere
in SVN, I'm afraid) that works on some freely available Reuters corpus, I believe.  Why couldn't
that be adapted for testing Lucene, Egothor, and MG4J?
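OG: Adapting it would mostly mean hiding each engine behind a common interface. A hypothetical sketch of what that could look like (the `Indexer` interface and adapter idea are my own illustration, not the actual tool):

```java
import java.util.Collections;
import java.util.List;

// Hypothetical engine-agnostic indexing benchmark: each engine
// (Lucene, Egothor, MG4J) would get its own adapter implementing
// the same Indexer interface, so the timing code is shared.
public class IndexBench {
    interface Indexer {
        void addDocument(String text);
        void close();
    }

    // Return indexing throughput in documents per second.
    static double docsPerSecond(Indexer indexer, List<String> docs) {
        long start = System.nanoTime();
        for (String doc : docs) indexer.addDocument(doc);
        indexer.close();
        double seconds = (System.nanoTime() - start) / 1e9;
        return docs.size() / seconds;
    }

    public static void main(String[] args) {
        // No-op adapter, just to show the harness running.
        Indexer noop = new Indexer() {
            public void addDocument(String text) {}
            public void close() {}
        };
        List<String> docs = Collections.nCopies(1000, "sample document");
        System.out.println("docs/sec: " + docsPerSecond(noop, docs));
    }
}
```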


