lucene-dev mailing list archives

From Andrzej Bialecki
Subject Re: Benchmarking on GOV2
Date Mon, 29 May 2006 15:39:34 GMT
Dave Kor wrote:
> Hi,
> On 5/29/06, Sebastiano Vigna wrote:
>> Dear Lucene developers,
>> I'd be interested in doing some benchmarking on (at least) Lucene,
>> Egothor and MG4J. There is no actual data around on publicly available
>> collections, and it would be nice to have some more objective data on
>> efficiency for a significantly large collection.
> I was wondering if you have seen the TREC 2004 paper by Giuseppe
> Attardi, Andrea Esuli and Chirag Patel from the University of Pisa,
> Italy, titled "Using Clustering and Blade Clusters in the TeraByte
> task"?
> In the paper, three search engines (including Lucene) were benchmarked
> on the GOV2 corpus.

I briefly looked at this document, but the testing environment is not 
described clearly enough. E.g. for Lucene, there is no information about 
the JDK version, heap size, whether it was run with -server or -client. 
Also, the authors mention that "times were obtained after repeating the 
query twice, in order to allow for the effects of memory caching", which 
instantly makes me suspicious ... HotSpot usually requires several 
minutes of warm-up. In short, I think the numbers for Lucene are not to 
be trusted.
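
To illustrate the kind of reporting and warm-up I would expect from a benchmark, here is a minimal sketch. It is not taken from the paper and is not Lucene code: the class name WarmupBenchmark, the stand-in workload and the run counts are all made up for illustration, and a real test would replace the workload with actual queries and warm up for far longer than this. On HotSpot the java.vm.name property typically shows whether the client or the server VM is in use.

```java
import java.util.Arrays;

public class WarmupBenchmark {

    // Written by the workload so the JIT cannot discard the computation.
    static volatile long sink;

    /** Describes the JVM so that published timings can be reproduced. */
    public static String describeEnvironment() {
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        return "java.version=" + System.getProperty("java.version")
            + ", java.vm.name=" + System.getProperty("java.vm.name")
            + ", maxHeapMB=" + maxHeapMb;
    }

    /** Returns the median (upper median for even lengths); input is not modified. */
    public static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    /**
     * Runs the task warmupRuns times without recording anything, then
     * measuredRuns times with timing. Returns elapsed nanoseconds per timed run.
     */
    public static long[] measure(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // gives the JIT a chance to compile the hot paths
        }
        long[] elapsed = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = System.nanoTime() - start;
        }
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println(describeEnvironment());

        // Hypothetical stand-in for "run one query against the index".
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += (long) i * i % 7;
            }
            sink = sum;
        };

        // Arbitrary counts, chosen only to keep the example short.
        long[] elapsed = measure(task, 50, 20);
        System.out.println("median ns per run: " + median(elapsed));
    }
}
```

Reporting the first line of output alongside the timings would already answer the questions about JDK version, heap size and VM mode, and comparing the median with and without the warm-up runs shows how much "repeating the query twice" can understate steady-state performance.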

The indexing times seem strange, too: a couple of minutes for the other 
engines, and > 4 hours for Lucene? Something's wrong here ...

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
Contact: info at sigram dot com
