lucene-dev mailing list archives

From eks dev <>
Subject Re: Benchmarking on GOV2
Date Mon, 29 May 2006 13:05:28 GMT
That would be great to see! 

There are a million enhancements and ideas that could come out of this comparison.
For example, I would not be surprised to see mg4j "perfect skipping" become an interesting
optimization for Lucene, and a trie-based lexicon could make some regex queries significantly faster...

Any chance of the mg4j team joining Lucene development? :)

And of course, care should be taken not to compare apples and oranges, especially on pure
boolean queries, as standard Lucene query processing at this moment always goes via scoring.

----- Original Message ----
From: Sebastiano Vigna <>
Sent: Monday, 29 May, 2006 10:39:43 AM
Subject: Benchmarking on GOV2

Dear Lucene developers,
I'd be interested in doing some benchmarking on (at least) Lucene,
Egothor and MG4J. There is no actual data around on publicly available
collections, and it would be nice to have some more objective data on
efficiency for a significantly large collection.

We have GOV2 (25M documents), which is publicly available but must be
bought. We can use it to do the benchmarks, but we will certainly need
some help to configure Lucene so that it works at its best. We have some
reasonably large server that we can allocate to that purpose.

My idea would be to start compression from a text file (one document per
line), so that decompression (GOV2 is in zipped files) and parsing (most
docs are HTML) do not come into play.

We would like to measure indexing time and query answer time--people
from different engines could suggest different queries so that each
engine can highlight its best features. I'd start with pure
Boolean queries in which documents must be returned in index order, so
that the results are the same. In a second phase we can try to compare
the results with ranked queries (which however is going to be more
complicated, and I do not want to duplicate TREC).
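
As a rough illustration of what a "pure Boolean query with results in index order" could look like once scoring is taken out of the picture, here is a minimal sketch in Java. It is not Lucene, Egothor or MG4J code: the class name `BooleanAndSketch`, the `intersect` helper and the tiny in-memory posting lists are all hypothetical, and a real benchmark would read compressed postings from disk and repeat each query many times before reporting a time.

```java
import java.util.Arrays;

public class BooleanAndSketch {

    /**
     * Intersects two posting lists that are sorted by ascending document ID.
     * The result is also in ascending (index) order, and no score is computed.
     */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;                 // document only in list a, skip it
            } else if (a[i] > b[j]) {
                j++;                 // document only in list b, skip it
            } else {
                out[n++] = a[i];     // document in both lists: a hit
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        // Hypothetical posting lists for two terms (document IDs, ascending).
        int[] termA = {2, 5, 9, 14, 21};
        int[] termB = {5, 9, 10, 21, 30};

        long start = System.nanoTime();
        int[] hits = intersect(termA, termB);
        long elapsed = System.nanoTime() - start;

        System.out.println(Arrays.toString(hits) + " in " + elapsed + " ns");
    }
}
```

Because every engine would have to return exactly the same document IDs in the same order for such a query, the outputs can be compared directly, and only the elapsed time differs.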

Please let me know if you're interested in the project!


