lucene-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: DefaultSimilarity 2.0?
Date Sat, 18 Dec 2004 00:05:31 GMT
Chuck Williams wrote:
> I think this is a great idea and would be happy to play the game.  Re.
> the collection, there is some benefit to TREC if somebody is going to do
> formal recall and precision computations, otherwise it doesn't matter
> much.  The best Similarity for any collection is likely to be specific
> to the collection, so if the point here is to pick the best
> DefaultSimilarity, the collection should be as representative of Lucene
> users' content as possible (I know this is probably impossible to
> achieve).
> 
> One possible danger in these kinds of bake-offs is that people who know
> the content will likely craft specific queries that are not reflective
> of real users.  It would be good to at least have a standard set of
> queries that was tested against each implementation.  Perhaps each
> person could contribute a set of test queries in addition to their
> Similarity and the combined query set could be tested against each.
> 
> Finally, I'd suggest picking content that has multiple fields and allow
> the individual implementations to decide how to search these fields --
> just title and body would be enough.  I would like to use my
> MaxDisjunctionQuery and see how it compares to other approaches (e.g.,
> the default MultiFieldQueryParser, assuming somebody uses that in this
> test).

I believe the collection that I'm using in LuceneBenchmark, the "20 
newsgroups" corpus, meets most if not all of these requirements. The 
benchmark code is at the following link:

	http://www.getopt.org/lb/LuceneBenchmark.java


This collection has the benefit that the relative relevance scores are 
fairly easy to judge, because the nature and structure of the corpus 
are well understood.
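
To make the bake-off idea concrete, here is a minimal sketch of the scoring 
side only: pooling each contributor's test queries into one set, and computing 
precision and recall at a cutoff for a ranked result list. It is not taken 
from LuceneBenchmark and does not call Lucene; the class name BakeoffSketch, 
its methods, and the document ids are all hypothetical. How the relevant set 
is obtained (for example, from the newsgroup a message was posted to) is 
likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; none of these names come from Lucene or LuceneBenchmark.
final class BakeoffSketch {

    // Merge every contributor's queries into one de-duplicated list,
    // keeping the order in which queries were first seen.
    static List<String> poolQueries(List<List<String>> contributed) {
        Set<String> pooled = new LinkedHashSet<String>();
        for (List<String> queries : contributed) {
            pooled.addAll(queries);
        }
        return new ArrayList<String>(pooled);
    }

    // Count how many of the first k ranked document ids are relevant.
    static int relevantInTopK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return hits;
    }

    // Precision at k: relevant results in the top k, divided by k.
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        if (k <= 0) {
            return 0.0;
        }
        return (double) relevantInTopK(ranked, relevant, k) / k;
    }

    // Recall at k: relevant results in the top k, divided by all relevant documents.
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) relevantInTopK(ranked, relevant, k) / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up queries from two contributors; "title:lucene" appears twice.
        List<String> queries = poolQueries(Arrays.asList(
                Arrays.asList("title:lucene", "body:hockey"),
                Arrays.asList("body:hockey", "body:encryption")));

        // Made-up ranking and relevance judgments for a single query.
        List<String> ranked = Arrays.asList("d1", "d2", "d3", "d4");
        Set<String> relevant = new HashSet<String>(Arrays.asList("d1", "d3", "d9"));

        System.out.println("pooled queries: " + queries);
        System.out.println("P@4 = " + precisionAtK(ranked, relevant, 4));
        System.out.println("R@4 = " + recallAtK(ranked, relevant, 4));
    }
}
```

In a real comparison, the ranked list for each pooled query would come from 
running that query under each candidate Similarity, and the per-query figures 
would be averaged per implementation.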

-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

