lucene-dev mailing list archives

From "Chuck Williams" <>
Subject RE: DefaultSimilarity 2.0?
Date Fri, 17 Dec 2004 21:42:04 GMT
I think this is a great idea and would be happy to play the game.  Re.
the collection, there is some benefit to TREC if somebody is going to do
formal recall and precision computations; otherwise it doesn't matter
much.  The best Similarity for any collection is likely to be specific
to the collection, so if the point here is to pick the best
DefaultSimilarity, the collection should be as representative of Lucene
users' content as possible (I know this is probably impossible to
achieve).
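For reference, the recall and precision computations mentioned above are
simple once relevance judgements exist. A minimal sketch (the document ids
and judgements below are made up for illustration):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: ranked list of document ids returned by the engine
    relevant:  set of document ids judged relevant for the query
    """
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgements: 4 relevant docs; the engine returns 5, 3 of
# which are relevant.
p, r = precision_recall(["d1", "d7", "d3", "d9", "d2"],
                        {"d1", "d2", "d3", "d4"})
print(p, r)  # 0.6 0.75
```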

One possible danger in these kinds of bake-offs is that people who know
the content will likely craft specific queries that are not reflective
of real users.  It would be good to at least have a standard set of
queries that was tested against each implementation.  Perhaps each
person could contribute a set of test queries in addition to their
Similarity and the combined query set could be tested against each.
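The pooling step could look something like the following sketch (contributor
names, queries, and the stub search functions are all hypothetical; a real
harness would call each candidate's HTTP search endpoint):

```python
# Each contributor submits a query set alongside their Similarity.
contributed = {
    "chuck": ["apache lucene scoring", "idf normalization"],
    "doug":  ["nutch crawl dmoz", "trec web track"],
}

# Pool everyone's queries into one combined, deduplicated set.
combined = sorted({q for qs in contributed.values() for q in qs})

def evaluate(search_fn, queries):
    """Run every pooled query through one candidate's search function."""
    return {q: search_fn(q) for q in queries}

# Stubs stand in for real candidate implementations.
candidates = {"default-sim": lambda q: [], "max-disjunction-sim": lambda q: []}
results = {name: evaluate(fn, combined) for name, fn in candidates.items()}
print(len(combined))  # 4
```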

Finally, I'd suggest picking content that has multiple fields and allowing
the individual implementations to decide how to search these fields --
just title and body would be enough.  I would like to use my
MaxDisjunctionQuery and see how it compares to other approaches (e.g.,
the default MultiFieldQueryParser, assuming somebody uses that in this
bake-off).
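The difference at stake can be shown with a toy model of the score
combination (not Lucene code; per-field scores are assumed given): a
max-disjunction combiner takes the best field score, optionally plus a
tie-breaker fraction of the rest, while a naive multi-field disjunction
sums across fields, which can rank a document matching one term in both
fields above a document matching both terms in one field.

```python
def sum_fields(field_scores):
    """Naive multi-field combination: add the per-field scores."""
    return sum(field_scores)

def max_disjunction(field_scores, tie_breaker=0.0):
    """Take the best field score plus a fraction of the remaining ones."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

# Toy (title, body) scores for a two-term query:
doc_a = [0.0, 1.8]   # matches both query terms, body only
doc_b = [1.0, 1.0]   # matches one term, repeated across both fields
print(sum_fields(doc_a), sum_fields(doc_b))            # 1.8 2.0 -> B wins
print(max_disjunction(doc_a), max_disjunction(doc_b))  # 1.8 1.0 -> A wins
```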


  > -----Original Message-----
  > From: Doug Cutting []
  > Sent: Friday, December 17, 2004 1:27 PM
  > To: Lucene Developers List
  > Subject: DefaultSimilarity 2.0?
  > Chuck Williams wrote:
  > > Another issue will likely be the tf() and idf() computations.  I had a
  > > similar desired relevance ranking and was not getting what I wanted due
  > > to the idf() term dominating the score. [ ... ]
  > Chuck has made a series of criticisms of the DefaultSimilarity
  > implementation.  Unfortunately it is difficult to quickly evaluate
  > these, as it requires relevance judgements.  But, still, we should
  > consider modifying DefaultSimilarity for the 2.0 release if there are
  > easy improvements to be had.  But how do we decide what's better?
  > Perhaps we should perform a formal or semi-formal evaluation of
  > Similarity implementations on a reference collection.  For example,
  > for a formal evaluation we might use one of the TREC Web collections,
  > which have associated queries and relevance judgements.  Or, less
  > formally, we could use a crawl of the ~5M pages in DMOZ (I would be
  > glad to crawl these using Nutch).
  > This could work as follows:
  >    -- Different folks could download and index a reference collection,
  > offering demonstration search systems.  We would provide complete code.
  > These would differ only in their Similarity implementation.  All
  > implementations would use the same Analyzer and search only a single
  > field.
  >    -- These folks could then announce their candidate implementations and
  > let others run queries against them, via HTTP.  Different Similarity
  > implementations could thus be publicly and interactively compared.
  >    -- Hopefully a consensus, or at least a healthy majority, would form
  > on which was the best implementation and we could make that the default
  > for Lucene 2.0.
  > Are there folks (e.g., Chuck) who would be willing to play this game?
  > Should we make it more formal, using, e.g., TREC?  Does anyone have any
  > other ideas how we should decide how to modify DefaultSimilarity?
  > Doug
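The idf() domination quoted above can be seen in a toy version of the
classic tf-idf product (a sketch of the shape of the computation, not the
exact DefaultSimilarity formula): because idf enters both the query weight
and the term weight, it effectively appears squared, so a rare term can
swamp even a large tf advantage for a common one.

```python
import math

def idf(num_docs, doc_freq):
    """Classic Lucene-style inverse document frequency."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def tf(freq):
    """Classic Lucene-style term frequency component."""
    return math.sqrt(freq)

# Toy collection of 1,000,000 docs; compare a common and a rare term.
N = 1_000_000
common = idf(N, 100_000)  # term in ~10% of docs
rare = idf(N, 10)         # term in ~0.001% of docs

# With idf squared, the rare term's single occurrence outweighs the
# common term even at frequency 25 in the document:
print(tf(25) * common ** 2)  # common term, freq 25
print(tf(1) * rare ** 2)     # rare term, freq 1
```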

To unsubscribe, e-mail:
For additional commands, e-mail:
