lucene-dev mailing list archives

From David Spencer
Subject Re: DefaultSimilarity 2.0?
Date Fri, 17 Dec 2004 22:23:24 GMT
Doug Cutting wrote:

> Chuck Williams wrote:
>> Another issue will likely be the tf() and idf() computations.  I have a
>> similar desired relevance ranking and was not getting what I wanted due
>> to the idf() term dominating the score. [ ... ]
> Chuck has made a series of criticisms of the DefaultSimilarity 
> implementation.  Unfortunately it is difficult to quickly evaluate 
> these, as it requires relevance judgements.  But, still, we should 
> consider modifying DefaultSimilarity for the 2.0 release if there are 
> easy improvements to be had.  But how do we decide what's better?
> Perhaps we should perform a formal or semi-formal evaluation of various 
> Similarity implementations on a reference collection.  For example, for 
> a formal evaluation we might use one of the TREC Web collections, which have 
> associated queries and relevance judgements.  Or, less formally, we 
> could use a crawl of the ~5M pages in DMOZ (I would be glad to collect 
> these using Nutch).
> This could work as follows:
>   -- Different folks could download and index a reference collection, 
> offering demonstration search systems.  We would provide complete code. 
>  These would differ only in their Similarity implementation.  All 
> implementations would use the same Analyzer and search only a single field.
>   -- These folks could then announce their candidate implementations and 
> let others run queries against them, via HTTP.  Different Similarity 
> implementations could thus be publicly and interactively compared.
>   -- Hopefully a consensus, or at least a healthy majority, would agree 
> on which was the best implementation and we could make that the default 
> for Lucene 2.0.
> Are there folks (e.g., Chuck) who would be willing to play this game? 

I can probably play the game and offer resources, especially if the disk 
space needed is not many GB... 1GB is fine. I'm just not clear on how many 
people you need participating: one person per Similarity proposal? I do 
not have a Similarity proposal myself...
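
For anyone following the tf()/idf() point quoted above, here is a minimal, 
self-contained sketch (plain Java, no Lucene dependency) of the two formulas 
as I understand DefaultSimilarity to compute them: tf as the square root of 
the term frequency, and idf as log(numDocs / (docFreq + 1)) + 1. The class 
name SimilaritySketch and the dampedIdf variant are hypothetical, made up 
only to show what "a different Similarity" might change. dampedIdf is not 
Chuck's proposal and is not part of Lucene.

```java
class SimilaritySketch {

    // tf as in the classic DefaultSimilarity: square root of the raw frequency.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf as in the classic DefaultSimilarity: log(numDocs / (docFreq + 1)) + 1.
    static float defaultIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // HYPOTHETICAL variant for illustration only: take the square root of the
    // default idf so that rare terms outweigh common ones less strongly.
    static float dampedIdf(int docFreq, int numDocs) {
        return (float) Math.sqrt(defaultIdf(docFreq, numDocs));
    }

    public static void main(String[] args) {
        int numDocs = 5000000; // roughly the DMOZ crawl size mentioned above
        int[] docFreqs = {10, 1000, 100000};
        for (int docFreq : docFreqs) {
            System.out.printf("docFreq=%d defaultIdf=%.3f dampedIdf=%.3f%n",
                    docFreq,
                    defaultIdf(docFreq, numDocs),
                    dampedIdf(docFreq, numDocs));
        }
    }
}
```

Running main prints both idf values side by side for a rare, a medium and a 
common term, which gives a rough feel for how much weight the idf factor 
carries before any relevance judgements are involved.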

> Should we make it more formal, using, e.g., TREC?  Does anyone have 
> other ideas how we should decide how to modify DefaultSimilarity?
> Doug
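
If we do go the more formal route, the comparison could boil down to 
something like the sketch below: for each query, score a Similarity's ranked 
result list against the set of documents judged relevant, then average over 
queries. This is only an illustration of average precision, not the official 
TREC tooling. The class name RelevanceEvalSketch, the method names and the 
use of String document IDs are all assumptions, and each ranked list is 
assumed to contain no duplicate IDs.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class RelevanceEvalSketch {

    // Average precision for one query: mean of precision@k taken at each rank k
    // where a relevant document appears, divided by the number of relevant docs.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += hits / (double) (i + 1);
            }
        }
        return sum / relevant.size();
    }

    // Mean average precision over all judged queries. A query with no ranked
    // list in 'runs' contributes 0.
    static double meanAveragePrecision(Map<String, List<String>> runs,
                                       Map<String, Set<String>> judgements) {
        if (judgements.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (Map.Entry<String, Set<String>> entry : judgements.entrySet()) {
            List<String> ranked = runs.get(entry.getKey());
            if (ranked != null) {
                total += averagePrecision(ranked, entry.getValue());
            }
        }
        return total / judgements.size();
    }
}
```

Each candidate Similarity would produce its own 'runs' map for the same 
queries, and the one with the higher mean score would be the better 
candidate by this particular measure.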
