lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Tue, 01 Feb 2005 19:19:56 GMT
David Spencer wrote:
>> Let's start with the issue that's been raised so much: whether idf is 
>> better defined with log() or sqrt(log()).
> 
> 
> I can redo my page and rebuild indexes if necessary, I just need it 
> clarified what we want to do, esp -> does the index need to be rebuilt?

The index needs to be rebuilt if Field.setBoost() or Document.setBoost() 
are used (which we're not doing) or if the Similarity.lengthNorm() 
implementation is changed (Chuck may have altered this).  But when 
comparing tf and idf implementations the index need not be rebuilt.
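
To illustrate why only some changes force a rebuild, here is a rough sketch in plain Java. It is a simplified model, not Lucene's actual code. The class and method names (NormSketch, storedNorm, score) are hypothetical. The formulas are the DefaultSimilarity ones as I understand them, and the score leaves out factors such as queryNorm and coord. The point is only that boosts and lengthNorm() are folded into a stored value at index time, while tf and idf are evaluated when the query runs.

```java
// Simplified model (not Lucene's real classes): which scoring factors are
// frozen at index time and which are computed at search time.
final class NormSketch {

    // Assumed default length normalization: 1 / sqrt(number of terms in the field).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Index time: boosts and the length norm are multiplied together and stored
    // with the document. Changing any of them means re-indexing.
    // (Lucene also compresses this value for storage; that step is omitted here.)
    static float storedNorm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    // Search time: tf and idf are computed per query from index statistics,
    // then combined with the stored norm. Swapping tf or idf needs no rebuild.
    static float score(float freq, int docFreq, int numDocs, float storedNorm) {
        float tf = (float) Math.sqrt(freq);
        float idf = (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        return tf * idf * storedNorm;
    }

    public static void main(String[] args) {
        float norm = storedNorm(1.0f, 1.0f, 100); // written once, at index time
        System.out.println("stored norm = " + norm);
        System.out.println("score       = " + score(4f, 9, 1000, norm));
    }
}
```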

> I guess it's obvious from the above, but just to make it clear - I'll 
> change the page to only do single field queries - but how many 
> variations do we want to see in parallel - the current page shows 2x2 
> results, for each combo of index and query - but I could, say, show several 
> more queries in parallel w/ different weights...

For a start, let's look at idf=1/log(), idf=1/sqrt(log()), tf=sqrt() and 
tf=log().  In other words, the DefaultSimilarity definitions and Chuck's 
WikipediaSimilarity definitions.
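
The four variants could be laid side by side roughly as below. This is a sketch in plain Java with hypothetical names (TfIdfVariants, tfSqrt, tfLog, idfLog, idfSqrtLog). The log-based idf follows the DefaultSimilarity form as I understand it, log(numDocs/(docFreq+1)) + 1. The "+ 1" in tfLog and the exact shape of the sqrt(log()) idf are assumptions for illustration and may not match WikipediaSimilarity.

```java
// Sketch of the tf and idf alternatives under comparison. Not Lucene code.
final class TfIdfVariants {

    // tf = sqrt(freq), the DefaultSimilarity choice.
    static float tfSqrt(float freq) {
        return (float) Math.sqrt(freq);
    }

    // tf = log(freq) + 1 (assumed form), defined as 0 when the term is absent.
    static float tfLog(float freq) {
        return freq > 0 ? (float) (Math.log(freq) + 1.0) : 0f;
    }

    // idf based on log(): log(numDocs / (docFreq + 1)) + 1.
    static float idfLog(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // idf based on sqrt(log()): square root of the log form (assumed form).
    static float idfSqrtLog(int docFreq, int numDocs) {
        return (float) Math.sqrt(idfLog(docFreq, numDocs));
    }

    public static void main(String[] args) {
        float freq = 4f;
        int docFreq = 9;
        int numDocs = 1000;
        // Print the 2x2 grid of tf/idf combinations for one term.
        System.out.println("sqrt tf * log idf       = " + tfSqrt(freq) * idfLog(docFreq, numDocs));
        System.out.println("sqrt tf * sqrt(log) idf = " + tfSqrt(freq) * idfSqrtLog(docFreq, numDocs));
        System.out.println("log tf  * log idf       = " + tfLog(freq) * idfLog(docFreq, numDocs));
        System.out.println("log tf  * sqrt(log) idf = " + tfLog(freq) * idfSqrtLog(docFreq, numDocs));
    }
}
```

Since these are all search-time functions, the same index can serve every combination.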

We should also evaluate Chuck's lengthNorm() method.  That will require 
two indexes (which you already have).

Doug


