lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Tue, 01 Feb 2005 18:01:03 GMT
David Spencer wrote:
> 
> +(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
> 
> (f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
> 
> (f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4) (f1:t5^2.0 t5) f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5

This looks great to me!  I'd make mand=true by default, i.e., have a 
method where this parameter is not specified.  Similarly, we might 
default phraseBoosts[i] to boolBoosts[i]*phraseBoost, and slops to 
infinity.  What we want is something that provides only the knobs that 
we think most folks will need.  Ideally we wouldn't even need to specify 
fieldBoosts.  Short fields like titles get a larger lengthNorm, which 
effectively boosts them a lot already.
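
To make the suggested defaults concrete, here is a minimal sketch in Java. It is not David's actual code: the class QuerySketch, the method name formulate, its parameter order, and the convention that an empty field name means the default field are all assumptions made for illustration. It only builds a query string in the syntax quoted above, and "infinite" slop is approximated with Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: builds multi-field query strings
// shaped like the ones quoted above.
final class QuerySketch {

    /** Full form: every knob is explicit. */
    static String formulate(String[] terms, String[] fields, float[] boolBoosts,
                            float[] phraseBoosts, int[] slops, boolean mand) {
        List<String> clauses = new ArrayList<>();
        // One group per term, searching that term in every field.
        for (String term : terms) {
            StringBuilder group = new StringBuilder(mand ? "+(" : "(");
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) group.append(' ');
                group.append(clause(fields[i], term, boolBoosts[i]));
            }
            clauses.add(group.append(')').toString());
        }
        // One sloppy phrase per field over all the terms.
        String phrase = "\"" + String.join(" ", terms) + "\"";
        for (int i = 0; i < fields.length; i++) {
            clauses.add(clause(fields[i], phrase + "~" + slops[i], phraseBoosts[i]));
        }
        return String.join(" ", clauses);
    }

    /** Short form with the defaults suggested in this message. */
    static String formulate(String[] terms, String[] fields, float[] boolBoosts,
                            float phraseBoost) {
        float[] phraseBoosts = new float[fields.length];
        int[] slops = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            phraseBoosts[i] = boolBoosts[i] * phraseBoost; // derived default
            slops[i] = Integer.MAX_VALUE;                  // stands in for "infinity"
        }
        return formulate(terms, fields, boolBoosts, phraseBoosts, slops, true);
    }

    // An empty field name means the default field; a boost of 1.0 is omitted.
    private static String clause(String field, String text, float boost) {
        String prefix = field.isEmpty() ? "" : field + ":";
        String suffix = boost == 1.0f ? "" : "^" + boost;
        return prefix + text + suffix;
    }

    public static void main(String[] args) {
        String[] terms = {"t1", "t2"};
        String[] fields = {"f1", ""};
        float[] boolBoosts = {2.0f, 1.0f};
        System.out.println(formulate(terms, fields, boolBoosts, 1.5f));
    }
}
```

With boolBoosts of 2.0 and 1.0 and a phraseBoost of 1.5, the derived phrase boosts come out as 3.0 and 1.5, the same values used in the quoted examples, so the short form would cover that case without the caller spelling them out.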

But perhaps we should back off and first just evaluate single field 
search with different idf, tf (and perhaps lengthNorm and sloppyFreq) 
definitions.  Once we're happy with those, then we should return to 
different multi-field query formulations.

Let's start with the issue that's been raised so much: whether idf is 
better defined with log() or sqrt(log()).
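
As a starting point for that comparison, here is a minimal sketch of the two candidates in plain Java, with no Lucene dependency. The class and method names are made up for illustration. The log form, log(numDocs/(docFreq+1)) + 1, is meant to mirror the current default; taking the square root of that same smoothed value is only one possible reading of "sqrt(log())", so treat it as an assumption.

```java
// Hypothetical comparison helper, not part of Lucene.
final class IdfSketch {

    /** Candidate 1: smoothed log idf. */
    static double idfLog(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Candidate 2 (assumed form): square root of the smoothed log idf. */
    static double idfSqrtLog(int docFreq, int numDocs) {
        return Math.sqrt(idfLog(docFreq, numDocs));
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        int[] docFreqs = {0, 9, 99, 499, 999};
        // Print both definitions side by side for rare through common terms.
        for (int df : docFreqs) {
            System.out.printf("docFreq=%d log=%.4f sqrt(log)=%.4f%n",
                    df, idfLog(df, numDocs), idfSqrtLog(df, numDocs));
        }
    }
}
```

Both definitions rank terms in the same order; the difference is that the sqrt form compresses the gap between rare and common terms, which is the effect the evaluation would need to measure.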

Doug


