lucene-dev mailing list archives

From David Spencer
Subject Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Tue, 01 Feb 2005 18:59:07 GMT
Doug Cutting wrote:

> David Spencer wrote:
>> +(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
>> (f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
>> (f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4) (f1:t5^2.0 
>> t5) f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5
> This looks great to me!  I'd make mand=true by default, i.e., have a 
> method where this parameter is not specified.  Similarly, we might 
> default phraseBoosts[i] to boolBoosts[i]*phraseBoost, and slops to 
> infinity.  What we want is something that provides only the knobs that 
> we think most folks will need.  Ideally we wouldn't even need to specify 
> fieldBoosts.  Short fields like titles get a larger lengthNorm, which 
> effectively boosts them a lot already.

Yeah, I agree with all of the above: offer options, but provide 
easy-to-use ways of calling it with intelligent defaults.
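
To make the "intelligent defaults" idea concrete, here is a minimal 
sketch of a builder that produces the query strings quoted above. The 
class and method names are hypothetical (this is not the code behind my 
page), a null field name stands for the default field, and the phrase 
boost of 1.5 is assumed because it reproduces the 3.0 and 1.5 seen in 
the examples when phraseBoosts[i] is taken as boolBoosts[i]*phraseBoost. 
Slops are still passed explicitly here rather than defaulting to 
infinity.

```java
final class MultiFieldQuerySketch {
    // Assumed default; chosen only because 2.0*1.5 = 3.0 and 1.0*1.5 = 1.5
    // match the phrase boosts in the example queries.
    static final float DEFAULT_PHRASE_BOOST = 1.5f;

    // Convenience overload: mand defaults to true, phrase boost to the default.
    static String build(String[] terms, String[] fields,
                        float[] boolBoosts, int[] slops) {
        return build(terms, fields, boolBoosts, slops, true, DEFAULT_PHRASE_BOOST);
    }

    // Full form. fields[i] == null means the default (unprefixed) field.
    static String build(String[] terms, String[] fields, float[] boolBoosts,
                        int[] slops, boolean mand, float phraseBoost) {
        StringBuilder sb = new StringBuilder();
        // One boolean group per term, searching every field.
        for (String term : terms) {
            if (sb.length() > 0) sb.append(' ');
            if (mand) sb.append('+');
            sb.append('(');
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) sb.append(' ');
                sb.append(prefix(fields[i])).append(term);
                // A boost of 1.0 is omitted, as in the example queries.
                if (boolBoosts[i] != 1.0f) sb.append('^').append(boolBoosts[i]);
            }
            sb.append(')');
        }
        // One sloppy phrase per field, boosted by boolBoosts[i] * phraseBoost.
        String phrase = "\"" + String.join(" ", terms) + "\"";
        for (int i = 0; i < fields.length; i++) {
            sb.append(' ').append(prefix(fields[i])).append(phrase)
              .append('~').append(slops[i])
              .append('^').append(boolBoosts[i] * phraseBoost);
        }
        return sb.toString();
    }

    private static String prefix(String field) {
        return field == null ? "" : field + ":";
    }

    public static void main(String[] args) {
        System.out.println(build(new String[] {"t1", "t2"},
                                 new String[] {"f1", null},
                                 new float[] {2.0f, 1.0f},
                                 new int[] {5, 2}));
    }
}
```

The four-argument overload is the "easy" entry point; the six-argument 
form keeps the extra knobs for anyone who needs them.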
> But perhaps we should back off and first just evaluate single field 
> search with different idf, tf (and perhaps lengthNorm and sloppyFreq) 
> definitions.  Once we're happy with those, then we should return to 
> different multi-field query formulations.
> Let's start with the issue that's been raised so much: whether idf is 
> better defined with log() or sqrt(log()).

I can redo my page and rebuild the indexes if necessary. I just need 
clarification on what we want to do, especially: does the index need to 
be rebuilt?


I currently have 2 variations of the index: one with the default 
settings and another built with the Similarity code Chuck attached to 
the bug report. Do we need other variations of the index, e.g. with 
different weights, or are the weights used during indexing less 
important than the log() vs. sqrt(log()) issue?
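
To pin down what is being compared, here is a small standalone sketch 
of the two idf definitions. The log() form follows the formula in 
Lucene's DefaultSimilarity, log(numDocs/(docFreq+1)) + 1. The 
sqrt(log()) form is an assumption on my part: it simply takes the 
square root of that same value, which may not be exactly the variant 
discussed in the bug report. The class and method names are 
hypothetical, and the code does not depend on Lucene.

```java
final class IdfComparisonSketch {
    // idf as in Lucene's DefaultSimilarity: log(numDocs / (docFreq + 1)) + 1
    static float idfLog(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Assumed sqrt(log()) variant: square root of the value above.
    static float idfSqrtLog(int docFreq, int numDocs) {
        return (float) Math.sqrt(idfLog(docFreq, numDocs));
    }

    public static void main(String[] args) {
        int numDocs = 1000000;                       // hypothetical index size
        int[] docFreqs = {1, 10, 100, 10000, 500000};
        System.out.println("docFreq   log()   sqrt(log())");
        for (int df : docFreqs) {
            System.out.println(String.format("%7d  %6.3f  %6.3f",
                    df, idfLog(df, numDocs), idfSqrtLog(df, numDocs)));
        }
    }
}
```

Printing both columns side by side for a range of docFreq values shows 
how much the square root compresses the gap between rare and common 
terms.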


I guess it's obvious from the above, but just to make it clear: I'll 
change the page to do only single-field queries. But how many 
variations do we want to see in parallel? The current page shows 2x2 
results, one for each combination of index and query, but I could, say, 
show several more queries in parallel with different weights.

> Doug
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
