lucene-dev mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Tue, 01 Feb 2005 19:17:41 GMT
David Spencer wrote:
  > [1]
  > 
  > I currently have 2 variations on the index, one w/ the default
  > settings and another with the Similarity code Chuck attached to the
  > bug report. Do we need other variations on the index e.g. with
  > different weights, or during indexing are the weights less important
  > than the log() vs. sqrt(log()) issue?

My Similarity eliminates the idf^2 by using sqrt(log()), changes the
base of the logarithm for flattening tf and idf from e to 10 (or any
parameter setting at runtime), changes the lengthNorm flattening from
sqrt to log base-10 (not settable at runtime), and adds 1000 to all
field lengths (normalizing this re. the log base-10 by changing the
numerator from 1 to 3 = log10(1000)).
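A minimal sketch of why sqrt(log()) removes the idf^2, for readers following the thread. In classic Lucene scoring, idf enters each term's score twice (once through the query weight, once through the field weight), so the contribution grows with idf squared. The class and method names below are hypothetical, and the exact form of the sqrt(log()) variant is an assumption here; the code attached to the bug report may differ in detail.

```java
// Illustrative only: IdfSketch and its methods are hypothetical names,
// not part of Lucene or of the Similarity attached to the bug report.
final class IdfSketch {

    // Formula used by Lucene's DefaultSimilarity.idf(docFreq, numDocs):
    // ln(numDocs / (docFreq + 1)) + 1
    static double defaultIdf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // ASSUMED form of the flattened variant: square root of the same
    // expression, with a configurable logarithm base (10 in the message).
    static double sqrtLogIdf(int docFreq, int numDocs, double base) {
        double log = Math.log(numDocs / (double) (docFreq + 1)) / Math.log(base);
        return Math.sqrt(log + 1.0);
    }

    // idf is applied once in the query weight and once in the field
    // weight, so a term's score is proportional to idf * idf.
    static double idfContribution(double idf) {
        return idf * idf;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        for (int docFreq : new int[] {1, 9, 99, 499}) {
            double d = idfContribution(defaultIdf(docFreq, numDocs));
            double s = idfContribution(sqrtLogIdf(docFreq, numDocs, 10.0));
            System.out.printf("docFreq=%d default idf^2=%.3f sqrt-log idf^2=%.3f%n",
                    docFreq, d, s);
        }
    }
}
```

With the square root in place, the squared contribution is just the logarithmic expression itself, so rare terms dominate common ones far less strongly than under the default.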

The net effects are to increase flattening of tf and idf by a constant,
increase flattening of lengthNorm fundamentally (sqrt to log), and
eliminate large lengthNorm effects with very small fields (further
flattening its effect).
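The lengthNorm change can be illustrated the same way. This is a sketch under stated assumptions: the default is taken to be 1/sqrt(numTerms) as in DefaultSimilarity, and the modified form is read directly from the description above as 3 / log10(1000 + numTerms). The class and method names are hypothetical, and the one-byte norm encoding Lucene applies when storing norms is ignored.

```java
// Illustrative only: LengthNormSketch and its methods are hypothetical
// names; they restate the formulas described in the message.
final class LengthNormSketch {

    // Formula used by DefaultSimilarity.lengthNorm: 1 / sqrt(numTerms)
    static double defaultLengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Described variant: add 1000 to the field length, flatten with
    // log base-10, and use numerator 3 = log10(1000) so that an empty
    // field still normalizes to 1.
    static double flattenedLengthNorm(int numTerms) {
        return 3.0 / Math.log10(1000.0 + numTerms);
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {1, 10, 100, 1000, 10000}) {
            System.out.printf("numTerms=%d default=%.4f flattened=%.4f%n",
                    numTerms,
                    defaultLengthNorm(numTerms),
                    flattenedLengthNorm(numTerms));
        }
    }
}
```

Under the default, a 1-term field outweighs a 100-term field by a factor of 10; under the flattened form the two differ by under 2%, which is the "eliminate large lengthNorm effects with very small fields" behavior.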

At least in the case of multiple fields with meaningful field-boosts,
I've found these all improve relevance (i.e., in my app).  I found and
made the changes 1-at-a-time based on analyzing explain()'s with result
lists my app produces.

Re. this analysis, any sequencing of considering the different changes
is fine with me, although once again, I don't think these are completely
orthogonal considerations.  The combination of Similarity tuning
decisions has impact above-and-beyond the individual effects.

  > [2]
  > 
  > I guess it's obvious from the above, but just to make it clear -
  > I'll change the page to only do single field queries - but how many
  > variations do we want to see in parallel - the current page shows
  > 2x2 results, for each combo of index and query - but I, say, show
  > several more queries in parallel w/ different weights...
  >

I'd like to keep the current multi-field results as there hasn't been
much analysis of this yet.

Re. other scenarios, I think we should look at:
  1.  Current QueryParser and DefaultSimilarity with single field and
      Default-OR.
  2.  Above with Default-AND.
  3.  My Similarity (or subset thereof) and current QueryParser with
      Default-OR.
  4.  Above (3) with Default-AND.


Consideration of proximity solutions (e.g., Doug's DensityQuery for
Default-AND, and what I'm proposing for Default-OR) should be separate.

My $0.02,

Chuck

  > -----Original Message-----
  > From: David Spencer [mailto:dave-lucene-dev@tropo.com]
  > Sent: Tuesday, February 01, 2005 10:59 AM
  > To: Lucene Developers List
  > Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring
  > benchmark evaluation. Was RE: How to proceed with Bug 31841 -
  > MultiSearcher problems with Similarity.docFreq() ?
  > 
  > Doug Cutting wrote:
  > 
  > > David Spencer wrote:
  > >
  > >>
  > >> +(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
  > >>
  > >> (f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5
  > >>
  > >> (f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4)
  > >> (f1:t5^2.0 t5) f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5
  > >
  > >
  > > This looks great to me!  I'd make mand=true by default, i.e., have
  > > a method where this parameter is not specified.  Similarly, we
  > > might default phraseBoosts[i] to boolBoosts[i]*phraseBoost, and
  > > slops to infinity.  What we want is something that provides only
  > > the knobs that we think most folks will need.  Ideally we wouldn't
  > > even need to specify fieldBoosts.  Short fields like titles get a
  > > larger lengthNorm, which effectively boosts them a lot already.
  > 
  > Yeah I agree w/ all of the above, offer options but have easy to use
  > ways of calling it w/ intelligent defaults.
  > >
  > > But perhaps we should back off and first just evaluate single
  > > field search with different idf, tf (and perhaps lengthNorm and
  > > sloppyFreq) definitions.  Once we're happy with those, then we
  > > should return to different multi-field query formulations.
  > >
  > > Let's start with the issue that's been raised so much: whether idf
  > > is better defined with log() or sqrt(log()).
  > 
  > I can redo my page and rebuild indexes if necessary, I just need it
  > clarified what we want to do, esp -> does the index need to be
  > rebuilt?
  > 
  > [1]
  > 
  > I currently have 2 variations on the index, one w/ the default
  > settings and another with the Similarity code Chuck attached to the
  > bug report. Do we need other variations on the index e.g. with
  > different weights, or during indexing are the weights less important
  > than the log() vs. sqrt(log()) issue?
  > 
  > [2]
  > 
  > I guess it's obvious from the above, but just to make it clear -
  > I'll change the page to only do single field queries - but how many
  > variations do we want to see in parallel - the current page shows
  > 2x2 results, for each combo of index and query - but I, say, show
  > several more queries in parallel w/ different weights...
  > 
  > 
  > >
  > > Doug
  > >
  > >
  > > ---------------------------------------------------------------------
  > > To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
  > > For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
  > >