lucene-dev mailing list archives

From "Chuck Williams"
Subject RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Tue, 01 Feb 2005 18:05:26 GMT
Doug Cutting wrote:
  > That's a lot of functionality bundled into a single Query class!  I'd
  > rather make it possible to assemble this from reusable parts.  And it
  > almost can be already.  Then we can offer such a thing pre-packaged.

That would be great, if it could be done.

  > So let me take it point-by-point:
  > 1a-c is the new MultiFieldQueryParser implementation.
  > 1d is Similarity.sloppyFreq()
  > 2 is BooleanQuery (except the weird optional stuff)

BooleanQuery does support the "weird optional stuff"; these are just
BooleanClauses that are neither required nor prohibited.  I don't
consider that "weird".
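As a minimal sketch of what "optional" means here, the following models a
clause list outside of Lucene.  The Clause class and matchedClauses method
are hypothetical names invented for illustration, not Lucene API, and the
rule that a query with no required clauses needs at least one optional match
is my understanding of BooleanQuery's behavior rather than something stated
in this thread.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model only: plain booleans stand in for real Lucene clauses.
class OptionalClauseSketch {
    // Hypothetical stand-in for one BooleanClause evaluated against one document.
    static final class Clause {
        final boolean required;
        final boolean prohibited;
        final boolean matches;

        Clause(boolean required, boolean prohibited, boolean matches) {
            this.required = required;
            this.prohibited = prohibited;
            this.matches = matches;
        }
    }

    // Returns the number of matching non-prohibited clauses,
    // or -1 if the document is rejected.
    static int matchedClauses(List<Clause> clauses) {
        int matched = 0;
        boolean anyRequired = false;
        for (Clause c : clauses) {
            if (c.prohibited) {
                if (c.matches) return -1;   // a prohibited clause matched
                continue;
            }
            if (c.required) {
                anyRequired = true;
                if (!c.matches) return -1;  // a required clause failed
            }
            if (c.matches) matched++;       // required or optional match
        }
        // Assumed rule: with no required clauses, at least one optional
        // clause has to match for the document to be a hit.
        if (!anyRequired && matched == 0) return -1;
        return matched;
    }

    public static void main(String[] args) {
        List<Clause> clauses = Arrays.asList(
            new Clause(true, false, true),    // required, matches
            new Clause(false, false, true),   // optional, matches
            new Clause(false, false, false)); // optional, no match
        System.out.println(matchedClauses(clauses));
    }
}
```

In this model an optional clause never rejects a document on its own; it
only raises the count of matched clauses, which is the quantity a
coord()-style factor would reward.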

  > 3a is TermQuery and PhraseQuery
  > 3b is DensityPhraseQuery (to be implemented)
  > 3c is Similarity.coord()
  > So I think this can be implemented using the expansion I proposed
  > yesterday for MultiFieldQueryParser, plus something like my
  > DensityPhraseQuery and perhaps a few Similarity tweaks.

I don't think that works unless the mechanism is limited to default-AND
(i.e., all clauses required).  As soon as you support default-OR, then
what I've been calling the term diversity problem arises (which might
better be called the term coverage problem; i.e., ensure that matching
more terms in the query in some field is better than repeatedly matching
the same term in different fields).

I address the term coverage problem, without consideration of proximity,
by using DistributingMultiFieldQueryParser and MaxDisjunctionQuery.
These work well, as Dave's example site shows.
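To make the coverage problem concrete, here is a small sketch that compares
two ways of combining per-field scores for a two-term, two-field query.  It
assumes that the max-disjunction approach scores each term by its
best-scoring field and then sums over terms; that is a simplification for
illustration, not a description of the actual MaxDisjunctionQuery code.  The
class name, method names, and score values are all made up.

```java
// Illustrative only: plain arrays stand in for Lucene's scorers.
class CoverageSketch {
    // scores[t][f] = hypothetical score of query term t in field f (0 = no match).
    // Document A matches term 0 in both fields and never matches term 1.
    static final double[][] DOC_A = { {1.0, 1.0}, {0.0, 0.0} };
    // Document B matches term 0 in field 0 and term 1 in field 1.
    static final double[][] DOC_B = { {1.0, 0.0}, {0.0, 0.8} };

    // Naive expansion: every (term, field) clause is optional and scores are summed.
    static double sumOverAllClauses(double[][] scores) {
        double total = 0.0;
        for (double[] perField : scores) {
            for (double s : perField) {
                total += s;
            }
        }
        return total;
    }

    // Assumed max-disjunction combination: best field per term, summed over terms.
    static double sumOfPerTermMax(double[][] scores) {
        double total = 0.0;
        for (double[] perField : scores) {
            double best = 0.0;
            for (double s : perField) {
                best = Math.max(best, s);
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("sum:  A=" + sumOverAllClauses(DOC_A)
            + " B=" + sumOverAllClauses(DOC_B));
        System.out.println("max:  A=" + sumOfPerTermMax(DOC_A)
            + " B=" + sumOfPerTermMax(DOC_B));
    }
}
```

With plain summation, document A outranks B even though it covers only one
of the two query terms, because the same term is counted once per field.
Taking the per-term maximum removes that double counting, so B, which covers
both terms, ranks first.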

However, I don't see a way to integrate term proximity into that
expansion.  Specifically, I don't see a way to handle proximity and
coverage simultaneously without managing the multiple fields, field
boosts and proximity considerations in a single query class.  Whence,
the proposal for such a class.

Do you see a way to do that?  I.e., do you see a scalable expansion that
addresses all the issues for both default-OR and default-AND?  I think
the query class I've proposed does that, and should be no more complex
than the current SpanQuery mechanism, for example.  Also, it should be
more efficient than a nested construction of more primitive components
since it can be directly optimized.  I think this could make a
substantial improvement to Lucene's relevance ranking.

  > I wasn't arguing that we shouldn't alter the idf definition.
  > the opposite in fact.  If squaring idf is bad, then that should show
  > in single-field search and we can adjust it in that context.  You
  > claimed that good idf formulation is confounded with multi-field
  >   I do not believe that and that's what I was speaking to.  The
  > work you cite is all single-field stuff.

I didn't object to a single-field test.  I think my message started by
agreeing to that.  What I said is that optimal idf-tuning is a
function of the fields and query expansions being used.  In general, I
believe in tuning relevance ranking per application.  In my experience,
this makes a huge difference.  E.g., Google's relevance ranking works
well on the web, but is known to produce poor results in typically
link-poor enterprise document repositories (there have been many
published comments about this, and I've competed with them directly and
demonstrated it to potential customers).
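
For readers following the "squaring idf" point, the sketch below shows why
the exponent matters.  It assumes the idf formula I recall from Lucene's
DefaultSimilarity, log(numDocs / (docFreq + 1)) + 1, and assumes that idf
enters a term's score twice (once on the query side, once on the document
side).  The class name and the collection sizes are invented for
illustration.

```java
// Illustrative only: compares idf and idf squared for a rare and a common term.
class IdfSketch {
    // Assumed formula, as recalled from DefaultSimilarity.idf(docFreq, numDocs).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // How much more a rare term weighs than a common one,
    // with idf applied once or applied twice (squared).
    static double rareToCommonRatio(int rareDf, int commonDf, int numDocs, boolean squared) {
        double ratio = idf(rareDf, numDocs) / idf(commonDf, numDocs);
        return squared ? ratio * ratio : ratio;
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;  // made-up collection size
        int rareDf = 10;          // made-up document frequency of a rare term
        int commonDf = 100_000;   // made-up document frequency of a common term
        System.out.println("idf ratio:   "
            + rareToCommonRatio(rareDf, commonDf, numDocs, false));
        System.out.println("idf^2 ratio: "
            + rareToCommonRatio(rareDf, commonDf, numDocs, true));
    }
}
```

Squaring widens the gap between rare and common terms, so how much a single
rare-term match should dominate is exactly the kind of setting that can
differ from one application's fields and query expansions to another's.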

