lucene-dev mailing list archives

From "Chuck Williams"
Subject RE: Scoring benchmark evaluation. Was RE: How to proceed with Bug31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Mon, 31 Jan 2005 18:05:49 GMT
Miles Barr wrote:
  > Are there any plans to unify your classes with the
  > MultiFieldQueryParser? I think eventually it would make sense to
  > generate the queries during parsing rather than rebuilding them.

I don't plan this integration for the current expansion, but see below.

  > But there are benefits to the current technique. I've had some
  > experiences where the generated query doesn't work with the text
  > highlighting package, so it's useful to keep the original query

I have integration with the highlighter on my schedule and will attempt
to fix any problems there (with this expansion or the improved one

  > Another feature for an advanced query parser might be the ability to
  > alter the query class used depending on the field type. e.g. for
  > fields you would use TermQueries and PhraseQueries but for a
  > field it would use spans instead (as described in the Lucene book).

Adding a nearness heuristic to the expansion is important.  The
SpanQuery mechanism is one possibility, and there has been some recent
discussion on this list about a phrase query with a large slop.
However, it doesn't appear to me that either of those meets the
requirements of multi-field searching.  It is not required that all
terms be in all fields, and it should be optional whether or not all
terms need to be in some field.  Matches with terms closer together in
a given field should score higher than otherwise-equivalent matches
with terms further apart.
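
As a rough illustration of the proximity preference described above,
here is a minimal sketch in plain Java with no Lucene dependency.  The
class ProximitySketch, its method names, and the "matched terms divided
by smallest covering window" factor are all assumptions made for
illustration; none of them come from the patch or from Lucene itself.
Query terms missing from the field are simply ignored, since not every
term has to occur in every field.

```java
import java.util.*;

// Hypothetical illustration only: not a Lucene class.
class ProximitySketch {

    /** Query terms that actually occur in the field. */
    static Set<String> presentTerms(List<String> fieldTokens, Set<String> queryTerms) {
        Set<String> present = new HashSet<>(fieldTokens);
        present.retainAll(queryTerms);
        return present;
    }

    /**
     * Length in tokens of the smallest window that contains at least one
     * occurrence of every query term present in the field; 0 if none match.
     */
    static int minWindow(List<String> fieldTokens, Set<String> queryTerms) {
        Set<String> present = presentTerms(fieldTokens, queryTerms);
        if (present.isEmpty()) {
            return 0;
        }
        Map<String, Integer> counts = new HashMap<>();
        int best = Integer.MAX_VALUE;
        int left = 0;
        int covered = 0;
        for (int right = 0; right < fieldTokens.size(); right++) {
            String t = fieldTokens.get(right);
            if (present.contains(t) && counts.merge(t, 1, Integer::sum) == 1) {
                covered++;
            }
            // Shrink from the left while the window still covers every present term.
            while (covered == present.size()) {
                best = Math.min(best, right - left + 1);
                String l = fieldTokens.get(left++);
                if (present.contains(l) && counts.merge(l, -1, Integer::sum) == 0) {
                    covered--;
                }
            }
        }
        return best;
    }

    /** 1.0 when the matched terms are adjacent, smaller as they spread apart, 0.0 on no match. */
    static double proximityFactor(List<String> fieldTokens, Set<String> queryTerms) {
        Set<String> present = presentTerms(fieldTokens, queryTerms);
        if (present.isEmpty()) {
            return 0.0;
        }
        return (double) present.size() / minWindow(fieldTokens, queryTerms);
    }
}
```

For the query terms {lucene, scoring}, a field in which the two terms
are adjacent gets a factor of 1.0, while a field with two tokens
between them gets 0.5.  A field containing only one of the terms also
gets 1.0, so term coverage would have to be rewarded separately.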

I think implementing the desired behavior requires a new query class
that intrinsically manages multiple fields, term diversity across fields
and term proximity within fields.  DistributingMultiFieldQueryParser
with MaxDisjunctionQuery/Scorer handles the multiple fields and term
diversity, but not the proximity.  I think there are non-scalable
expansions that could achieve term-proximity favoring within the
mechanism, but they would not perform well, especially for longer

If there isn't already something that achieves all 3 properties floating
around somewhere, I'm going to look into writing it.  It would make
sense for an intrinsic query like this to be integrated with

