lucene-general mailing list archives

From gmuresan <gmure...@acm.org>
Subject Score combination - Filtering vs. Querying
Date Thu, 16 Jun 2011 02:24:56 GMT
The issue I have is well illustrated by section 3.4.5, "Combining queries:
BooleanQuery", in Lucene in Action (LIA), 2nd ed. The example uses
BooleanQuery to combine
	- a TermQuery that matches the document topic, for which TF-IDF scoring
makes sense; and
	- a NumericRangeQuery, whose only purpose is to filter by publication date.

I extended the example code to output the query and the explanation:

Title AND Date = +subject:search +pubmonth:[201001 TO 201012]
----------
Lucene in Action, Second Edition
1.6848878 = (MATCH) sum of:
  1.3560408 = (MATCH) weight(subject:search in 9), product of:
    0.9443832 = queryWeight(subject:search), product of:
      2.871802 = idf(docFreq=1, maxDocs=13)
      0.3288469 = queryNorm
    1.435901 = (MATCH) fieldWeight(subject:search in 9), product of:
      1.0 = tf(termFreq(subject:search)=1)
      2.871802 = idf(docFreq=1, maxDocs=13)
      0.5 = fieldNorm(field=subject, doc=9)
  0.3288469 = (MATCH) ConstantScoreQuery(pubmonth:[201001 TO 201012]), product of:
    1.0 = boost
    0.3288469 = queryNorm

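Applying a queryNorm to the NumericRangeQuery has no meaning for ranking:
every document inside the range receives the same constant (boost *
queryNorm). Instead of simply filtering by date, this clause contributes a
substantial amount to the overall score (0.3288 out of 1.6849, roughly 20%),
and proportionally more when the text match has a low score.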
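
The numbers in the explanation above can be reproduced by hand, which shows
where the constant comes from. The sketch below is only an illustration: it
assumes the DefaultSimilarity formulas as I understand them for Lucene 3.x
(idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(termFreq), queryNorm =
1 / sqrt(sum of squared clause weights)), and the class and method names are
made up for this example. It uses doubles, so the last digit may differ from
Lucene's float arithmetic.

```java
public class ExplainArithmetic {
    // Assumed DefaultSimilarity idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Assumed queryNorm: 1 / sqrt(sum of squared per-clause query weights).
    static double queryNorm(double... clauseWeights) {
        double sumOfSquares = 0.0;
        for (double w : clauseWeights) {
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // Term clause: queryWeight (idf * queryNorm) times fieldWeight (tf * idf * fieldNorm).
    static double termScore(double termIdf, double norm, int termFreq, double fieldNorm) {
        double queryWeight = termIdf * norm;
        double fieldWeight = Math.sqrt(termFreq) * termIdf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Constant-score clause: boost * queryNorm, identical for every matching document.
    static double constantScore(double boost, double norm) {
        return boost * norm;
    }

    public static void main(String[] args) {
        double termIdf = idf(1, 13);               // docFreq=1, maxDocs=13
        double norm = queryNorm(termIdf, 1.0);     // term clause weight and range clause boost
        double term = termScore(termIdf, norm, 1, 0.5);
        double range = constantScore(1.0, norm);
        System.out.printf("idf=%.6f queryNorm=%.7f%n", termIdf, norm);
        System.out.printf("term=%.7f range=%.7f total=%.7f%n", term, range, term + range);
        System.out.printf("range share of total=%.1f%%%n", 100.0 * range / (term + range));
    }
}
```

Under these assumptions the range clause adds 1.0 to the sum of squared
weights, so it both lowers the queryNorm applied to the term clause and adds
its own constant to every hit.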

In my own (inherited) application I have multiple textual queries, matching
against different fields, combined with several NumericRangeQueries. The
contributions of the latter to the scores make it hard to control the boosts
of the different fields.

The logical course of action seems to me to be replacing the
NumericRangeQueries with filters. This means removing the NumericRangeQueries
from the overall BooleanQuery and separately building a filter that combines
the corresponding NumericRangeFilters. The options I see are:
	- Use BooleanFilter
	- Use ChainedFilter
	- In order to change as little code as possible, keep the code that
combines all NumericRangeQueries into a BooleanQuery, and wrap that in a
QueryWrapperFilter.
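
To make the intended behaviour concrete, here is a conceptual model in plain
Java. It is not Lucene API code: it only assumes that a filter reduces to a
set of accepted document ids, that several filters are combined by
intersecting those sets, and that the text score of an accepted document is
left untouched. The class name, method names, the second "pages" field and
all data values are hypothetical.

```java
import java.util.BitSet;

public class FilterSketch {
    // Mark every document whose value lies in [min, max] (inclusive).
    static BitSet rangeBits(int[] values, int min, int max) {
        BitSet bits = new BitSet(values.length);
        for (int doc = 0; doc < values.length; doc++) {
            if (values[doc] >= min && values[doc] <= max) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Combine filters with AND; the inputs are not modified.
    static BitSet intersect(BitSet first, BitSet... rest) {
        BitSet result = (BitSet) first.clone();
        for (BitSet other : rest) {
            result.and(other);
        }
        return result;
    }

    // Accepted documents keep their text score unchanged; rejected ones become NaN.
    static double[] filteredScores(double[] textScores, BitSet accepted) {
        double[] out = new double[textScores.length];
        for (int doc = 0; doc < textScores.length; doc++) {
            out[doc] = accepted.get(doc) ? textScores[doc] : Double.NaN;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] pubmonth = {200912, 201003, 201011, 201101};   // hypothetical values
        int[] pages = {300, 50, 400, 200};                   // hypothetical second field
        double[] textScores = {0.9, 1.2, 1.3560408, 0.4};    // hypothetical text-only scores

        BitSet accepted = intersect(rangeBits(pubmonth, 201001, 201012),
                                    rangeBits(pages, 100, 500));
        double[] scores = filteredScores(textScores, accepted);
        for (int doc = 0; doc < scores.length; doc++) {
            System.out.println("doc " + doc + ": " + scores[doc]);
        }
    }
}
```

In this model the range restrictions decide only which documents survive;
the ranking among survivors depends on the textual clauses alone, which is
the behaviour I am after.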

Q1: Are there any advantages or disadvantages (in performance, for example)
to each of these options?
Q2: Are there any plans to improve how Lucene deals, in a principled way,
with this issue of combining TermQueries and NumericRangeQueries?


--
View this message in context: http://lucene.472066.n3.nabble.com/Score-combination-Filtering-vs-Querying-tp3070425p3070425.html
Sent from the Lucene - General mailing list archive at Nabble.com.
