lucene-general mailing list archives

From gmuresan <gmure...@acm.org>
Subject Re: Score combination - Filtering vs. Querying
Date Thu, 16 Jun 2011 02:32:30 GMT
...
I've read more forum discussions on this issue, and some people point out
(as LIA 2nd ed., p. 183, does) that using a filter reduces the number of
documents under consideration, which impacts IDF and therefore the overall
score. Moreover, the recommendation in those discussions is that, unless
CachingWrapperFilter yields a large performance gain, MUST BooleanClauses
are preferred to Filters.

This doesn't quite make sense to me: the number of documents in the
collection, the size of the vocabulary, the size of each posting list and
the IDF of each term are known after indexing and should not be affected by
filtering.
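
As a rough check on this point, here is a minimal sketch of the idf formula
that Lucene 3.x's DefaultSimilarity is understood to use,
idf = 1 + ln(numDocs / (docFreq + 1)). The function name and the Python
rendering are illustrative assumptions, not Lucene code; the inputs are the
docFreq=1 and maxDocs=13 values that explain() reports in the examples in
this message.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """idf assumed from Lucene 3.x DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Both inputs are index-wide statistics fixed at indexing time;
# no filter or candidate-set size appears anywhere in the formula.
idf_eastern = idf(doc_freq=1, num_docs=13)
print(round(idf_eastern, 6))  # about 2.871802, the idf shown by explain()
```

Under that assumption, the only way a filter could change IDF is by changing
docFreq or maxDocs, and neither is recomputed at search time.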

To test this, I further modified the same LIA example and compared the use
of a BooleanClause and the use of a Filter:

Q = category:/technology/computers/programming/methodology category:/philosophy/eastern +pubmonth:[200501 TO 201012]
----------
Tao Te Ching ???
1.4739084 = (MATCH) product of:
  2.2108626 = (MATCH) sum of:
    1.9717792 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
      0.68659997 = queryWeight(category:/philosophy/eastern), product of:
        2.871802 = idf(docFreq=1, maxDocs=13)
        0.23908332 = queryNorm
      2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
        1.0 = tf(termFreq(category:/philosophy/eastern)=1)
        2.871802 = idf(docFreq=1, maxDocs=13)
        1.0 = fieldNorm(field=category, doc=4)
    0.23908332 = (MATCH) ConstantScoreQuery(pubmonth:[200501 TO 201012]), product of:
      1.0 = boost
      0.23908332 = queryNorm
  0.6666667 = coord(2/3)

Q = +(category:/technology/computers/programming/methodology category:/philosophy/eastern) +pubmonth:[200501 TO 201012]
----------
Tao Te Ching ???
1.224973 = (MATCH) sum of:
  0.9858896 = (MATCH) product of:
    1.9717792 = (MATCH) sum of:
      1.9717792 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
        0.68659997 = queryWeight(category:/philosophy/eastern), product of:
          2.871802 = idf(docFreq=1, maxDocs=13)
          0.23908332 = queryNorm
        2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
          1.0 = tf(termFreq(category:/philosophy/eastern)=1)
          2.871802 = idf(docFreq=1, maxDocs=13)
          1.0 = fieldNorm(field=category, doc=4)
    0.5 = coord(1/2)
  0.23908332 = (MATCH) ConstantScoreQuery(pubmonth:[200501 TO 201012]), product of:
    1.0 = boost
    0.23908332 = queryNorm

Q = category:/technology/computers/programming/methodology category:/philosophy/eastern
Date = pubmonth:[200501 TO 201112]
----------
Tao Te Ching ???
1.0153353 = (MATCH) product of:
  2.0306706 = (MATCH) sum of:
    2.0306706 = (MATCH) weight(category:/philosophy/eastern in 4), product of:
      0.70710677 = queryWeight(category:/philosophy/eastern), product of:
        2.871802 = idf(docFreq=1, maxDocs=13)
        0.24622406 = queryNorm
      2.871802 = (MATCH) fieldWeight(category:/philosophy/eastern in 4), product of:
        1.0 = tf(termFreq(category:/philosophy/eastern)=1)
        2.871802 = idf(docFreq=1, maxDocs=13)
        1.0 = fieldNorm(field=category, doc=4)
  0.5 = coord(1/2)

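Comparing the results, I see that:
	- maxDocs (13) and IDF (2.871802) are the same in all three cases;
	- queryNorm and coord can differ: queryNorm is 0.23908332 with the
BooleanClause and 0.24622406 with the Filter, and coord is 2/3 in the first
query versus 1/2 in the other two. The correct values are the ones obtained
when using Filter; BooleanClauses introduce artificial query clauses that
affect these metrics;
	- the BooleanClause also introduces a ConstantScoreQuery that further
impacts the "true" score.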
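
The three totals above can be reproduced with a short arithmetic sketch.
It assumes the DefaultSimilarity formulas as commonly documented for
Lucene 3.x (queryNorm = 1/sqrt(sumOfSquaredWeights), coord =
overlap/maxOverlap) and, as a further assumption inferred from the reported
queryNorm values rather than stated in the output, that the non-matching
methodology term also has docFreq=1. All function and variable names are
illustrative.

```python
import math


def idf(doc_freq, num_docs):
    # Assumed DefaultSimilarity idf: 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def query_norm(sum_of_squared_weights):
    # Assumed DefaultSimilarity queryNorm: 1 / sqrt(sumOfSquaredWeights)
    return 1.0 / math.sqrt(sum_of_squared_weights)


IDF = idf(1, 13)      # same for both category terms (assumption for the second)
TERM_SQ = IDF ** 2    # squared weight of one category term, boost 1.0
RANGE_SQ = 1.0        # squared boost of the ConstantScoreQuery range clause


def term_score(qn):
    # queryWeight * fieldWeight, with tf = 1 and fieldNorm = 1 as in the output
    return (IDF * qn) * (1.0 * IDF * 1.0)


# Queries 1 and 2: the range is a query clause, so it enters the norm.
qn_clause = query_norm(2 * TERM_SQ + RANGE_SQ)

# Query 1: three flat clauses, two of them match -> coord(2/3) on the sum.
score_flat = (term_score(qn_clause) + qn_clause) * (2 / 3)

# Query 2: coord(1/2) applies only inside the nested category group.
score_nested = term_score(qn_clause) * (1 / 2) + qn_clause

# Query 3: the range is a Filter, so only the two terms enter the norm.
qn_filter = query_norm(2 * TERM_SQ)
score_filter = term_score(qn_filter) * (1 / 2)

print(qn_clause, qn_filter)
print(score_flat, score_nested, score_filter)
```

Under these assumptions the differences between the three scores come
entirely from the extra clause's contribution to sumOfSquaredWeights, its
effect on coord, and the constant score it adds; IDF never changes.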

I would conclude that from the perspective of obtaining "true" scores, using
Filter is preferred to using MUST BooleanClause in a BooleanQuery.

The TF-IDF model (as well as other IR models) was developed for text-like
features. The assumptions made in that model do not apply to numeric fields
such as date or longitude/latitude, appropriate for faceted filtering, so
the two models should not be mixed in a common query.

Q3. Considering that all the expert opinions I've read in forums speak
against filtering, is there something that I'm missing?


--
View this message in context: http://lucene.472066.n3.nabble.com/Score-combination-Filtering-vs-Querying-tp3070425p3070439.html
Sent from the Lucene - General mailing list archive at Nabble.com.
