lucene-dev mailing list archives

From "Chuck Williams" <>
Subject RE: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Mon, 31 Jan 2005 21:21:55 GMT
Doug Cutting wrote:
  > Is the default operator AND or OR?  It appears to be OR, but it should
  > probably be AND.  That's become the industry standard since
  > was first written.  Also, any chance we can get explanations for

Explanations are available.  Click the score link on a result.

  > It is difficult to decipher what's doing what.  I think we should
  > separately evaluate query formulation and boosting from changes to
  > tf/idf.

Earlier I proposed the opposite as my mechanism is designed to work in
concert:  i.e., the Similarity and the query parsing work together.
Most real collections have at least title and body fields.  We decided
to look at the combined structure and compare results, then dig into
individual details as appropriate to understand the results.

The analysis can be approached bottom-up, a factor at a time, or top
down, looking at two complete formulations and then dissecting them to
further understand their differences.

I think the differences are pretty clear as the system stands.  Notice
a substantial difference in the idf's in the respective explanations.  I
continue to think the current mechanism weights these too high,
primarily due to its squaring.
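
To make the squaring point concrete, here is a minimal sketch. It assumes
the classic idf formula, 1 + ln(numDocs / (docFreq + 1)), and relies on idf
entering the score twice (once through the query weight and once through
the field weight). The collection size and document frequencies are
hypothetical, chosen only for illustration, and normalization factors are
ignored.

```python
import math

def classic_idf(doc_freq, num_docs):
    # Classic idf: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical collection statistics, for illustration only.
num_docs = 1_000_000
rare_idf = classic_idf(10, num_docs)         # term found in 10 documents
common_idf = classic_idf(100_000, num_docs)  # term found in 100,000 documents

# How much more a rare term counts than a common one when idf is
# applied once versus twice (squared).
linear_ratio = rare_idf / common_idf
squared_ratio = (rare_idf ** 2) / (common_idf ** 2)

print(f"idf ratio, applied once:  {linear_ratio:.2f}")
print(f"idf ratio, applied twice: {squared_ratio:.2f}")
```

Whatever the statistics, the squared ratio is the square of the linear
one, so any gap between rare and common terms is amplified.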

The other big difference occurs when not all query terms are required,
as the current mechanism then does not consider term diversity (e.g., t1
in title and in content gets as good a score as t1 in title and t2 in
content), while the new approach does.
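
The diversity example can be sketched with a toy scorer. Neither function
below is Lucene's formula or the proposed Similarity; both are hypothetical
stand-ins that count one point per (field, term) match, with the second
scaled by the fraction of distinct query terms matched anywhere in the
document.

```python
def naive_score(doc, terms, fields):
    # One point per (field, term) match; ignores which terms matched.
    return sum(1 for f in fields for t in terms if t in doc.get(f, ()))

def diversity_score(doc, terms, fields):
    # Scale the naive sum by the fraction of distinct query terms
    # that match in at least one field.
    matched = {t for t in terms if any(t in doc.get(f, ()) for f in fields)}
    return naive_score(doc, terms, fields) * len(matched) / len(terms)

fields = ("title", "content")
terms = ("t1", "t2")

doc_a = {"title": {"t1"}, "content": {"t1"}}  # t1 in title and in content
doc_b = {"title": {"t1"}, "content": {"t2"}}  # t1 in title, t2 in content

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name, naive_score(doc, terms, fields),
          diversity_score(doc, terms, fields))
```

Under the naive sum both documents tie; with the diversity factor, the
document that matches both terms ranks first.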

  > MultiFieldQueryParser is known to be deficient.  A better
  > general-purpose multi-field query formulator might be like that used by
  > Nutch. It would translate a query "t1 t2" given fields f1 and f2 into
  > something like:
  > +(f1:t1^b1 f2:t1^b2)
  > +(f1:t2^b1 f2:t2^b2)
  > f1:"t1 t2"~s1^b3
  > f2:"t1 t2"~s2^b4

This does not seem scalable.  How do you expand a general query with n
terms?  I believe Dave has some code that generates all the pairwise
combinations, but this is quadratic in the length of the query and it
doesn't consider proximity of larger collections of query terms.
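
The scaling concern can be illustrated with a rough sketch. The `expand`
function is one hypothetical way to generalize the quoted two-term pattern
to n terms (one required clause per term, plus a single sloppy phrase per
field). It is not Nutch's code or Dave's; the boost and slop names (b1, s1,
...) are placeholders. The second function counts the phrase clauses a
pairwise scheme would need.

```python
def expand(terms, fields):
    # Hypothetical generalization: one required clause per term that
    # spans all fields, then one sloppy phrase clause per field.
    clauses = []
    for t in terms:
        per_field = " ".join(f"{f}:{t}^b{i}" for i, f in enumerate(fields, 1))
        clauses.append(f"+({per_field})")
    phrase = " ".join(terms)
    for i, f in enumerate(fields, 1):
        clauses.append(f'{f}:"{phrase}"~s{i}^b{len(fields) + i}')
    return clauses

def pairwise_phrase_count(n_terms, n_fields):
    # A pairwise scheme needs one phrase clause per term pair per
    # field: n * (n - 1) / 2 pairs, which is quadratic in n.
    return n_fields * n_terms * (n_terms - 1) // 2

for clause in expand(["t1", "t2"], ["f1", "f2"]):
    print(clause)

for n in (2, 5, 10):
    print(n, "terms:", pairwise_phrase_count(n, 2), "pairwise phrase clauses")
```

The single-phrase form stays linear in the number of terms but rewards
proximity only when all terms occur together. The pairwise form grows
quadratically and still says nothing about proximity among three or more
terms.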

I sent a note earlier today suggesting that a new Query class is needed
that simultaneously handles multiple fields, term diversity and term

  > Do folks agree that this is a good general formulation?

Not unless it is scalable and the desire is to require all query terms.
I would rather not require all query terms, which introduces a more
complex diversity requirement (ensure that as many distinct query terms
as possible are matched somewhere).

I'm interested in solving this problem and would be happy to contribute
whatever I write.
