lucene-dev mailing list archives

From Doug Cutting
Subject Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Mon, 31 Jan 2005 23:03:30 GMT
Chuck Williams wrote:
> That expansion is scalable, but it only accounts for proximity of all
> query terms together.  E.g., it does not favor a match where t1 and t2
> are close together while t3 is distant over a match where all 3 terms
> are distant.  Worse, it would not favor a match with t1 and t2 in a
> short title, and t2 and t3 proximal in the content (with no occurrence
> of t1 in the content) vs. a match with t1 and t2 in the title and t2 and
> t3 distant in the content.

Right.  I just mentioned this same weakness in a message replying to David.
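
To make the weakness concrete, here is a minimal sketch. It assumes that "proximity of all query terms together" can be approximated by the width of the smallest window containing one occurrence of every term. That is a simplification for illustration, not how Lucene computes slop, and the class and method names are hypothetical.

```java
/** Illustrative only: not part of Lucene. */
class AllTermsWindow {

    /**
     * Returns the width of the smallest window that contains at least one
     * position from every term's (sorted) position list, or
     * Integer.MAX_VALUE if some term has no positions.
     */
    static int minWindow(int[][] positionsPerTerm) {
        int[] idx = new int[positionsPerTerm.length];
        int best = Integer.MAX_VALUE;
        while (true) {
            int lo = Integer.MAX_VALUE;
            int hi = Integer.MIN_VALUE;
            int loTerm = -1;
            for (int t = 0; t < positionsPerTerm.length; t++) {
                if (idx[t] >= positionsPerTerm[t].length) {
                    return best; // one term's list is exhausted
                }
                int p = positionsPerTerm[t][idx[t]];
                if (p < lo) { lo = p; loTerm = t; }
                if (p > hi) { hi = p; }
            }
            best = Math.min(best, hi - lo);
            idx[loTerm]++; // only moving the leftmost term can shrink the window
        }
    }

    public static void main(String[] args) {
        // Document A: t1 and t2 adjacent, t3 far away.
        int[][] docA = { {0}, {1}, {100} };
        // Document B: all three terms far apart.
        int[][] docB = { {0}, {50}, {100} };
        System.out.println(minWindow(docA)); // 100
        System.out.println(minWindow(docB)); // 100
    }
}
```

Both documents get the same all-terms window, so a measure of this kind cannot prefer A even though t1 and t2 are adjacent there.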

>   > Is that distinct from my goal to develop an improved
>   > MultiFieldQueryParser for Lucene 2.0?
> Not distinct, but I think the first step is to decide on the expansion
> we want.  Unless somebody has a better idea, I think the best solution
> is a new Query class that simultaneously supports multiple fields, term
> diversity and term proximity.  It would be similar to SpanQuery, but
> generalized.  It would be like BooleanQuery in the sense that individual
> query clauses could be required or not.  Then, default AND could be
> achieved by expanding queries to all-required.
> With this new Query class, revised versions of QueryParser and
> MultiFieldQueryParser would generate it.
> Am I way off-base somewhere and/or is there a simpler approach to the
> same end?

It just sounds like a lot to bite off at once.

What did you think of my DensityPhraseQuery proposal?  We could use this 
in place of a PhraseQuery w/ slop=infinity.  We'd need just one per field.

The straight boolean clauses are required for two reasons:
   1. To make sure that every query term appears in some field; and
   2. To reward a term that occurs frequently in a field, but near no
      other query terms.
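
As a rough sketch of the shape of this expansion: one required boolean clause per term, spread over all fields, plus one proximity clause per field. The `+(...)` grouping is ordinary query-parser syntax. The `density(...)` notation is a made-up placeholder for the proposed DensityPhraseQuery, which has no syntax of its own, and treating the per-field clause as optional is an assumption here. The class and method names are hypothetical too.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

/** Illustrative only: builds a query string, does not use Lucene classes. */
class MultiFieldExpansion {

    static String expand(List<String> fields, List<String> terms) {
        StringJoiner query = new StringJoiner(" ");

        // One required clause per term: the term must occur in some field.
        for (String term : terms) {
            StringJoiner anyField = new StringJoiner(" ", "+(", ")");
            for (String field : fields) {
                anyField.add(field + ":" + term);
            }
            query.add(anyField.toString());
        }

        // One proximity clause per field (placeholder notation, assumed optional).
        String phrase = String.join(" ", terms);
        for (String field : fields) {
            query.add("density(" + field + ":\"" + phrase + "\")");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("title", "content"),
                                  Arrays.asList("t1", "t2")));
        // +(title:t1 content:t1) +(title:t2 content:t2) density(title:"t1 t2") density(content:"t1 t2")
    }
}
```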

> Sure, idf is important enough to evaluate independently as a factor.
> However, I do not think these considerations are orthogonal.  For
> example, I'm putting a lot of weight in field boosting and don't want
> the preference of title matches over body matches to be overwhelmed by
> the idf's.

If field boosting then needs to trump idf, we should be able to deal
with that when we subsequently tune field boosting, no?  We can, e.g.,
square the field boosts if we need to.
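
A toy illustration of the squaring idea, under the assumption that a clause's score is simply the field boost (raised to some exponent) times idf. That is a deliberate simplification, not Lucene's actual scoring formula, and the boost and idf values are invented.

```java
/** Illustrative only: a toy score model, not Lucene's Similarity. */
class BoostVsIdf {

    /** Score = fieldBoost^boostExponent * idf (toy model). */
    static double score(double fieldBoost, double idf, int boostExponent) {
        return Math.pow(fieldBoost, boostExponent) * idf;
    }

    public static void main(String[] args) {
        double titleBoost = 3.0, bodyBoost = 1.0; // invented field boosts
        double commonIdf = 1.5, rareIdf = 6.0;    // invented idf values

        // Plain boosts: a rare term in the body outscores a common term in the title.
        System.out.println(score(titleBoost, commonIdf, 1) > score(bodyBoost, rareIdf, 1)); // false

        // Squared boosts: the title match wins again.
        System.out.println(score(titleBoost, commonIdf, 2) > score(bodyBoost, rareIdf, 2)); // true
    }
}
```

With these numbers the title clause scores 4.5 against 6.0 for the body clause using plain boosts, and 13.5 against 6.0 once the boosts are squared.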

