lucene-dev mailing list archives

From: Doug Cutting
Subject: Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date: Mon, 31 Jan 2005 21:43:56 GMT

Chuck Williams wrote:
> I think the differences are pretty clear as the system stands.  Notice
> a substantial difference in the idf's in the respective explanations.  I
> continue to think the current mechanism weights these too high,
> primarily due to its squaring.
> The other big difference occurs when all query terms are not required,
> as the current mechanism then does not consider term diversity (e.g., t1
> in title and in content gets as good a score as t1 in title and t2 in
> content), while the new approach does.

Right.  I'd like to be able to separately discuss such issues and how to 
fix them.  Confounding them makes changes to Lucene an all-or-nothing 
proposition.  What will be easiest procedurally is to make a series of 
uncontroversial, clear improvements to the code, not wholesale 
replacements.  In the end we may get to the same place, but we'll still 
have more people on board.  I don't think a revolution is required, just 
some evolution.

If we want to change the way idf is used, is there a reason we cannot 
evaluate that change on its own, then, once that's settled, move on to 
the next issue?  We may find that some things cannot be changed in 
isolation, but my guess is that idf and "term diversity" can and should 
be discussed separately.
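
For concreteness on the idf point, here is a rough sketch in Python, not Lucene code.  It assumes the default idf formula, 1 + ln(numDocs / (docFreq + 1)), and that idf enters a term's score twice, once through the query weight and once through the field weight, which is where the squaring comes from.  The function names and the document counts are made up for illustration.

```python
import math

def idf(doc_freq, num_docs):
    """Default-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def idf_contribution(doc_freq, num_docs, squared=True):
    """idf factor in a term's score: idf^2 (current behavior) or plain idf."""
    value = idf(doc_freq, num_docs)
    return value * value if squared else value

# Hypothetical collection: one rare term and one common term.
num_docs = 1_000_000
rare, common = 9, 99_999

# How much more the rare term counts than the common one, each way.
ratio_squared = idf_contribution(rare, num_docs) / idf_contribution(common, num_docs)
ratio_linear = (idf_contribution(rare, num_docs, squared=False)
                / idf_contribution(common, num_docs, squared=False))
print(ratio_squared, ratio_linear)
```

Comparing the two ratios shows how much the squaring widens the gap between rare and common terms, and it can be looked at without touching term diversity at all.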

> > It would translate a query "t1 t2" given fields f1 and f2 into
> > something like:
> >
> > +(f1:t1^b1 f2:t1^b2)
> > +(f2:t1^b1 f2:t2^b2)
> > f1:"t1 t2"~s1^b3
> > f2:"t1 t2"~s2^b4
> This does not seem scalable.  How do you expand a general query with n
> terms?

Perhaps my example was unclear.  Here's a three term query:

+(f1:t1^b1 f2:t1^b2)
+(f1:t2^b1 f2:t2^b2)
+(f1:t3^b1 f2:t3^b2)
f1:"t1 t2 t3"~s1^b3
f2:"t1 t2 t3"~s2^b4

Is that any clearer?
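
In case the general pattern helps, here is a rough sketch of the expansion for n terms, in Python rather than Lucene code.  The function name expand_query and its parameters are hypothetical; it only builds the query string in the form above, with one required clause per term plus one sloppy phrase clause per field, so the output grows linearly with the number of terms.

```python
def expand_query(terms, fields):
    """Build a multi-field query string from terms and per-field settings.

    fields: list of (name, term_boost, slop, phrase_boost) tuples.  The
    values are placeholders kept as strings, e.g. ("f1", "b1", "s1", "b3").
    """
    lines = []
    # One required clause per term: the term must match in at least one field.
    for term in terms:
        alternatives = " ".join(
            f"{name}:{term}^{boost}" for name, boost, _, _ in fields
        )
        lines.append(f"+({alternatives})")
    # One optional sloppy phrase clause per field rewards proximity.
    phrase = " ".join(terms)
    for name, _, slop, phrase_boost in fields:
        lines.append(f'{name}:"{phrase}"~{slop}^{phrase_boost}')
    return "\n".join(lines)

fields = [("f1", "b1", "s1", "b3"), ("f2", "b2", "s2", "b4")]
print(expand_query(["t1", "t2", "t3"], fields))
```

For n terms and m fields this produces n required clauses of m alternatives each, plus m phrase clauses.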

> I sent a note earlier today suggesting that a new Query class is needed
> that simultaneously handles multiple fields, term diversity and term
> proximity.

Is that distinct from my goal to develop an improved 
MultiFieldQueryParser for Lucene 2.0?

> > Do folks agree that this is a good general formulation?
> Not unless it is scalable and the desire is to require all query terms.

I'm not sure what you mean by scalable.

> I would rather not require all query terms, which introduces a more
> complex diversity requirement (ensure that as many distinct query terms
> as possible are matched somewhere).

Requiring all query terms is acceptable and even expected by most 
searchers today.  All of the major web search engines implement this, 
and that's where folks learn to search today.

