lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Mon, 31 Jan 2005 21:05:46 GMT
David Spencer wrote:
> I worked w/ Chuck to get up a test page that shows search results with 2 
> versions of Similarity side by side.

David,

This looks great!  Thanks for doing this.

Is the default operator AND or OR?  It appears to be OR, but it should 
probably be AND.  That's become the industry standard since QueryParser 
was first written.  Also, any chance we can get explanations for hits?

It is difficult to tell which change is responsible for which difference 
in the results.  I think we should evaluate query formulation and 
boosting separately from changes to tf/idf.

We ought to first compare searching body only, ignoring titles, then 
subsequently try different query formulations over multiple fields with 
a fixed weighting algorithm.  Yes, ignoring titles when searching 
wikipedia might not be the best approach, but the point is not to 
over-optimize for wikipedia but rather to find algorithms that work well 
with general text collections.  Removing titles makes the problem 
harder, which should in turn make it easier to see deficiencies.

Simpler yet, we ought to first try body-only with no proximity, just 
AND, in order to select good tf/idf formulations.  Then we should add 
auto-proximity searching into the mix, and finally add multiple fields.
Does this make sense?

MultiFieldQueryParser is known to be deficient.  A better 
general-purpose multi-field query formulator might be like that used by 
Nutch. It would translate a query "t1 t2" given fields f1 and f2 into 
something like:

+(f1:t1^b1 f2:t1^b2)
+(f1:t2^b1 f2:t2^b2)
f1:"t1 t2"~s1^b3
f2:"t1 t2"~s2^b4

Where b1 and b2 are boosts for f1 and f2, and b3 and b4 are boosts for 
phrase matching in f1 and f2, and s1 and s2 are slop for f1 and f2. 
We'd really only need to vary b1 and b3, and could use 1.0 for b2 and b4 
and infinity for s1 and s2.
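
The expansion above can be sketched in plain Java as a function that 
emits the query in Lucene query syntax.  This is only an illustration 
under stated assumptions: the class and method names are hypothetical, 
the input is assumed to be already split into simple terms, and the 
result is a string rather than a Query object, so no Lucene API is 
used or implied.

```java
/** Hypothetical helper (not part of Lucene): builds the formulation as a query string. */
final class MultiFieldFormulation {

    /**
     * Expands pre-tokenized terms over the given fields.
     * boolBoosts[i], phraseBoosts[i] and slops[i] apply to fields[i].
     */
    static String formulate(String[] terms, String[] fields,
                            float[] boolBoosts, float[] phraseBoosts, int[] slops) {
        StringBuilder sb = new StringBuilder();

        // One required clause per term: the term must match in at least one field.
        for (String term : terms) {
            sb.append("+(");
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(fields[i]).append(':').append(term)
                  .append('^').append(boolBoosts[i]);
            }
            sb.append(")\n");
        }

        // One optional sloppy phrase clause per field; it only rewards proximity.
        if (terms.length > 1) {
            String phrase = String.join(" ", terms);
            for (int i = 0; i < fields.length; i++) {
                sb.append(fields[i]).append(":\"").append(phrase).append("\"~")
                  .append(slops[i]).append('^').append(phraseBoosts[i]).append('\n');
            }
        }
        return sb.toString().trim();
    }
}
```

Calling formulate with terms {"t1", "t2"} and fields {"f1", "f2"} yields 
two required per-term clauses followed by one optional phrase clause per 
field, in the same shape as the four lines above.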

Do folks agree that this is a good general formulation?  If so, would 
someone like to contribute a version of MultiFieldQueryParser that 
implements this?  The API should probably be something like:

   static Query parse(String queryString,
                      String[] fields,
                      float[] boolBoosts,
                      float[] phraseBoosts,
                      int[] slops);

A simplified version might be:

   static Query parse(String queryString,
                      String[] fields,
                      float[] boosts);

This could use infinity for slops and assume boolBoosts[i] == 
phraseBoosts[i].
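
As a rough sketch of how the simplified form could map onto the full 
one, the snippet below derives the three per-field arrays from a single 
boosts array.  The class name is hypothetical, and using 
Integer.MAX_VALUE as a stand-in for "infinity" is an assumption, not 
something settled in this thread.

```java
import java.util.Arrays;

/** Hypothetical holder for the arguments the full parse(...) would receive. */
final class SimplifiedParseArgs {
    final float[] boolBoosts;
    final float[] phraseBoosts;
    final int[] slops;

    private SimplifiedParseArgs(float[] boolBoosts, float[] phraseBoosts, int[] slops) {
        this.boolBoosts = boolBoosts;
        this.phraseBoosts = phraseBoosts;
        this.slops = slops;
    }

    /** Expands the simplified (fields, boosts) arguments into the full set. */
    static SimplifiedParseArgs fromBoosts(String[] fields, float[] boosts) {
        if (fields.length != boosts.length) {
            throw new IllegalArgumentException("need exactly one boost per field");
        }
        int[] slops = new int[fields.length];
        // Assumption: the largest int stands in for an unbounded ("infinite") slop.
        Arrays.fill(slops, Integer.MAX_VALUE);
        // boolBoosts[i] == phraseBoosts[i]: both are copies of the same input.
        return new SimplifiedParseArgs(boosts.clone(), boosts.clone(), slops);
    }
}
```

With this mapping the simplified signature needs no logic of its own; 
it could delegate to the full version with the derived arrays.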

Doug

