lucene-dev mailing list archives

From David Spencer <>
Subject Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Mon, 31 Jan 2005 22:41:24 GMT
Doug Cutting wrote:

> David Spencer wrote:
>> I worked w/ Chuck to get up a test page that shows search results with 
>> 2 versions of Similarity side by side.
> David,
> This looks great!  Thanks for doing this.
> Is the default operator AND or OR?  It appears to be OR, but it should 
> probably be AND.  That's become the industry standard since QueryParser 
> was first written.  Also, any chance we can get explanations for hits?
> It is difficult to decipher what's doing what.  I think we should 
> separately evaluate query formulation and boosting from changes to tf/idf.
> We ought to first compare searching body only, ignoring titles, then 

Well, as a step toward analyzing things one variable at a time, I now 
show a 2x2 matrix of search results, one cell for each combination of 
Similarity and query parser:

Upper left cell is the pure default case.
Bottom right cell is the case of 2 new things (new Similarity, new query 
parser).
The 2 other cells each have just 1 "variable" changed... see the row/col 
labels to decipher.

There's no reason I can't also toss in a row for a 3rd query (say, body 
only), or a 4th (with phrases..) - this is just a step, which I hope 
doesn't confuse the issue.

The more general form is that for "n" indexes and "m" query parsers we 
can show a matrix of n cols by m rows...
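
To make the layout concrete, here is a minimal Java sketch of how such a grid could be enumerated. It only produces a label per cell and runs no searches; the class name ComparisonGridSketch, the method name grid, and the label strings are made up for illustration and are not part of the test page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: enumerates the cells of an "n cols by m rows" comparison grid.
final class ComparisonGridSketch {

    /** Returns one row per query parser; each row holds one cell label per index. */
    static List<List<String>> grid(List<String> indexes, List<String> parsers) {
        List<List<String>> rows = new ArrayList<>();
        for (String parser : parsers) {          // m rows
            List<String> row = new ArrayList<>();
            for (String index : indexes) {       // n columns
                row.add(index + " / " + parser);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<List<String>> rows = grid(
                Arrays.asList("default Similarity", "new Similarity"),
                Arrays.asList("default parser", "new parser"));
        for (List<String> row : rows) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

In a real page each cell would hold the hits from searching that index with a query built by that parser, instead of a label.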

> subsequently try different query formulations over multiple fields with 
> a fixed weighting algorithm.  Yes, ignoring titles when searching 
> wikipedia might not be the best approach, but the point is not to 
> over-optimize for wikipedia but rather to find algorithms that work well 
> with general text collections.  Removing titles makes the problem 
> harder, which should in turn make it easier to see deficiencies.
> Simpler yet, we ought to first try body-only with no proximity, just 
> AND, in order to select good tf/idf formulations.  Then we should add 
> auto-proximity searching into the mix, and finally add multiple fields. 
>  Does this make sense?
> MultiFieldQueryParser is known to be deficient.  A better 
> general-purpose multi-field query formulator might be like that used by 
> Nutch. It would translate a query "t1 t2" given fields f1 and f2 into 
> something like:
> +(f1:t1^b1 f2:t1^b2)
> +(f1:t2^b1 f2:t2^b2)
> f1:"t1 t2"~s1^b3
> f2:"t1 t2"~s2^b4
> Where b1 and b2 are boosts for f1 and f2, and b3 and b4 are boosts for 
> phrase matching in f1 and f2, and s1 and s2 are slop for f1 and f2. We'd 
> really only need to vary b1 and b3, and could use 1.0 for b2 and b4 and 
> infinity for s1 and s2.
> Do folks agree that this is a good general formulation?  If so, would 
> someone like to contribute a version of MultiFieldQueryParser that 
> implements this?  The API should probably be something like:
>   static Query parse(String queryString,
>                      String[] fields,
>                      float[] boolBoosts,
>                      float[] phraseBoosts,
>                      int[] slops);
> A simplified version might be:
>   static Query parse(String queryString,
>                      String[] fields,
>                      float[] boosts);
> This could use infinity for slops and assume boolBoosts[i] == 
> phraseBoosts[i].
> Doug
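
For concreteness, here is a minimal Java sketch of the formulation quoted above: every term is required in at least one field, and each field adds an optional sloppy phrase clause. It only builds a query string in QueryParser syntax, not real Query objects. The class name MultiFieldQuerySketch, the method name formulate, the pre-split terms array, and the use of Integer.MAX_VALUE to stand in for "infinity" slop are all assumptions made for illustration, not an existing Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: builds the multi-field query as a string in QueryParser syntax.
final class MultiFieldQuerySketch {

    /** Full form: separate boolean boosts, phrase boosts, and slops per field. */
    static String formulate(String[] terms, String[] fields,
                            float[] boolBoosts, float[] phraseBoosts, int[] slops) {
        List<String> clauses = new ArrayList<>();

        // One required clause per term: the term must match in at least one field.
        for (String term : terms) {
            List<String> alternatives = new ArrayList<>();
            for (int i = 0; i < fields.length; i++) {
                alternatives.add(fields[i] + ":" + term + "^" + boolBoosts[i]);
            }
            clauses.add("+(" + String.join(" ", alternatives) + ")");
        }

        // One optional sloppy phrase clause per field (only useful with 2+ terms).
        if (terms.length > 1) {
            String phrase = String.join(" ", terms);
            for (int i = 0; i < fields.length; i++) {
                clauses.add(fields[i] + ":\"" + phrase + "\"~" + slops[i]
                        + "^" + phraseBoosts[i]);
            }
        }
        return String.join(" ", clauses);
    }

    /** Simplified form: the largest int as slop (assumption), phrase boosts equal to boolean boosts. */
    static String formulate(String[] terms, String[] fields, float[] boosts) {
        int[] slops = new int[fields.length];
        java.util.Arrays.fill(slops, Integer.MAX_VALUE);
        return formulate(terms, fields, boosts, boosts, slops);
    }

    public static void main(String[] args) {
        System.out.println(formulate(
                new String[] {"t1", "t2"},
                new String[] {"f1", "f2"},
                new float[] {2.0f, 1.0f},
                new float[] {3.0f, 1.0f},
                new int[] {10, 10}));
    }
}
```

A real implementation would have to run each term through the Analyzer and assemble BooleanQuery and PhraseQuery objects directly, since analysis can drop or rewrite terms before they reach the query.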