lucene-dev mailing list archives

From David Spencer <dave-lucene-...@tropo.com>
Subject Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Mon, 31 Jan 2005 23:25:46 GMT
Doug Cutting wrote:

> David Spencer wrote:
> 
>> I worked w/ Chuck to get up a test page that shows search results with 
>> 2 versions of Similarity side by side.
> 
> 
> David,
> 
> This looks great!  Thanks for doing this.
> 
> Is the default operator AND or OR?  It appears to be OR, but it should 
> probably be AND.  That's become the industry standard since QueryParser 
> was first written.  Also, any chance we can get explanations for hits?
> 
> It is difficult to decipher what's doing what.  I think we should 
> separately evaluate query formulation and boosting from changes to tf/idf.
> 
> We ought to first compare searching body only, ignoring titles, then 
> subsequently try different query formulations over multiple fields with 
> a fixed weighting algorithm.  Yes, ignoring titles when searching 
> wikipedia might not be the best approach, but the point is not to 
> over-optimize for wikipedia but rather to find algorithms that work well 
> with general text collections.  Removing titles makes the problem 
> harder, which should in turn make it easier to see deficiencies.
> 
> Simpler yet, we ought to first try body-only with no proximity, just 
> AND, in order to select good tf/idf formulations.  Then we should add 
> auto-proximity searching into the mix, and finally add multiple fields. 
>  Does this make sense?
> 
> MultiFieldQueryParser is known to be deficient.  A better 
> general-purpose multi-field query formulator might be like that used by 
> Nutch. It would translate a query "t1 t2" given fields f1 and f2 into 
> something like:
> 
> +(f1:t1^b1 f2:t1^b2)
> +(f1:t2^b1 f2:t2^b2)
> f1:"t1 t2"~s1^b3
> f2:"t1 t2"~s2^b4
> 
> Where b1 and b2 are boosts for f1 and f2, and b3 and b4 are boosts for 
> phrase matching in f1 and f2, and s1 and s2 are slop for f1 and f2. We'd 
> really only need to vary b1 and b3, and could use 1.0 for b2 and b4 and 
> infinity for s1 and s2.
> 
> Do folks agree that this is a good general formulation?  If so, would 
> someone like to contribute a version of MultiFieldQueryParser that 
> implements this?  The API should probably be something like:
> 
>   static Query parse(String queryString,
>                      String[] fields,
>                      float[] boolBoosts,
>                      float[] phraseBoosts,
>                      int[] slops);
> 
> A simplified version might be:
> 
>   static Query parse(String queryString,
>                      String[] fields,
>                      float[] boosts);


I think I've written the code (but no, the test URL we're playing with 
has not been updated yet).


[1] Test Driver:


// 1a: "AND" semantics		
q = formMegaQuery( "t1 t2",
         null,
         FIELDS,
         BOOL_BOOSTS,
         PH_BOOSTS,
         SLOPS,
         true);  // true -> AND

o.println( q.toString( "f2"));


// 1b: same as 1a but OR semantics
q = formMegaQuery( "t1 t2",
         null,
         FIELDS,
         BOOL_BOOSTS,
         PH_BOOSTS,
         SLOPS,
         false);

o.println( q.toString( "f2"));

// 1c: more terms
q = formMegaQuery( "t1 t2 t3 t4 t5",
         null,
         FIELDS,
         BOOL_BOOSTS,
         PH_BOOSTS,
         SLOPS,
         false);

o.println( q.toString( "f2"));


[2] Output

+(f1:t1^2.0 t1) +(f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5

(f1:t1^2.0 t1) (f1:t2^2.0 t2) f1:"t1 t2"~5^3.0 "t1 t2"~2^1.5

(f1:t1^2.0 t1) (f1:t2^2.0 t2) (f1:t3^2.0 t3) (f1:t4^2.0 t4) (f1:t5^2.0 t5) f1:"t1 t2 t3 t4 t5"~5^3.0 "t1 t2 t3 t4 t5"~2^1.5



[3] Code - more or less as per Doug's spec but I pass in an optional 
Analyzer for parsing the search string, and the last arg, 'mand', 
determines "AND semantics".

public static Query formMegaQuery( String srch,
                                    Analyzer a,
                                    String[] fields,
                                    float[] boolBoosts,
                                    float[] phraseBoosts,
                                    int[] slops,
                                    boolean mand)
{
     if ( a == null) a = new WhitespaceAnalyzer();
     BooleanQuery bq = new BooleanQuery();

     TokenStream ts = a.tokenStream( "contents", new StringReader( srch));
     org.apache.lucene.analysis.Token toke;
     try
     {
         TermQuery[] tt = new TermQuery[ fields.length];
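         // ordered set: add() returns false for a repeated word, which a
         // LinkedList never does, so the dup check below takes effect
         java.util.Set lis = new java.util.LinkedHashSet();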

         // [1] For every word make a clause so it matches some field
         // for every token in search string
         while ( (toke = ts.next()) != null)
         {
             String word = toke.termText();
             if ( ! lis.add( word)) continue; // ignore dup words

             BooleanQuery tmp = new BooleanQuery();
             for ( int i = 0; i < tt.length; i++)
             {
                 tt[ i] = new TermQuery( new Term( fields[ i], word));
                 tt[ i].setBoost( boolBoosts[ i]);
                 tmp.add( tt[ i], false, false);
             }
             // must match one if 'mand' is true (AND semantics)
             bq.add( tmp, mand,  false);
         }

         String[] ar = (String[]) lis.toArray( new String[ 0]);
         for ( int j = 0; j < fields.length; j++) // for every field
         {
             PhraseQuery pq = new PhraseQuery();
             for ( int i = 0; i < ar.length; i++)
                 pq.add( new Term( fields[ j], ar[ i]));
             pq.setSlop( slops[ j]);
             pq.setBoost( phraseBoosts[ j]);
             bq.add( pq, false, false); // make opt
         }
     }
     catch( IOException ioe)
     {
         // can't happen as we're using a string reader
     }
     return bq;
}
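
To see what each loop in the code above contributes without a Lucene 
index at hand, here is a rough, self-contained sketch that renders the 
same clause structure as plain text. It does not use the Lucene API: the 
names MegaQuerySketch and render are hypothetical, tokenization is plain 
whitespace splitting, and the field names, boosts and slops in main are 
inferred from the output in [2] (the 1.0 boost for f2 is an assumption 
based on the missing ^ suffix there). Unlike Query.toString(), it writes 
out every field name and every boost.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class MegaQuerySketch {

    /** Renders the clause structure as text, with all fields and boosts explicit. */
    public static String render(String srch, String[] fields, float[] boolBoosts,
                                float[] phraseBoosts, int[] slops, boolean mand) {
        // Whitespace tokenization; the ordered set drops repeated words.
        Set<String> words = new LinkedHashSet<>();
        for (String w : srch.trim().split("\\s+")) {
            if (!w.isEmpty()) words.add(w);
        }

        StringBuilder sb = new StringBuilder();

        // One group per word: the word may match in any of the fields.
        // With mand == true the whole group is required (AND semantics).
        for (String w : words) {
            if (sb.length() > 0) sb.append(' ');
            if (mand) sb.append('+');
            sb.append('(');
            for (int i = 0; i < fields.length; i++) {
                if (i > 0) sb.append(' ');
                sb.append(fields[i]).append(':').append(w)
                  .append('^').append(boolBoosts[i]);
            }
            sb.append(')');
        }

        // One optional sloppy phrase per field, built from all the words.
        String phrase = String.join(" ", words);
        for (int j = 0; j < fields.length; j++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(fields[j]).append(":\"").append(phrase).append("\"~")
              .append(slops[j]).append('^').append(phraseBoosts[j]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values inferred from the output in [2]; treat them as assumptions.
        String[] fields = {"f1", "f2"};
        float[] boolBoosts = {2.0f, 1.0f};
        float[] phraseBoosts = {3.0f, 1.5f};
        int[] slops = {5, 2};
        System.out.println(render("t1 t2", fields, boolBoosts, phraseBoosts, slops, true));
    }
}
```

With those values the sketch prints
+(f1:t1^2.0 f2:t1^1.0) +(f1:t2^2.0 f2:t2^1.0) f1:"t1 t2"~5^3.0 f2:"t1 t2"~2^1.5
which has the same shape as the first line of [2], only with the f2 
field name and the 1.0 boosts spelled out.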

> 
> This could use infinity for slops and assume boolBoosts[i] == 
> phraseBoosts[i].
> 
> Doug
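
The simplified three-argument form described in the quoted text can be 
pictured as filling in the two missing arrays and handing off to the full 
version. A minimal sketch of just that expansion step follows. The class 
and method names are hypothetical, and using Integer.MAX_VALUE to stand 
in for "infinity" is an assumption; whether a real PhraseQuery copes well 
with a slop that large is not something this sketch checks.

```java
import java.util.Arrays;

public class SimplifiedArgs {

    /** Stand-in for an "infinite" slop (assumption: the largest int). */
    public static final int UNBOUNDED_SLOP = Integer.MAX_VALUE;

    /** One unbounded slop per field. */
    public static int[] defaultSlops(int nFields) {
        int[] slops = new int[nFields];
        Arrays.fill(slops, UNBOUNDED_SLOP);
        return slops;
    }

    /** Phrase boosts equal to the boolean boosts, as a separate copy. */
    public static float[] defaultPhraseBoosts(float[] boosts) {
        return boosts.clone();
    }

    public static void main(String[] args) {
        float[] boosts = {2.0f, 1.0f};
        System.out.println(Arrays.toString(defaultSlops(boosts.length)));
        System.out.println(Arrays.toString(defaultPhraseBoosts(boosts)));
    }
}
```

The two arrays would then be passed along with the original fields and 
boosts to the five-array version.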
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 

