lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject SpanQuery and BoostingTermQuery oddities
Date Wed, 05 Aug 2009 14:02:39 GMT
A BoostingTermQuery (BTQ) is a SpanQuery.

If I run:
     IndexSearcher searcher = new IndexSearcher(dir, true);
     searcher.setSimilarity(payloadSimilarity); // set the similarity -- very important
     BoostingTermQuery btq = new BoostingTermQuery(new Term("body", "fox"));
     TopDocs topDocs = searcher.search(btq, 10);
     printResults(searcher, btq, topDocs);

I get, as expected, documents that contain "fox" with a payload boosted higher than those containing "fox" without a boost (see [1] for the full code).
Output is:
Doc: doc=0 score=4.2344446
Explain: 4.234444 = (MATCH) fieldWeight(body:fox in 0), product of:
   7.071068 = (MATCH) btq, product of:
     0.70710677 = tf(phraseFreq=0.5)
     10.0 = scorePayload(...)
   1.9162908 = idf(body: fox=3)
   0.3125 = fieldNorm(field=body, doc=0)

Doc: doc=2 score=4.2344446
Explain: 4.234444 = (MATCH) fieldWeight(body:fox in 2), product of:
   7.071068 = (MATCH) btq, product of:
     0.70710677 = tf(phraseFreq=0.5)
     10.0 = scorePayload(...)
   1.9162908 = idf(body: fox=3)
   0.3125 = fieldNorm(field=body, doc=2)

Doc: doc=1 score=0.42344445
Explain: 0.42344445 = (MATCH) fieldWeight(body:fox in 1), product of:
   0.70710677 = (MATCH) btq, product of:
     0.70710677 = tf(phraseFreq=0.5)
     1.0 = scorePayload(...)
   1.9162908 = idf(body: fox=3)
   0.3125 = fieldNorm(field=body, doc=1)
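The 10.0 and 1.0 scorePayload(...) factors above are just the floats written into the payloads at index time. As an aside, here is a Lucene-free sketch of the 4-byte encode/decode round trip I believe FloatEncoder and PayloadHelper.decodeFloat perform (the class and method names here are stand-ins, not the real API):

```java
// Hypothetical stand-ins for FloatEncoder.encode / PayloadHelper.decodeFloat:
// a float payload stored as 4 big-endian bytes (an assumption about the format).
public class FloatPayload {
    static byte[] encode(float f) {
        int bits = Float.floatToIntBits(f);
        return new byte[]{(byte) (bits >>> 24), (byte) (bits >>> 16),
                          (byte) (bits >>> 8), (byte) bits};
    }

    static float decode(byte[] bytes, int offset) {
        int bits = ((bytes[offset] & 0xFF) << 24)
                 | ((bytes[offset + 1] & 0xFF) << 16)
                 | ((bytes[offset + 2] & 0xFF) << 8)
                 |  (bytes[offset + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte[] payload = encode(10.0f);
        System.out.println(decode(payload, 0)); // prints 10.0
    }
}
```

Since the payload is always exactly 4 bytes under this scheme, the decoder can ignore a length argument, which is why PayloadSimilarity.scorePayload below does.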



However, if I then add the BTQ to a SpanNearQuery, I do not get the  
expected results:
     SpanQuery[] queries = new SpanQuery[2];
     queries[0] = new BoostingTermQuery(new Term("body", "red"));
     queries[1] = new BoostingTermQuery(new Term("body", "fox"));
     SpanNearQuery near = new SpanNearQuery(queries, 2, true);
     topDocs = searcher.search(near, 10);
     printResults(searcher, near, topDocs);

Output is:
Doc: doc=0 score=0.6914818
Explain: 0.6914818 = (MATCH) fieldWeight(body:spanNear([red, fox], 2, true) in 0), product of:
   0.57735026 = tf(phraseFreq=0.33333334)
   3.8325815 = idf(body: fox=3 red=3)
   0.3125 = fieldNorm(field=body, doc=0)

Doc: doc=1 score=0.6914818
Explain: 0.6914818 = (MATCH) fieldWeight(body:spanNear([red, fox], 2, true) in 1), product of:
   0.57735026 = tf(phraseFreq=0.33333334)
   3.8325815 = idf(body: fox=3 red=3)
   0.3125 = fieldNorm(field=body, doc=1)

Doc: doc=2 score=0.6914818
Explain: 0.6914818 = (MATCH) fieldWeight(body:spanNear([red, fox], 2, true) in 2), product of:
   0.57735026 = tf(phraseFreq=0.33333334)
   3.8325815 = idf(body: fox=3 red=3)
   0.3125 = fieldNorm(field=body, doc=2)


It seems the BTQ scoring method is never being called.  One of the main points of SpanNearQuery is that it can take in complex subclauses, presumably rolling up the scores from those subclauses.  Yet that appears not to be the case: it seems to rely on the matches produced by those subclauses, but not on their scoring.  Is my understanding correct?  If so, is that the intended behavior?
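To make the missing factor concrete, multiplying out the numbers from the two explanations above (taking the fieldWeight as just the product of the factors shown) reproduces both scores, and the spanNear product contains no scorePayload term at all:

```java
public class ExplainCheck {
    public static void main(String[] args) {
        // Factors copied from the BTQ explanation for doc 0.
        float tf = 0.70710677f, payload = 10.0f, idf = 1.9162908f, norm = 0.3125f;
        float btqScore = tf * payload * idf * norm;
        System.out.println(btqScore); // within float rounding of the reported 4.2344446

        // Factors copied from the spanNear explanation: no payload factor appears.
        float nearTf = 0.57735026f, nearIdf = 3.8325815f;
        float nearScore = nearTf * nearIdf * norm;
        System.out.println(nearScore); // within float rounding of the reported 0.6914818
    }
}
```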

I'm not a spans expert (SpanNearQuery always confuses me with NearSpansOrdered/Unordered), but it seems like SpanNearQuery (and likely the other span queries that take clauses) needs to create a QueryWeight object that is made up of the QueryWeight objects from its subclauses, right?
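The shape of what I'm picturing, as a toy sketch only (these interfaces are mine, not Lucene's QueryWeight API): the parent's weight keeps the weights of its subclauses and rolls their scores up into its own score, so a payload-aware subclause still influences the final result.

```java
import java.util.List;

// Toy stand-in for a per-query weight; Lucene's real QueryWeight does much more.
interface Weight {
    float score(int doc);
}

// A parent weight built from the weights of its subclauses.
class CompositeWeight implements Weight {
    private final List<Weight> subWeights;

    CompositeWeight(List<Weight> subWeights) {
        this.subWeights = subWeights;
    }

    @Override
    public float score(int doc) {
        // Roll up: combine each subclause's score (summing here, just to
        // illustrate) so no subclause's scoring contribution is dropped.
        float sum = 0f;
        for (Weight w : subWeights) {
            sum += w.score(doc);
        }
        return sum;
    }
}
```

In the real SpanNearQuery the rollup would also have to cooperate with the NearSpansOrdered/Unordered match enumeration; the sketch only shows where the subclause weights would plug in.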

Thoughts?

Thanks,
Grant

[1]
import junit.framework.TestCase;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.BoostingTermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;

import java.io.Reader;
import java.io.IOException;


/**
  * Demonstrates payload scoring with BoostingTermQuery and SpanNearQuery.
  **/
public class PayloadTest extends TestCase {
   Directory dir;
   public static String[] DOCS = {
           "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the lazy|2.0 brown|2.0 dogs|10.0",
           "The quick red fox jumped over the lazy brown dogs", //no boosts
           "The quick|2.0 red|2.0 fox|10.0 jumped|5.0 over the old|2.0 brown|2.0 box|10.0",
           "Mary|10.0 had a little|2.0 lamb|10.0 whose fleece|10.0 was|5.0 white|2.0 as snow|10.0",
           "Mary had a little lamb whose fleece was white as snow",
           "Mary|10.0 takes on Wolf|10.0 Restoration|10.0 project|10.0 despite ties|10.0 to sheep|10.0 farming|10.0",
           "Mary|10.0 who lives|5.0 on a farm|10.0 is|5.0 happy|2.0 that she|10.0 takes|5.0 a walk|10.0 every day|10.0",
           "Moby|10.0 Dick|10.0 is|5.0 a story|10.0 of a whale|10.0 and a man|10.0 obsessed|10.0",
           "The robber|10.0 wore|5.0 a black|2.0 fleece|10.0 jacket|10.0 and a baseball|10.0 cap|10.0",
           "The English|10.0 Springer|10.0 Spaniel|10.0 is|5.0 the best|2.0 of all dogs|10.0"
   };
   protected PayloadSimilarity payloadSimilarity;

   @Override
   protected void setUp() throws Exception {
     dir = new RAMDirectory();

     PayloadEncoder encoder = new FloatEncoder();
     IndexWriter writer = new IndexWriter(dir, new PayloadAnalyzer(encoder), true, IndexWriter.MaxFieldLength.UNLIMITED);
     payloadSimilarity = new PayloadSimilarity();
     writer.setSimilarity(payloadSimilarity);
     for (int i = 0; i < DOCS.length; i++) {
       Document doc = new Document();
       Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
       doc.add(id);
       //The analyzed body field carries the delimited payloads
       Field text = new Field("body", DOCS[i], Field.Store.NO, Field.Index.ANALYZED);
       doc.add(text);
       writer.addDocument(doc);
     }
     writer.close();
   }


   public void testPayloads() throws Exception {
     IndexSearcher searcher = new IndexSearcher(dir, true);
     searcher.setSimilarity(payloadSimilarity); // set the similarity -- very important
     BoostingTermQuery btq = new BoostingTermQuery(new Term("body", "fox"));
     TopDocs topDocs = searcher.search(btq, 10);
     printResults(searcher, btq, topDocs);
     System.out.println("-----------");
     System.out.println("Try out some Spans");
     SpanQuery[] queries = new SpanQuery[2];
     queries[0] = new BoostingTermQuery(new Term("body", "red"));
     queries[1] = new BoostingTermQuery(new Term("body", "fox"));
     SpanNearQuery near = new SpanNearQuery(queries, 2, true);
     topDocs = searcher.search(near, 10);
     printResults(searcher, near, topDocs);

   }

   private void printResults(IndexSearcher searcher, Query btq, TopDocs topDocs) throws IOException {
     for (int i = 0; i < topDocs.scoreDocs.length; i++) {
       ScoreDoc doc = topDocs.scoreDocs[i];
       System.out.println("Doc: " + doc.toString());
       System.out.println("Explain: " + searcher.explain(btq, doc.doc));
     }
   }

   class PayloadSimilarity extends DefaultSimilarity {
     @Override
     public float scorePayload(String fieldName, byte[] bytes, int offset, int length) {
       return PayloadHelper.decodeFloat(bytes, offset); // we can ignore length here, because we know it is encoded as 4 bytes
     }
   }

   class PayloadAnalyzer extends Analyzer {
     private PayloadEncoder encoder;

     PayloadAnalyzer(PayloadEncoder encoder) {
       this.encoder = encoder;
     }

     public TokenStream tokenStream(String fieldName, Reader reader) {
       TokenStream result = new WhitespaceTokenizer(reader);
       result = new LowerCaseFilter(result);
       result = new DelimitedPayloadTokenFilter(result, '|', encoder);
       return result;
     }
   }
}



