lucene-java-user mailing list archives

From Sujit Pal <sujit....@comcast.net>
Subject Payload Query and Document Boosts
Date Thu, 13 Oct 2011 01:16:41 GMT
Hi, 

Question about Payload Query and Document Boosts. We are using Lucene
3.2 and Payload queries, with our own PayloadSimilarity class which
overrides the scorePayload method like so:

{code}
  @Override
  public float scorePayload(int docId, String fieldName,
      int start, int end, byte[] payload, int offset, int length) {
    if (payload != null) {
      return PayloadHelper.decodeFloat(payload, offset);
    } else {
      return 1.0F;
    }
  }
{/code}

We are injecting payloads as ID$SCORE pairs using the
DelimitedPayloadTokenFilter, and life was good: when we ran a
PayloadTermQuery, the scores came back as our SCORE. The unit test
further down illustrates the calling pattern, which is this:

{code}
    PayloadTermQuery q = new PayloadTermQuery(
      new Term("imuids_p", "2790926"), new AveragePayloadFunction(), false);
{/code}

i.e., do not include the span score (the SCORE is calculated by offline
processing and we can't change that value).

Now we would like to boost each document differently at index time
(document.setBoost(boost), based on its content type), and we are running
into problems. It looks like the document boost is not applied to the
document score during search if includeSpanScore==false. When we set it
to true, we see a difference in scores (the score without document
boosts is scaled up by roughly the document boost set), but the score
without boosts is no longer the same as SCORE, i.e. it is now affected
by the span score.

My question is: is there some method in DefaultSimilarity that I can
override so that my score is my original SCORE * document boost? The
Similarity documentation does not provide any clues to my problem. I
tried modifying the computeNorm() method to return state.getBoost(), but
it looks like it's never called.

If not, the other option would be to bake the doc boost into the SCORE
value by multiplying them on their way into Lucene, so that
SCORE *= doc boost.
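
The fallback option above could look roughly like the following. This is only a sketch in plain Java with no Lucene dependency: the class and method names (BakedBoostPayload, bakeBoost, toToken) are hypothetical, and rounding the product to two decimals is an assumption made to match the shape of the existing SCORE values. The resulting string is what would be fed to the DelimitedPayloadTokenFilter in place of the raw ID$SCORE token.

```java
import java.util.Locale;

public class BakedBoostPayload {

  /** Multiplies the offline SCORE by the document boost (SCORE *= doc boost). */
  static float bakeBoost(float score, float docBoost) {
    return score * docBoost;
  }

  /**
   * Builds one "ID$SCORE" token with the boost already folded into SCORE.
   * Two-decimal rounding is an assumption, not something Lucene requires.
   */
  static String toToken(String id, float score, float docBoost) {
    return id + "$" + String.format(Locale.ROOT, "%.2f", bakeBoost(score, docBoost));
  }

  public static void main(String[] args) {
    // Same ID, SCORE and boost values as the two documents in the unit test.
    System.out.println(toToken("2790926", 53.18f, 1.2f)); // content type BK
    System.out.println(toToken("2790926", 52.18f, 1.5f)); // content type JL
  }
}
```

The downside of this approach is that changing a content type's boost means reindexing every document of that type, since the boost is no longer stored separately from SCORE.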

Here is my unit test which illustrates the issue:

{code}
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Test;

import com.healthline.query.kb.ConceptAnalyzer;
import com.healthline.solr.HlSolrConstants;
import com.healthline.solr.search.PayloadSimilarity;
import com.healthline.util.Config;

public class DocBoostTest {

  private class PayloadAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream tokens = new WhitespaceTokenizer(
        HlSolrConstants.CURRENT_VERSION, reader);
      tokens = new DelimitedPayloadTokenFilter(tokens, '$', new FloatEncoder());
      return tokens;
    }
  };

  private Analyzer getAnalyzer() {
    Map<String,Analyzer> pfas = new HashMap<String,Analyzer>();
    pfas.put("imuids_p", new PayloadAnalyzer());
    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
      new ConceptAnalyzer(), pfas);
    return analyzer;
  }
  
  private IndexSearcher loadTestData(boolean setBoosts) throws Exception {
    RAMDirectory ramdir = new RAMDirectory();
    IndexWriterConfig iwconf = new IndexWriterConfig(
      HlSolrConstants.CURRENT_VERSION, getAnalyzer());
    iwconf.setOpenMode(OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(ramdir, iwconf);
    Document doc1 = new Document();
    doc1.add(new Field("itemtitle",
      "Cancer and the Nervous System PARANEOPLASTIC DISORDERS",
      Store.YES, Index.ANALYZED));
    doc1.add(new Field("imuids_p", "2790917$52.01 2790926$53.18",
      Store.YES, Index.ANALYZED));
    doc1.add(new Field("contenttype", "BK", Store.YES, Index.NOT_ANALYZED));
    if (setBoosts) doc1.setBoost(1.2F);
    writer.addDocument(doc1);
    Document doc2 = new Document();
    doc2.add(new Field("itemtitle",
      "Esophagogastric cancer: Targeted agents",
      Store.YES, Index.ANALYZED));
    doc2.add(new Field("imuids_p", "2790926$52.18 2790981$5.19",
      Store.YES, Index.ANALYZED));
    doc2.add(new Field("contenttype", "JL", Store.YES, Index.NOT_ANALYZED));
    if (setBoosts) doc2.setBoost(1.5F);
    writer.addDocument(doc2);
    writer.commit();
    writer.close();
    return new IndexSearcher(ramdir);
  }
  
  @Test
  public void testConceptScoringWithoutBoost() throws Exception {
    Config.setConfigDir("/prod/web/config");
    IndexSearcher searcher = loadTestData(false);
    searcher.setSimilarity(new PayloadSimilarity());
    PayloadTermQuery q = new PayloadTermQuery(
      new Term("imuids_p", "2790926"), new AveragePayloadFunction(), false);
    ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
    System.out.println("Concept result without boosting");
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      String contentType = doc.get("contenttype");
      String title = doc.get("itemtitle");
      System.out.println(hits[i].doc + ": " + title + "/" + contentType +
        " (" + hits[i].score + ")");
    }
    searcher.close();
  }
  
  @Test
  public void testConceptScoringWithContentTypeBoost() throws Exception {
    Config.setConfigDir("/prod/web/config");
    IndexSearcher searcher = loadTestData(true);
    searcher.setSimilarity(new PayloadSimilarity());
    PayloadTermQuery q = new PayloadTermQuery(
      new Term("imuids_p", "2790926"), new AveragePayloadFunction(), false);
    ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
    System.out.println("Concept result with boosting");
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      String contentType = doc.get("contenttype");
      String title = doc.get("itemtitle");
      System.out.println(hits[i].doc + ": " + title + "/" + contentType +
        " (" + hits[i].score + ")");
    }
    searcher.close();
  }
  
  @Test
  public void testFulltextScoringWithoutBoost() throws Exception {
    Config.setConfigDir("/prod/web/config");
    IndexSearcher searcher = loadTestData(false);
    QueryParser parser = new QueryParser(
      HlSolrConstants.CURRENT_VERSION, "itemtitle", getAnalyzer());
    Query q = parser.parse("cancer");
    ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
    System.out.println("Fulltext result without boosting");
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      String contentType = doc.get("contenttype");
      String title = doc.get("itemtitle");
      System.out.println(hits[i].doc + ": " + title + "/" + contentType +
        " (" + hits[i].score + ")");
    }
    searcher.close();
  }
  
  @Test
  public void testFulltextScoringWithContentTypeBoost() throws Exception {
    Config.setConfigDir("/prod/web/config");
    IndexSearcher searcher = loadTestData(true);
    QueryParser parser = new QueryParser(
      HlSolrConstants.CURRENT_VERSION, "itemtitle", getAnalyzer());
    Query q = parser.parse("cancer");
    ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
    System.out.println("Fulltext result with boosting");
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      String contentType = doc.get("contenttype");
      String title = doc.get("itemtitle");
      System.out.println(hits[i].doc + ": " + title + "/" + contentType +
        " (" + hits[i].score + ")");
    }
    searcher.close();
  }
}
{/code}

With includeSpanScore==false, I get the following results from this
unit test. The scores are the same as what I put in, but the document
boost has no effect.

{code}
    [junit] ------------- Standard Output ---------------
    [junit] Concept result without boosting
    [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (53.18)
    [junit] 1: Esophagogastric cancer: Targeted agents/JL (52.18)
    [junit] Concept result with boosting
    [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (53.18)
    [junit] 1: Esophagogastric cancer: Targeted agents/JL (52.18)
    [junit] Fulltext result without boosting
    [junit] 1: Esophagogastric cancer: Targeted agents/JL (0.2972674)
    [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.26010898)
    [junit] Fulltext result with boosting
    [junit] 1: Esophagogastric cancer: Targeted agents/JL (0.4459011)
    [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.2972674)
    [junit] ------------- ---------------- ---------------
{/code}

and with includeSpanScore==true, I get the following results. This
time, the doc boosts do affect the payload query scores, but the
original scores (before boosting) are different from the SCORE values I
put in.

{code}
    [junit] Concept result without boosting
    [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (13.973032)
    [junit] 1: Esophagogastric cancer: Targeted agents/JL (13.710282)
    [junit] Concept result with boosting
    [junit] 1: Esophagogastric cancer: Targeted agents/JL (21.936451)
    [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (16.767637)
    [junit] Fulltext result without boosting
    [junit] 1: Esophagogastric cancer: Targeted agents/JL (0.2972674)
    [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.26010898)
    [junit] Fulltext result with boosting
    [junit] 1: Esophagogastric cancer: Targeted agents/JL (0.4459011)
    [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.2972674)
    [junit] ------------- ---------------- ---------------
{/code}
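
As a rough sanity check on the two result listings, the ratio of boosted to unboosted score can be computed per document and compared with the boosts set in the test (1.2 for BK, 1.5 for JL). The class and method names below (BoostRatioCheck, effectiveBoost) are hypothetical, and the score values are copied by hand from the output above.

```java
public class BoostRatioCheck {

  /** Ratio of the boosted score to the unboosted score for one document. */
  static double effectiveBoost(double boosted, double unboosted) {
    return boosted / unboosted;
  }

  public static void main(String[] args) {
    // Concept (payload) query with includeSpanScore==true.
    System.out.println("BK concept:  " + effectiveBoost(16.767637, 13.973032));
    System.out.println("JL concept:  " + effectiveBoost(21.936451, 13.710282));
    // Fulltext query on itemtitle (identical in both runs).
    System.out.println("BK fulltext: " + effectiveBoost(0.2972674, 0.26010898));
    System.out.println("JL fulltext: " + effectiveBoost(0.4459011, 0.2972674));
  }
}
```

The ratios work out to roughly 1.2 and 1.6 for the concept query and roughly 1.14 and 1.5 for the fulltext query, so the effective boost is close to, but not always exactly, the value passed to setBoost(). One possible explanation, not verified here, is that Lucene 3.x stores the norm (which includes the document boost) in a single lossy byte.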

TIA for any help you can provide.

-sujit


