lucene-java-user mailing list archives

From Sujit Pal <sujit....@comcast.net>
Subject Re: Payload Query and Document Boosts
Date Thu, 13 Oct 2011 22:52:03 GMT
Just wanted to close the loop with some extra information I found while
looking into this issue further. From what I understand about norm(t,d)
and computeNorm(), what I want isn't exactly possible, but I can get
something close.

In any case, it turns out we want a more baroque boosting mechanism (for
the ID$SCORE queries) than document boosts will allow, so for the moment
we are going to keep includeSpanScore==false and multiply the boost into
the SCORE values as the data goes into Lucene.

Looking at the explanation (one copy-pasted below), the difference
between the query against an unboosted doc and the query against a
boosted doc is in the fieldNorm value (0.625 unboosted vs 0.75 boosted;
the ratio of 1.2 is correct, since my doc was boosted with 1.2F).

{code}
with document boosting
    [junit] 16.767637 = (MATCH) weight(imuids_p:2790926 in 0), product of:
    [junit]   0.99999994 = queryWeight(imuids_p:2790926), product of:
    [junit]     0.5945349 = idf(imuids_p:  2790926=2)
    [junit]     1.681987 = queryNorm
    [junit]   16.76764 = (MATCH) fieldWeight(imuids_p:2790926 in 0), product of:
    [junit]     37.60394 = (MATCH) btq, product of:
    [junit]       0.70710677 = tf(phraseFreq=0.5)
    [junit]       53.18 = scorePayload(...)
    [junit]     0.5945349 = idf(imuids_p:  2790926=2)
    [junit]     0.75 = fieldNorm(field=imuids_p, doc=0)
{/code}
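
The numbers in the explanation can be cross-checked by hand. The sketch below is a standalone arithmetic check that uses no Lucene classes; the class and method names are made up for illustration, and the formula is simply the nesting shown in the explanation (weight = queryWeight * fieldWeight). The 0.625 fieldNorm for the unboosted case is the value mentioned above.

```java
// Standalone arithmetic check of the explanation; ExplainArithmetic is a
// hypothetical helper, not part of Lucene.
final class ExplainArithmetic {

    /** fieldWeight = tf * payload * idf * fieldNorm, as nested in the explanation. */
    static double fieldWeight(double tf, double payload, double idf, double fieldNorm) {
        return tf * payload * idf * fieldNorm;
    }

    /** queryWeight = idf * queryNorm. */
    static double queryWeight(double idf, double queryNorm) {
        return idf * queryNorm;
    }

    public static void main(String[] args) {
        double tf = Math.sqrt(0.5);   // tf(phraseFreq=0.5)
        double payload = 53.18;       // scorePayload(...)
        double idf = 0.5945349;
        double queryNorm = 1.681987;

        double qw = queryWeight(idf, queryNorm);
        double boosted = qw * fieldWeight(tf, payload, idf, 0.75);
        double unboosted = qw * fieldWeight(tf, payload, idf, 0.625);

        System.out.println("boosted   = " + boosted);
        System.out.println("unboosted = " + unboosted);
        System.out.println("ratio     = " + (boosted / unboosted));
    }
}
```

The two products come out at roughly 16.77 and 13.97, which agrees with the 16.767637 and 13.973032 reported by the unit test for this document, and their ratio is the 1.2 boost.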

I basically overrode tf(), idf() and queryNorm() to all return 1 and
modified computeNorm() to return FieldInvertState.getBoost(). The final
result is scores which are smaller than the SCORE values in the index,
but across results they are comparable (ie, the scores differ by a
factor of 1.2, because the fieldNorm differs by a factor of 1.2).

From what I've read so far, it appears that we cannot extract document
boost values on their own. The closest is FieldInvertState.getBoost(),
which, as far as I can tell, combines the document boost with all field
boosts; the norm that ends up in the index also includes some
normalization by field length. So any more modifications appear to be
out of the question - if you know different please let me know.
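
There is one more reason the boost cannot be read back exactly: the norm is stored with very little precision. The sketch below assumes that Lucene 3.x squeezes each norm into a single byte using the scheme of SmallFloat.floatToByte315, and that the default norm is boost / sqrt(number of terms in the field). Both methods are re-implemented from memory here so the example runs without Lucene on the classpath; treat them as an approximation of the library code, not a copy of it. The class name is hypothetical.

```java
// Standalone model of one-byte norm storage; NormQuantization is a
// hypothetical name and the encode/decode pair is an assumed re-implementation.
final class NormQuantization {

    /** Assumed equivalent of SmallFloat.floatToByte315: truncates to a coarse float. */
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;   // zero/negative or underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                    // overflow: largest value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    /** Assumed equivalent of SmallFloat.byte315ToFloat. */
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /** Norm as it would be read back at search time, under the assumptions above. */
    static float storedNorm(float boost, int numTerms) {
        float raw = boost * (float) (1.0 / Math.sqrt(numTerms));
        return byte315ToFloat(floatToByte315(raw));
    }

    public static void main(String[] args) {
        // imuids_p holds two tokens per document in the unit test.
        System.out.println("no boost: " + storedNorm(1.0f, 2));
        System.out.println("1.2F    : " + storedNorm(1.2f, 2));
        System.out.println("1.5F    : " + storedNorm(1.5f, 2));
    }
}
```

Under these assumptions the two-token field gets a stored norm of 0.625 without a boost and 0.75 with the 1.2F boost, matching the explanation. The 1.5F boost is read back as 1.0, a factor of 1.6 rather than 1.5, which is consistent with the unit test output for the second document (21.936451 / 13.710282). So even where the boost does reach the score, it only does so approximately.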

In any case, the question is a bit academic at this point; we are
planning on multiplying the "docboost" into the SCORE values as they are
added into the index.
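
A minimal sketch of that pre-multiplication, assuming the field value is a whitespace-separated list of ID$SCORE tokens like the ones in the unit test. PayloadBoostBaker and bakeBoost are hypothetical names, not part of Lucene or of our code base; the method only rewrites the string that is later handed to the DelimitedPayloadTokenFilter.

```java
// Hypothetical helper: folds a per-document boost into ID$SCORE tokens
// before the field value is indexed.
final class PayloadBoostBaker {

    /** Multiplies the SCORE part of every ID$SCORE token by docBoost. */
    static String bakeBoost(String fieldValue, float docBoost) {
        StringBuilder out = new StringBuilder();
        for (String token : fieldValue.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            int sep = token.lastIndexOf('$');
            if (sep < 0) {
                // No payload delimiter: leave the token untouched.
                out.append(token);
            } else {
                float score = Float.parseFloat(token.substring(sep + 1));
                // Keep "ID$" and replace the score with score * boost.
                out.append(token, 0, sep + 1).append(score * docBoost);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(bakeBoost("2790917$52.01 2790926$53.18", 1.2F));
    }
}
```

With the boost baked into the payload, includeSpanScore can stay false and the returned score is simply SCORE * boost, without passing through the norm at all.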

-sujit

On Wed, 2011-10-12 at 18:16 -0700, Sujit Pal wrote:
> Hi, 
> 
> Question about Payload Query and Document Boosts. We are using Lucene
> 3.2 and Payload queries, with our own PayloadSimilarity class which
> overrides the scorePayload method like so:
> 
> {code}
>   @Override
>   public float scorePayload(int docId, String fieldName,
>       int start, int end, byte[] payload, int offset, int length) {
>     if (payload != null) {
>       return PayloadHelper.decodeFloat(payload, offset);
>     } else {
>       return 1.0F;
>     }
>   }
> {/code}
> 
> We are injecting payloads as ID$SCORE pairs using the
> DelimitedPayloadTokenFilter and life was good - when we ran
> PayloadTermQuery() the scores came back as our SCORE. I have included
> code below that illustrates the calling pattern; it's this:
> 
> {code}
>     PayloadTermQuery q = new PayloadTermQuery(
>       new Term("imuids_p", "2790926"), new AveragePayloadFunction(), false);
> {/code}
> 
> ie, do not include the span score (the SCORE is calculated as a result
> of offline processing and we can't change that value).
> 
> Now we would like to boost each document differently (at index time,
> with document.setBoost(boost), based on its content type), and we are
> running into problems. It looks like the document boost is not applied
> to the document score during search if includeSpanScore==false. When we
> set it to true, we see a difference in scores (the original score
> without document boosts is multiplied by the document boost set), but
> the original scores without boost are not the same as SCORE, ie they
> are now affected by the span score.
> 
> My question is - is there some method in DefaultSimilarity that I can
> override so that my score is my original SCORE * document boost? The
> Similarity documentation does not provide any clues to my problem - I
> tried modifying the computeNorm() method to return state.getBoost() but
> it looks like it's never called.
> 
> If not, the other option would be to bake the doc boost into the SCORE
> value, by multiplying them on their way into Lucene, so that now
> SCORE *= doc boost.
> 
> Here is my unit test which illustrates the issue:
> 
> {code}
> import java.io.Reader;
> import java.util.HashMap;
> import java.util.Map;
> 
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.WhitespaceTokenizer;
> import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
> import org.apache.lucene.analysis.payloads.FloatEncoder;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Field.Index;
> import org.apache.lucene.document.Field.Store;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.IndexWriterConfig.OpenMode;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.payloads.AveragePayloadFunction;
> import org.apache.lucene.search.payloads.PayloadTermQuery;
> import org.apache.lucene.store.RAMDirectory;
> import org.junit.Test;
> 
> import com.healthline.query.kb.ConceptAnalyzer;
> import com.healthline.solr.HlSolrConstants;
> import com.healthline.solr.search.PayloadSimilarity;
> import com.healthline.util.Config;
> 
> public class DocBoostTest {
> 
>   private class PayloadAnalyzer extends Analyzer {
>     @Override
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>       TokenStream tokens = new WhitespaceTokenizer(
>         HlSolrConstants.CURRENT_VERSION, reader);
>       tokens = new DelimitedPayloadTokenFilter(
>         tokens, '$', new FloatEncoder());
>       return tokens;
>     }
>   };
> 
>   private Analyzer getAnalyzer() {
>     Map<String,Analyzer> pfas = new HashMap<String,Analyzer>();
>     pfas.put("imuids_p", new PayloadAnalyzer());
>     PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
>       new ConceptAnalyzer(), pfas);
>     return analyzer;
>   }
>   
>   private IndexSearcher loadTestData(boolean setBoosts) throws Exception {
>     RAMDirectory ramdir = new RAMDirectory();
>     IndexWriterConfig iwconf = new IndexWriterConfig(
>       HlSolrConstants.CURRENT_VERSION, getAnalyzer());
>     iwconf.setOpenMode(OpenMode.CREATE);
>     IndexWriter writer = new IndexWriter(ramdir, iwconf);
>     Document doc1 = new Document();
>     doc1.add(new Field("itemtitle",
>       "Cancer and the Nervous System PARANEOPLASTIC DISORDERS",
>       Store.YES, Index.ANALYZED));
>     doc1.add(new Field("imuids_p", "2790917$52.01 2790926$53.18",
>       Store.YES, Index.ANALYZED));
>     doc1.add(new Field("contenttype", "BK",
>       Store.YES, Index.NOT_ANALYZED));
>     if (setBoosts) doc1.setBoost(1.2F);
>     writer.addDocument(doc1);
>     Document doc2 = new Document();
>     doc2.add(new Field("itemtitle",
>       "Esophagogastric cancer: Targeted agents",
>       Store.YES, Index.ANALYZED));
>     doc2.add(new Field("imuids_p", "2790926$52.18 2790981$5.19",
>       Store.YES, Index.ANALYZED));
>     doc2.add(new Field("contenttype", "JL",
>       Store.YES, Index.NOT_ANALYZED));
>     if (setBoosts) doc2.setBoost(1.5F);
>     writer.addDocument(doc2);
>     writer.commit();
>     writer.close();
>     return new IndexSearcher(ramdir);
>   }
>   
>   @Test
>   public void testConceptScoringWithoutBoost() throws Exception {
>     Config.setConfigDir("/prod/web/config");
>     IndexSearcher searcher = loadTestData(false);
>     searcher.setSimilarity(new PayloadSimilarity());
>     PayloadTermQuery q = new PayloadTermQuery(
>       new Term("imuids_p", "2790926"), new AveragePayloadFunction(), false);
>     ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
>     System.out.println("Concept result without boosting");
>     for (int i = 0; i < hits.length; i++) {
>       Document doc = searcher.doc(hits[i].doc);
>       String contentType = doc.get("contenttype");
>       String title = doc.get("itemtitle");
>       System.out.println(hits[i].doc + ": " + title + "/" + contentType +
>         " (" + hits[i].score + ")");
>     }
>     searcher.close();
>   }
>   
>   @Test
>   public void testConceptScoringWithContentTypeBoost() throws Exception {
>     Config.setConfigDir("/prod/web/config");
>     IndexSearcher searcher = loadTestData(true);
>     searcher.setSimilarity(new PayloadSimilarity());
>     PayloadTermQuery q = new PayloadTermQuery(
>       new Term("imuids_p", "2790926"), new AveragePayloadFunction(), false);
>     ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
>     System.out.println("Concept result with boosting");
>     for (int i = 0; i < hits.length; i++) {
>       Document doc = searcher.doc(hits[i].doc);
>       String contentType = doc.get("contenttype");
>       String title = doc.get("itemtitle");
>       System.out.println(hits[i].doc + ": " + title + "/" + contentType +
>         " (" + hits[i].score + ")");
>     }
>     searcher.close();
>   }
>   
>   @Test
>   public void testFulltextScoringWithoutBoost() throws Exception {
>     Config.setConfigDir("/prod/web/config");
>     IndexSearcher searcher = loadTestData(false);
>     QueryParser parser = new QueryParser(
>       HlSolrConstants.CURRENT_VERSION, "itemtitle", getAnalyzer());
>     Query q = parser.parse("cancer");
>     ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
>     System.out.println("Fulltext result without boosting");
>     for (int i = 0; i < hits.length; i++) {
>       Document doc = searcher.doc(hits[i].doc);
>       String contentType = doc.get("contenttype");
>       String title = doc.get("itemtitle");
>       System.out.println(hits[i].doc + ": " + title + "/" + contentType +
>         " (" + hits[i].score + ")");
>     }
>     searcher.close();
>   }
>   
>   @Test
>   public void testFulltextScoringWithContentTypeBoost() throws Exception {
>     Config.setConfigDir("/prod/web/config");
>     IndexSearcher searcher = loadTestData(true);
>     QueryParser parser = new QueryParser(
>       HlSolrConstants.CURRENT_VERSION, "itemtitle", getAnalyzer());
>     Query q = parser.parse("cancer");
>     ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
>     System.out.println("Fulltext result with boosting");
>     for (int i = 0; i < hits.length; i++) {
>       Document doc = searcher.doc(hits[i].doc);
>       String contentType = doc.get("contenttype");
>       String title = doc.get("itemtitle");
>       System.out.println(hits[i].doc + ": " + title + "/" + contentType +
>         " (" + hits[i].score + ")");
>     }
>     searcher.close();
>   }
> }
> {/code}
> 
> With includeSpanScore==false, I get the following results from this
> unit test. The scores are the same as what I put in, but the document
> boost has no effect.
> 
> {code}
>     [junit] ------------- Standard Output ---------------
>     [junit] Concept result without boosting
>     [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (53.18)
>     [junit] 1: Esophagogastric cancer: Targeted agents/JL (52.18)
>     [junit] Concept result with boosting
>     [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (53.18)
>     [junit] 1: Esophagogastric cancer: Targeted agents/JL (52.18)
>     [junit] Fulltext result without boosting
>     [junit] 1: Esophagogastric cancer: Targeted agents/JL (0.2972674)
>     [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.26010898)
>     [junit] Fulltext result with boosting
>     [junit] 1: Esophagogastric cancer: Targeted agents/JL (0.4459011)
>     [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.2972674)
>     [junit] ------------- ---------------- ---------------
> {/code}
> 
> With includeSpanScore==true, I get the following results. This time,
> the doc boosts do affect the payload query scores, but the original
> scores (before boosting) are different from the SCORE values I put in.
> 
> {code}
>     [junit] Concept result without boosting
>     [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (13.973032)
>     [junit] 1: Esophagogastric cancer: Targeted agents/JL (13.710282)
>     [junit] Concept result with boosting
>     [junit] 1: Esophagogastric cancer: Targeted agents/JL (21.936451)
>     [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (16.767637)
>     [junit] Fulltext result without boosting
>     [junit] 1: Esophagogastric cancer: Targeted agents/JL (0.2972674)
>     [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.26010898)
>     [junit] Fulltext result with boosting
>     [junit] 1: Esophagogastric cancer: Targeted agents/JL (0.4459011)
>     [junit] 0: Cancer and the Nervous System PARANEOPLASTIC DISORDERS/BK (0.2972674)
>     [junit] ------------- ---------------- ---------------
> {/code}
> 
> TIA for any help you can provide.
> 
> -sujit
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 



