lucene-dev mailing list archives

From "sebastian L. (Issue Comment Edited) (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments
Date Sat, 01 Oct 2011 12:57:34 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118784#comment-13118784 ]

sebastian L. edited comment on LUCENE-3440 at 10/1/11 12:56 PM:
----------------------------------------------------------------

Here's the patch for 4.0. I forgot to update my Solr-plugin-lib to 4.0-SNAPSHOT.  

Another patch, another idea! :)

Some thoughts: 
- With the last patch, the sum-of-distinct-weights is always calculated, even if ScoreOrderFragmentsBuilder is used.
- Regardless of any further calculation, FieldTermStack always retrieves the document frequency of each term from the IndexReader.
- Solr developers cannot implement a FragmentsBuilder plugin with their own custom scoring for fragments, because the weighting formula is "hard-coded" in WeightedFragInfo. BTW, that's the reason I started to work on this patch in the first place.

Possible Solution:

1. Collect and pass all needed information to the BaseFragmentsBuilder implementation
- Introduction of TermInfo.fieldName
- Introduction of WeightedFragInfo.phraseInfos
- Passing an instance of IndexReader as an argument to BaseFragmentsBuilder.getWeightedFragInfoList() in order to get the needed statistical data from the index

2. Move the calculation of sum-of-boosts to ScoreOrderFragmentsBuilder.calculateScore()

{code}    
  /**
   * Compute WeightedFragInfo.score based on query-boosts
   * @throws IOException 
   */
  public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos,
      IndexReader reader ) throws IOException {
    for( WeightedFragInfo wfi : weightedFragInfos ){
      for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
        wfi.score += wpi.boost;
      }
    }
    return weightedFragInfos;
  }
{code}

3. Calculate the sum-of-distinct-weights in WeightOrderFragmentsBuilder.calculateScore()

- In this patch WeightOrderFragmentsBuilder is a subclass of ScoreOrderFragmentsBuilder.
- But I think introducing an abstract class OrderedFragmentsBuilder as the superclass of ScoreOrderFragmentsBuilder and WeightOrderFragmentsBuilder would be a better strategy.
- Another idea would be to move calculateScore() into BaseFragmentsBuilder and make it abstract.
- The _sum-of-distinct-weights_ approach is the same as presented in the last patch.

{code}
  /**
   * Compute WeightedFragInfo.score based on IDF-weighted terms
   * @throws IOException 
   */
  @Override
  public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos,
      IndexReader reader ) throws IOException {

    // cache of IDF weights per term text, shared across all fragments
    Map<String, Float> lookup = new HashMap<String, Float>();
    // terms already counted for the current fragment
    HashSet<String> distinctTerms = new HashSet<String>();

    // numDocs() already excludes deleted documents
    int numDocs = reader.numDocs();

    int docFreq;
    int length;
    float boost;
    float weight;

    for( WeightedFragInfo wfi : weightedFragInfos ){
      distinctTerms.clear();
      length = 0;
      boost = 0;
      for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
        for( TermInfo ti : wpi.termInfos ) {
          length++;
          // each distinct term contributes only once per fragment
          if( !distinctTerms.add( ti.text ) )
            continue;
          if ( lookup.containsKey( ti.text ) )
            weight = lookup.get( ti.text ).floatValue();
          else {
            docFreq = reader.docFreq( new Term( ti.fieldName, ti.text ) );
            weight = ( float ) ( Math.log( numDocs / ( double ) ( docFreq + 1 ) ) + 1.0 );
            lookup.put( ti.text, Float.valueOf( weight ) );
          }
          boost += Math.pow( weight, 2 ) * wpi.boost;
        }
      }
      wfi.score = ( float ) ( boost * length * ( 1 / Math.sqrt( length ) ) );
    }
    }
    
    return weightedFragInfos;
  }
{code}
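
To make the arithmetic concrete, here is a minimal stand-alone sketch of the same weighting on plain Java types, without Lucene. The class `IdfScoreSketch`, its two helper methods and the sample document frequencies are made up for illustration. Only the formula is taken from the code above, and the sketch simplifies it by using a single boost per fragment instead of one boost per phrase.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-alone illustration; not part of the patch or of Lucene. */
final class IdfScoreSketch {

  /** IDF weight as in the patch: log( numDocs / ( docFreq + 1 ) ) + 1. */
  static float idfWeight( int numDocs, int docFreq ) {
    return ( float ) ( Math.log( numDocs / ( double ) ( docFreq + 1 ) ) + 1.0 );
  }

  /**
   * Scores one fragment: every term occurrence adds to the length, but only
   * the first occurrence of a term adds weight^2 * boost to the sum.
   * Assumption: one boost for the whole fragment (the patch uses one per phrase).
   */
  static float fragmentScore( List<String> terms, Map<String, Integer> docFreqs,
      int numDocs, float boost ) {
    Set<String> distinctTerms = new HashSet<String>();
    int length = 0;
    float sum = 0;
    for( String term : terms ){
      length++;
      if( !distinctTerms.add( term ) )
        continue;
      Integer df = docFreqs.get( term );
      float weight = idfWeight( numDocs, df == null ? 0 : df );
      sum += weight * weight * boost;
    }
    // sum * length * ( 1 / sqrt( length ) ) simplifies to sum * sqrt( length )
    return length == 0 ? 0f : ( float ) ( sum * Math.sqrt( length ) );
  }

  public static void main( String[] args ) {
    Map<String, Integer> docFreqs = new HashMap<String, Integer>();
    docFreqs.put( "the", 999 );    // assumed: very common term
    docFreqs.put( "lucene", 9 );   // assumed: rare term
    int numDocs = 1000;

    float common = fragmentScore( Arrays.asList( "the", "the", "the" ), docFreqs, numDocs, 1f );
    float rare = fragmentScore( Arrays.asList( "the", "lucene" ), docFreqs, numDocs, 1f );
    System.out.println( "three common terms:   " + common );
    System.out.println( "common and rare term: " + rare );
  }
}
```

With these assumed numbers the shorter fragment that contains the rare term scores higher than the longer fragment made up of one repeated common term, which is the ranking behaviour the patch is after.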

With this approach programmers can easily implement their own fragment weighting, simply by overriding calculateScore().
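
A rough sketch of what such an override could look like, using simplified stand-in classes instead of the real Lucene types. `PhraseInfo`, `FragInfo`, `BoostOrderBuilder` and `PhraseLengthBuilder` are hypothetical names, the `termCount` field and the phrase-length scoring rule are invented for the example, and the IndexReader argument is left out to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-ins for the classes named in the patch; not Lucene code. */
final class CustomScoreSketch {

  /** Stand-in for WeightedPhraseInfo: a query boost and the number of matched terms. */
  static final class PhraseInfo {
    final float boost;
    final int termCount;

    PhraseInfo( float boost, int termCount ) {
      this.boost = boost;
      this.termCount = termCount;
    }
  }

  /** Stand-in for WeightedFragInfo. */
  static final class FragInfo {
    final List<PhraseInfo> phraseInfos;
    float score;

    FragInfo( PhraseInfo... phraseInfos ) {
      this.phraseInfos = new ArrayList<PhraseInfo>( Arrays.asList( phraseInfos ) );
    }
  }

  /** Default scoring: the sum of query boosts. */
  static class BoostOrderBuilder {
    List<FragInfo> calculateScore( List<FragInfo> fragInfos ) {
      for( FragInfo fi : fragInfos ){
        fi.score = 0;
        for( PhraseInfo pi : fi.phraseInfos )
          fi.score += pi.boost;
      }
      return fragInfos;
    }
  }

  /** Custom scoring (invented rule): scale each boost by the phrase length. */
  static class PhraseLengthBuilder extends BoostOrderBuilder {
    @Override
    List<FragInfo> calculateScore( List<FragInfo> fragInfos ) {
      for( FragInfo fi : fragInfos ){
        fi.score = 0;
        for( PhraseInfo pi : fi.phraseInfos )
          fi.score += pi.boost * pi.termCount;
      }
      return fragInfos;
    }
  }

  public static void main( String[] args ) {
    List<FragInfo> frags = Arrays.asList(
        new FragInfo( new PhraseInfo( 1f, 1 ), new PhraseInfo( 1f, 1 ) ),
        new FragInfo( new PhraseInfo( 1f, 3 ) ) );

    new BoostOrderBuilder().calculateScore( frags );
    System.out.println( "boost order:  " + frags.get( 0 ).score + " / " + frags.get( 1 ).score );

    new PhraseLengthBuilder().calculateScore( frags );
    System.out.println( "phrase order: " + frags.get( 0 ).score + " / " + frags.get( 1 ).score );
  }
}
```

The subclass only replaces the scoring step; collecting and ordering the fragments stays in the base class.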

I think the major drawback of this idea is that the FragmentsBuilder must traverse the whole list of WeightedFragInfo once again. Since we have tomes with more than 3000 pages of OCR, this _could_ be a problem, but I can't confirm that yet. One way to avoid this would be to make FieldFragList "pluggable" through an interface "FragList", so that the FragmentsBuilder plugin could be parameterized with the intended implementation of FragList:

{code:xml}
<highlighter>
 <fragmentsBuilder name="weight-ordered" class="org.apache.solr.highlight.OrderedFragmentsBuilder">
  <fragList class="org.apache.lucene.search.vectorhighlight.WeightedFragList" />
 </fragmentsBuilder>
 <fragmentsBuilder name="boost-ordered" class="org.apache.solr.highlight.OrderedFragmentsBuilder">
  <fragList class="org.apache.lucene.search.vectorhighlight.BoostedFragList" />
 </fragmentsBuilder>
</highlighter>
{code}    
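
One possible shape for such an interface, purely as a sketch: nothing below exists in Lucene or in the patch. The `FragList` interface, its `add` signature and the `Frag` holder are assumptions, and the term weights are passed in as plain arrays instead of being looked up in the index. The point is only that the score is computed at the moment a fragment is added, so the FragmentsBuilder does not need a second pass.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a pluggable frag list; all names here are assumptions. */
final class FragListSketch {

  /** Minimal fragment holder: offsets plus the score fixed at insertion time. */
  static final class Frag {
    final int start;
    final int end;
    final float score;

    Frag( int start, int end, float score ) {
      this.start = start;
      this.end = end;
      this.score = score;
    }
  }

  /** Assumed interface: implementations decide how a fragment is scored. */
  interface FragList {
    void add( int start, int end, float[] boosts, float[] weights );
    List<Frag> frags();
  }

  /** Shared bookkeeping; subclasses only supply the scoring rule. */
  static abstract class AbstractFragList implements FragList {
    private final List<Frag> frags = new ArrayList<Frag>();

    public void add( int start, int end, float[] boosts, float[] weights ) {
      frags.add( new Frag( start, end, score( boosts, weights ) ) );
    }

    public List<Frag> frags() {
      return frags;
    }

    abstract float score( float[] boosts, float[] weights );
  }

  /** Sum of query boosts; the weights are ignored. */
  static final class BoostedFragList extends AbstractFragList {
    @Override
    float score( float[] boosts, float[] weights ) {
      float sum = 0;
      for( float b : boosts )
        sum += b;
      return sum;
    }
  }

  /** Sum of weight^2 * boost per term, with precomputed weights. */
  static final class WeightedFragList extends AbstractFragList {
    @Override
    float score( float[] boosts, float[] weights ) {
      float sum = 0;
      for( int i = 0; i < boosts.length; i++ )
        sum += weights[ i ] * weights[ i ] * boosts[ i ];
      return sum;
    }
  }

  public static void main( String[] args ) {
    float[] boosts = { 1f, 2f };
    float[] weights = { 3f, 1f };
    FragList boosted = new BoostedFragList();
    FragList weighted = new WeightedFragList();
    boosted.add( 0, 10, boosts, weights );
    weighted.add( 0, 10, boosts, weights );
    System.out.println( "boosted:  " + boosted.frags().get( 0 ).score );
    System.out.println( "weighted: " + weighted.frags().get( 0 ).score );
  }
}
```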

Further notes:
- As shown in this patch, "WeightedFragInfo.totalBoost" should be renamed to "WeightedFragInfo.score".
- "ScoreOrderFragmentsBuilder" should be renamed to "BoostOrderFragmentsBuilder".
                
> FastVectorHighlighter: IDF-weighted terms for ordered fragments 
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3440
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3440
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 3.5, 4.0
>            Reporter: sebastian L.
>            Priority: Minor
>              Labels: FastVectorHighlighter
>             Fix For: 3.5, 4.0
>
>         Attachments: LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java, LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-6.patch, WeightOrderFragmentsBuilder_table01.html, WeightOrderFragmentsBuilder_table02.html
>
>
> The FastVectorHighlighter gives every term found in a fragment an equal weight, which causes fragments with a high number of words or, in the worst case, a high number of very common words to rank higher than fragments that contain *all* of the terms used in the original query.
> This patch provides ordered fragments with IDF-weighted terms: 
> total weight = total weight + IDF for unique term per fragment * boost of query; 
> The ranking formula should be the same as, or at least similar to, the one used in org.apache.lucene.search.highlight.QueryTermScorer.
> The patch is simple, but it works for us. 
> Some ideas:
> - A better approach would be moving the whole fragments-scoring into a separate class.
> - Switch scoring via parameter 
> - Exact phrases should be given an even better score, regardless of whether a phrase query was executed or not
> - edismax/dismax-parameters pf, ps and pf^boost should be observed and corresponding fragments should be ranked higher
