lucene-dev mailing list archives

From "Sebastian Lutze (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-4133) FastVectorHighlighter: A weighted approach for ordered fragments
Date Mon, 11 Jun 2012 16:13:42 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Lutze updated LUCENE-4133:
------------------------------------

    Description: 
The FastVectorHighlighter currently disregards IDF weights for matching terms within generated
fragments. In the worst case, a fragment that contains a high number of very common words
is scored higher than a fragment that contains *all* of the terms used in
the original query.

This patch provides ordered fragments with IDF-weighted terms:

*For each distinct matching term per fragment:* 
_weight = weight + IDF * boost_

*For each fragment:* 
_weight = weight * length * 1 / sqrt( length )_

|weight| total weight of fragment 
|IDF| inverse document frequency for each distinct matching term
|boost| query boost as provided, for example _term^2_
|length| total number of matching terms per fragment, counting repeated terms 
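
As a rough illustration of the two rules above, here is a minimal, self-contained sketch that does not use any Lucene classes. The class name FragmentWeightSketch, the helper fragmentWeight, the IDF values and the zero-length guard are all assumptions made for this example and are not part of the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FragmentWeightSketch {

  // Hypothetical helper: applies the two rules from the description to a
  // list of matching terms (repeats included), given made-up IDF values.
  static double fragmentWeight(List<String> matchingTerms, Map<String, Double> idf, double boost) {
    Set<String> distinctTerms = new HashSet<>();
    double weight = 0.0;
    int length = 0;
    for (String term : matchingTerms) {
      // Rule 1: each distinct term adds IDF * boost exactly once.
      if (distinctTerms.add(term)) {
        weight += idf.get(term) * boost;
      }
      // Every occurrence counts towards length, distinct or not.
      length++;
    }
    // Assumption of this sketch: an empty fragment gets weight 0.
    if (length == 0) {
      return 0.0;
    }
    // Rule 2: weight = weight * length * 1 / sqrt( length )
    return weight * length * (1.0 / Math.sqrt(length));
  }

  public static void main(String[] args) {
    // Made-up IDF values: "the" is very common, the other terms are rare.
    Map<String, Double> idf = new HashMap<>();
    idf.put("the", 0.1);
    idf.put("lucene", 3.0);
    idf.put("highlighter", 4.0);

    double common = fragmentWeight(Arrays.asList("the", "the", "the", "the"), idf, 1.0);
    double complete = fragmentWeight(Arrays.asList("lucene", "highlighter"), idf, 1.0);

    System.out.println("four common terms: " + common);
    System.out.println("two rare terms:    " + complete);
  }
}
```

With these made-up values, the fragment that repeats one common word four times ends up with a lower weight than the fragment that contains two rare query terms once each, which is the ordering the description asks for.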


*Method:*

{code:java}
  public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {

    float totalBoost = 0;

    List<SubInfo> subInfos = new ArrayList<SubInfo>();
    // Terms that have already been counted towards the weight of this fragment.
    HashSet<String> distinctTerms = new HashSet<String>();

    // Number of matching terms in this fragment, repeats included.
    int length = 0;

    for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
      subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
      for ( TermInfo ti : phraseInfo.getTermsInfos() ) {
        // Each distinct term contributes its IDF weight, scaled by the query boost, only once.
        if ( distinctTerms.add( ti.getText() ) )
          totalBoost += ti.getWeight() * phraseInfo.getBoost();
        length++;
      }
    }
    // Normalize by the number of matching terms: weight * length * 1 / sqrt( length ).
    totalBoost *= length * ( 1 / Math.sqrt( length ) );

    getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
  }
{code}
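
One detail of the normalization line in the method above may be worth checking: if length is 0, Java floating-point arithmetic evaluates 1 / Math.sqrt( 0 ) to Infinity and 0 * Infinity to NaN. Whether add() is ever called with an empty phrase list is not stated here, so the following is only a sketch of that edge case. The class NormalizationEdgeCase and both method names are hypothetical, and the guarded variant is one possible alternative, not the patch's code.

```java
public class NormalizationEdgeCase {

  // Same expression as in the posted method, isolated for experimentation.
  static float normalizeAsPosted(float totalBoost, int length) {
    totalBoost *= length * (1 / Math.sqrt(length));
    return totalBoost;
  }

  // Hypothetical guarded variant. length * (1 / sqrt(length)) is
  // mathematically equal to sqrt(length), so that form is used directly.
  static float normalizeGuarded(float totalBoost, int length) {
    if (length == 0) {
      return 0f;
    }
    return (float) (totalBoost * Math.sqrt(length));
  }

  public static void main(String[] args) {
    // The posted expression yields NaN when there are no matching terms.
    System.out.println(Float.isNaN(normalizeAsPosted(0f, 0))); // prints "true"
    // The guarded variant returns 0 in that case.
    System.out.println(normalizeGuarded(0f, 0));
    // For a non-empty fragment both variants are expected to agree.
    System.out.println(normalizeAsPosted(2f, 4) + " " + normalizeGuarded(2f, 4));
  }
}
```

A NaN weight could upset any later sorting of fragments by score, so a guard of this kind might be useful if empty lists can occur.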

The ranking formula should be the same as, or at least similar to, the one used in QueryTermScorer.

*This patch contains:*

* a renamed class member in FieldPhraseList (termInfos to termsInfos)
* a renamed local variable in SimpleFieldFragList (score to totalBoost)
* a missing @Override annotation added in SimpleFragListBuilder
* class WeightedFieldFragList, an implementation of FieldFragList
* class WeightedFragListBuilder, an implementation of BaseFragListBuilder
* class WeightedFragListBuilderTest, a simple test-case 
* updated docs for FVH 

Last part (see also LUCENE-4091, LUCENE-4107, LUCENE-4113) of LUCENE-3440. 


  was:
The FastVectorHighlighter currently disregards IDF weights for matching terms within generated
fragments. In the worst case, a fragment that contains a high number of very common words
is scored higher than a fragment that contains *all* of the terms used in
the original query.

This patch provides ordered fragments with IDF-weighted terms:

*For each distinct matching term per fragment:* 
_weight = weight + IDF * boost_

*For each fragment:* 
_weight = weight * numTerms * 1 / sqrt( numTerms )_

|weight| total weight of fragment 
|IDF| inverse document frequency for each distinct matching term
|boost| query boost as provided, for example _term^2_
|numTerms| total number of matching terms per fragment 


*Method:*

{code:java}
  public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
    
    float totalBoost = 0;
    
    List<SubInfo> subInfos = new ArrayList<SubInfo>();
    HashSet<String> distinctTerms = new HashSet<String>();
    
    int length = 0;

    for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
      subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
      for ( TermInfo ti :  phraseInfo.getTermsInfos()) {
        if ( distinctTerms.add( ti.getText() ) )
          totalBoost += ti.getWeight() * phraseInfo.getBoost();
        length++;
      }
    }
    totalBoost *= length * ( 1 / Math.sqrt( length ) );
    
    getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
  }
{code}

The ranking formula should be the same as, or at least similar to, the one used in QueryTermScorer.

*This patch contains:*

* a renamed class member in FieldPhraseList (termInfos to termsInfos)
* a renamed local variable in SimpleFieldFragList (score to totalBoost)
* a missing @Override annotation added in SimpleFragListBuilder
* class WeightedFieldFragList, an implementation of FieldFragList
* class WeightedFragListBuilder, an implementation of BaseFragListBuilder
* class WeightedFragListBuilderTest, a simple test-case 
* updated docs for FVH 

Last part (see also LUCENE-4091, LUCENE-4107, LUCENE-4113) of LUCENE-3440. 


    
> FastVectorHighlighter: A weighted approach for ordered fragments
> ----------------------------------------------------------------
>
>                 Key: LUCENE-4133
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4133
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 4.0, 5.0
>            Reporter: Sebastian Lutze
>            Priority: Minor
>              Labels: FastVectorHighlighter
>             Fix For: 4.0
>
>         Attachments: LUCENE-4133.patch
>
>

