lucene-dev mailing list archives

From "Koji Sekiguchi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Sun, 15 Mar 2009 00:58:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682111#action_12682111 ]

Koji Sekiguchi commented on LUCENE-1522:
----------------------------------------

Mike, sorry for the late reply.

bq. Is this approach guaranteed to only highlight term occurrences that actually contribute to the document match?

I'm not sure I understand the question, but if you mean something like the hl.requireFieldMatch
feature in Solr, then yes. Highlighter2 has that feature, via the fieldMatch constructor argument:

{code:java}
/**
 * a constructor. A FragListBuilder and a FragmentsBuilder can be specified (plugins).
 * 
 * @param phraseHighlight true or false for phrase highlighting
 * @param fieldMatch true or false for field matching
 * @param fragListBuilder an instance of FragListBuilder
 * @param fragmentsBuilder an instance of FragmentsBuilder
 */
public Highlighter( boolean phraseHighlight, boolean fieldMatch,
    FragListBuilder fragListBuilder, FragmentsBuilder fragmentsBuilder ){
  this.phraseHighlight = phraseHighlight;
  this.fieldMatch = fieldMatch;
  this.fragListBuilder = fragListBuilder;
  this.fragmentsBuilder = fragmentsBuilder;
}
{code}
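
To make the effect of the fieldMatch flag concrete, here is a minimal standalone sketch. It assumes fieldMatch behaves like Solr's hl.requireFieldMatch (only query terms that target the field being highlighted are eligible). FieldMatchSketch, QueryTerm and termsToHighlight are hypothetical names for illustration and are not part of the patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: none of these types come from the highlighter2 patch.
final class FieldMatchSketch {

  // Hypothetical stand-in for a query term: the field it targets plus its text.
  static final class QueryTerm {
    final String field;
    final String text;

    QueryTerm(String field, String text) {
      this.field = field;
      this.text = text;
    }
  }

  // Returns the term texts eligible for highlighting in targetField.
  // Assumption: with fieldMatch == true only terms that query targetField
  // qualify; with fieldMatch == false every query term qualifies.
  static List<String> termsToHighlight(List<QueryTerm> terms, String targetField,
      boolean fieldMatch) {
    List<String> eligible = new ArrayList<String>();
    for (QueryTerm term : terms) {
      if (!fieldMatch || term.field.equals(targetField)) {
        eligible.add(term.text);
      }
    }
    return eligible;
  }

  public static void main(String[] args) {
    // Models the query: title:lucene OR content:highlighter
    List<QueryTerm> terms = Arrays.asList(
        new QueryTerm("title", "lucene"),
        new QueryTerm("content", "highlighter"));
    System.out.println(termsToHighlight(terms, "content", true));  // [highlighter]
    System.out.println(termsToHighlight(terms, "content", false)); // [lucene, highlighter]
  }
}
```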

bq. Can it handle all / arbitrary Query subclasses?

Currently, no. Highlighter2 first calls flatten() to flatten the sourceQuery.
flatten() recognizes only TermQuery, PhraseQuery, and BooleanQuery instances that contain
TermQuery and PhraseQuery. Prohibited clauses are skipped and any other query type is discarded:

{code:title=FieldQuery.java}
void flatten( Query sourceQuery, Collection<Query> flatQueries ){
  if( sourceQuery instanceof BooleanQuery ){
    BooleanQuery bq = (BooleanQuery)sourceQuery;
    for( BooleanClause clause : bq.getClauses() ){
      if( !clause.isProhibited() )
        flatten( clause.getQuery(), flatQueries );
    }
  }
  else if( sourceQuery instanceof TermQuery ){
    if( !flatQueries.contains( sourceQuery ) )
      flatQueries.add( sourceQuery );
  }
  else if( sourceQuery instanceof PhraseQuery ){
    if( !flatQueries.contains( sourceQuery ) ){
      PhraseQuery pq = (PhraseQuery)sourceQuery;
      if( pq.getTerms().length > 1 )
        flatQueries.add( pq );
      else if( pq.getTerms().length == 1 ){
        flatQueries.add( new TermQuery( pq.getTerms()[0] ) );
      }
    }
  }
  // else discard queries
}
{code}

That said, I'm open to supporting arbitrary Query subclasses in H2. :)

bq. How does it score fragments?

Currently, H2 takes into account the query-time boost and the term frequency (tf) within the
fragment. For example, given q="a OR b^3" and two fragment candidates f1="a a a" and f2="a b",
f1 scores 3 and f2 scores 4, so getBestFragments() returns f2 first, then f1, when the default
ScoreOrderFragmentsBuilder is used.
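
The arithmetic in that example can be reproduced with a small standalone sketch. It assumes the score is simply the sum of boost x tf over the query terms, which matches the numbers above but is not taken from the patch itself. FragmentScoreSketch and score are hypothetical names, and whitespace splitting stands in for real tokenization:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a simplified model of the scoring described above,
// not the code from the highlighter2 patch.
final class FragmentScoreSketch {

  // Adds a term's boost once per occurrence in the fragment, which equals
  // boost * tf summed over all query terms. Tokens that are not query
  // terms contribute nothing.
  static float score(Map<String, Float> boosts, String fragment) {
    float total = 0f;
    for (String token : fragment.trim().split("\\s+")) {
      Float boost = boosts.get(token);
      if (boost != null) {
        total += boost;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    // Models q="a OR b^3": term a has the default boost 1, term b has boost 3.
    Map<String, Float> boosts = new LinkedHashMap<String, Float>();
    boosts.put("a", 1.0f);
    boosts.put("b", 3.0f);

    System.out.println("f1 = " + score(boosts, "a a a")); // f1 = 3.0
    System.out.println("f2 = " + score(boosts, "a b"));   // f2 = 4.0
  }
}
```

Sorting the candidates by this score in descending order gives f2 before f1, which is the order described above.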


> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token streams (general
> token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch).
> The idea was inherited from my previous project with my colleague and LUCENE-644. This
> approach requires highlighted fields to be TermVector.WITH_POSITIONS_OFFSETS, but it is fast
> and can support N-grams. It depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token streams, but also "fixed size" N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses the query boost to score fragments (currently ignores idf, but supporting it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions may be unnecessary when phraseHighlight==false
> - collect performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

