lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Mon, 16 Mar 2009 10:37:50 GMT


Michael McCandless commented on LUCENE-1522:

bq. I'm not sure if I understand what you are asking, but if you talk about "hl.requireFieldMatch
feature in Solr", YES. highlighter2 has the feature:

Actually I was asking whether every fragment that's returned is
guaranteed to show a match to my original query.

EG if my query is a PhraseQuery, is it guaranteed that all fragments
presented are valid matches?  If I search for "Alan Greenspan's
mortgage", is it ever possible to see a fragment that contains only
"Alan Greenspan"?

bq. Currently, no. Highlighter2 calls the flatten() method at the start to try to flatten the
sourceQuery. The flatten() method recognizes TermQuery and PhraseQuery, and a BooleanQuery
that contains TermQuery and PhraseQuery:

OK so eg *SpanQuery won't work?  It seems like both highlighters take
this "flatten" approach, which can lose the constraints for
interesting queries (like Span, or a custom query).

I think a nice [eventual] model would be if we could simply re-run the
scorer on the single document (using InstantiatedIndex maybe, or
simply some sort of wrapper on the term vectors which are already a
mini-inverted-index for a single doc), but extend the scorer API to
tell us the exact term occurrences that participated in a match (which
I don't think is exposed today).
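
To make the "mini-inverted-index for a single doc" idea concrete, here is a
minimal sketch in plain Java collections. It does not use Lucene's term
vector classes; SingleDocIndex and positions() are hypothetical names chosen
for illustration, and the tokens are assumed to be already analyzed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a term vector viewed as a one-document inverted index. */
public class SingleDocIndex {
    // term -> ascending token positions of that term within this one document
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public SingleDocIndex(String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
    }

    /** Positions at which the term occurs, or an empty list if it is absent. */
    public List<Integer> positions(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }
}
```

A scorer re-run against such a structure would only ever see this one
document's postings, which is what would make per-occurrence reporting cheap
enough for highlighting.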

EG ExactPhraseScorer.phraseFreq has the logic to check term positions
and find all positions where the phrase matches.  Right now that
method throws away the specific position where each match occurred,
but if instead we had it call a normally no-op method
(recordDocMatchPosition(int position, float score) or some such), we
could then make use of it for highlighting.
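
A rough sketch of what such a hook might look like, written against plain
token arrays rather than Lucene's scorer classes. Only the
recordDocMatchPosition(int position, float score) signature comes from the
suggestion above; PhrasePositionSketch, MatchRecorder, and the constant
score of 1.0f are assumptions made for illustration, not existing Lucene
API.

```java
public class PhrasePositionSketch {

    /** Hypothetical callback; a real scorer would default to a no-op implementation. */
    public interface MatchRecorder {
        void recordDocMatchPosition(int position, float score);
    }

    /**
     * Counts exact (slop 0) occurrences of the phrase in one document's tokens
     * and reports the start position of each occurrence to the recorder.
     */
    public static int phraseFreq(String[] docTokens, String[] phrase, MatchRecorder recorder) {
        if (phrase.length == 0) {
            return 0;
        }
        int freq = 0;
        for (int start = 0; start + phrase.length <= docTokens.length; start++) {
            boolean match = true;
            for (int i = 0; i < phrase.length; i++) {
                if (!docTokens[start + i].equals(phrase[i])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                freq++;
                // Placeholder score; a real scorer would pass its own per-match value.
                recorder.recordDocMatchPosition(start, 1.0f);
            }
        }
        return freq;
    }
}
```

With a recorder that collects positions, a highlighter would only be handed
the occurrences that took part in a full phrase match, so a stray
"alan greenspan" elsewhere in the document would never be reported.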

> another highlighter
> -------------------
>                 Key: LUCENE-1522
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
> I've written this highlighter for my project to support bi-gram token streams (general
> token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch).
> The idea was inherited from my previous project with my colleague and from LUCENE-644.
> This approach needs highlighted fields to be indexed with TermVector.WITH_POSITIONS_OFFSETS,
> but it is fast and can support N-grams.
> This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 ); // searcher: an IndexSearcher (assumed name)
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token streams, but also "fixed size" N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boost to score fragments (currently ignores idf, but supporting it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
