lucene-dev mailing list archives

From "Mark Miller (JIRA)" <>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Mon, 23 Mar 2009 21:09:50 GMT


Mark Miller commented on LUCENE-1522:

I think you are reading more into that than I see - that guy is just frustrated that PhraseQueries
don't highlight correctly. That was/is a common occurrence and you can find tons of examples.
There are one or two JIRA highlighters that address it, and then there is the Span highlighter
(more interestingly, there is a link to the birth of the Span highlighter idea on that page
- thanks M. Harwood).

Once users see PhraseQuery highlighting look right, I haven't really seen any other repeated complaints.
While it would be nice to match boolean logic fully, I almost don't think it's worth the effort.
You likely have an interest in those terms anyway - it's not a given that the terms that caused
the match (non positional) matter. I have not seen a complaint on that one - mostly just positional
type stuff. And I think we have positional solved fairly well with the current API - it's just
too darn slow. Not that I am against things being sweet and perfect, and getting exact matches,
but there has been lots of talk in the past about integrating the highlighter into core and
making things really fast and efficient - and generally it comes down to what work actually
gets done (and all this stuff ends up at the hard end of the pool).

When I wrote the SpanScorer, many times it was discussed how things should *really* be done.
Most methods involved working with core - but what has been there for a couple years now is
the SpanScorer that plugs into the current highlighter API and nothing else has made any progress.
Not really an argument, just kind of thinking out loud at this point...

I'm all for improving the speed and accuracy of the highlighter at the end of the day, but
it's a tall order considering how much attention the Highlighter has managed to receive in
the past. It's large on ideas and low on sweat.

> another highlighter
> -------------------
>                 Key: LUCENE-1522
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
> I've written this highlighter for my project to support bi-gram token stream (general
> token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea
> was inherited from my previous project with my colleague and LUCENE-644. This approach needs
> highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams.
> This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g.
>   (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks at query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers
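
For illustration, the slop example in the quoted issue (q="w1 w2"~1) can be reproduced
with a small standalone Java sketch. It does not use Lucene or the patch: the class and
method names (SlopPhraseHighlightSketch, markMatches, highlight) are hypothetical, tokens
are split on single spaces, and slop is simplified to "phrase terms in order, with at most
`slop` skipped positions in total", which is narrower than Lucene's sloppy-phrase matching.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the LUCENE-1522 implementation.
public class SlopPhraseHighlightSketch {

    // Marks every token that takes part in an in-order match of `phrase`
    // whose total number of skipped positions is at most `slop`.
    // Assumption: greedy matching (the first candidate position wins).
    static boolean[] markMatches(String[] tokens, String[] phrase, int slop) {
        boolean[] hit = new boolean[tokens.length];
        for (int start = 0; start < tokens.length; start++) {
            if (!tokens[start].equals(phrase[0])) {
                continue;
            }
            List<Integer> positions = new ArrayList<Integer>();
            positions.add(start);
            int remaining = slop; // slop budget still available
            int prev = start;     // position of the previously matched term
            boolean matched = true;
            for (int k = 1; k < phrase.length && matched; k++) {
                int found = -1;
                // Scan forward while the gap still fits into the budget.
                for (int p = prev + 1; p < tokens.length && p - prev - 1 <= remaining; p++) {
                    if (tokens[p].equals(phrase[k])) {
                        found = p;
                        break;
                    }
                }
                if (found < 0) {
                    matched = false;
                } else {
                    remaining -= found - prev - 1;
                    positions.add(found);
                    prev = found;
                }
            }
            if (matched) {
                for (int p : positions) {
                    hit[p] = true;
                }
            }
        }
        return hit;
    }

    // Wraps each run of adjacent matched tokens in a single <b>...</b> pair.
    static String highlight(String text, String[] phrase, int slop) {
        String[] tokens = text.split(" ");
        boolean[] hit = markMatches(tokens, phrase, slop);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            if (hit[i] && (i == 0 || !hit[i - 1])) {
                sb.append("<b>");
            }
            sb.append(tokens[i]);
            if (hit[i] && (i == tokens.length - 1 || !hit[i + 1])) {
                sb.append("</b>");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] phrase = {"w1", "w2"};
        System.out.println(highlight("w1 w2", phrase, 0));
        System.out.println(highlight("w1 w3 w2 w3 w1 w2", phrase, 1));
    }
}
```

Running main prints `<b>w1 w2</b>` and then `<b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>`, matching
the two {noformat} cases in the issue. A real highlighter would work from stored term
positions and offsets rather than re-tokenizing the text.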

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
