lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Wed, 18 Mar 2009 15:51:50 GMT


Michael McCandless commented on LUCENE-1522:

Something like that. An array of span scores is too limited; a full-fledged
class would do better. Designing that class requires striking a balance
between what information we think is useful and what information Highlighter
can sanely reduce.

Agreed, and I'm not sure about the tree structure (just floating
ideas...).  It could very well be overkill.

By proposing the tree structure, you're suggesting that 
Highlighter will reverse engineer boolean matching; that sounds like a lot of 
work to me.

It wouldn't be reverse engineered: BooleanQuery/Weight/Scorer2 itself
will have returned that.  Ie we would add a method to

Still, the KS highlighter probably wouldn't do what you describe.  The proximity
boosting accelerates as the spans approach each other, and maxes out if 
they're adjacent.  So "bush bush" might be preferred over "president bush", 
but "bush or bush" probably wouldn't.

OK, it sounds like one can simply use different models to score
fragdocs and it's still an open debate how much each of these criteria
(IDF, showing surround context, being on sentence boundary, diversity
of terms) should impact the score.  I agree, the "basic litmus test" I
proposed is too strong.
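
To make the proximity boosting described above concrete, here is a minimal sketch of one possible scoring curve. It is not the KS implementation: the class name, the `window` and `maxBoost` parameters, and the quadratic curve are all assumptions chosen only to show a boost that grows faster as two spans approach each other and maxes out when they are adjacent.

```java
// Hypothetical sketch of a proximity boost; not taken from KinoSearch or Lucene.
final class ProximityBoost {

    /**
     * @param gapTokens number of tokens between two matching spans (0 = adjacent)
     * @param window    assumed distance at which the boost disappears
     * @param maxBoost  assumed boost applied to adjacent spans
     */
    static double boost(int gapTokens, int window, double maxBoost) {
        if (gapTokens <= 0) {
            return maxBoost;            // adjacent spans get the full boost
        }
        if (gapTokens >= window) {
            return 1.0;                 // too far apart: no boost
        }
        double closeness = 1.0 - (double) gapTokens / window; // between 0 and 1
        // Squaring makes the boost rise faster as the gap shrinks.
        return 1.0 + (maxBoost - 1.0) * closeness * closeness;
    }
}
```

With a curve like this, "bush bush" (gap 0) receives the full boost, while "bush or bush" (gap 1) receives a little less. Whether that difference is enough to change which fragment wins depends on how the other criteria are weighted.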

bq. Lucene H1. Too many ellipses, and fragments don't prefer to start on sentence boundaries.

That's not necessarily a property of the Highlighter, just the basic
implementations we currently supply for the pluggable classes. You can
supply a custom fragmenter and you can control the number of

I agree: H1 is very pluggable and one could plug in a better
fragmenter, but we don't offer such an impl in H1, and this is a case
where "out-of-the-box defaults" are very important.
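
As a rough illustration of what a better default could do, the sketch below snaps the start of a fragment back to the beginning of the sentence that contains the first match. It deliberately does not implement H1's Fragmenter interface; it uses only the JDK's `java.text.BreakIterator`, and the class name, method names, and `maxChars` parameter are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper, independent of the Lucene highlighter API.
final class SentenceFragments {

    /** Returns the start offset of the sentence that contains matchOffset. */
    static int sentenceStart(String text, int matchOffset) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        if (sentences.isBoundary(matchOffset)) {
            return matchOffset;         // the match already starts a sentence
        }
        return sentences.preceding(matchOffset);
    }

    /** Builds a fragment of at most maxChars that begins on a sentence boundary. */
    static String fragment(String text, int matchOffset, int maxChars) {
        int start = sentenceStart(text, matchOffset);
        int end = Math.min(text.length(), start + maxChars);
        return text.substring(start, end);
    }
}
```

A real fragmenter would also need to handle matches that sit further into a sentence than `maxChars` allows; this sketch ignores that case.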

> another highlighter
> -------------------
>                 Key: LUCENE-1522
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
> I've written this highlighter for my project to support bi-gram token stream (general
> token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea
> was inherited from my previous project with my colleague and LUCENE-644. This approach needs
> highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams.
> This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 ); // searcher: an IndexSearcher (assumed; the call is truncated in this copy)
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g.
>   (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boost to score fragments (currently ignores idf, but supporting it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
