lucene-dev mailing list archives

From "Michael Busch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Tue, 17 Mar 2009 19:08:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682762#action_12682762 ]

Michael Busch commented on LUCENE-1522:
---------------------------------------

I wrote the highlighter for the OmniFind Yahoo Edition a few years ago
and I totally agree that all this stuff is very subjective.

The OYE highlighter is of course based on Lucene and uses a sliding
window too. It also uses information about sentence boundaries and
prefers fragments that start at the beginning of a sentence.

So it goes through the document and generates fragment candidates on
the fly. It calculates a score for each fragment and puts it into a
priority queue. The score is calculated using different heuristics:
- fragments that start at the beginning of a sentence are boosted
- the more highlighted terms a fragment contains, the higher it is
scored
- many different highlighted terms score higher than many
occurrences of the same term
- no tf-idf is used
- if a fragment does not start at the beginning of a sentence, then it
is scored higher if the highlighted term(s) occur(s) more in the middle
of the fragment: e.g. 'a b c d e' scores lower than 'b c a d e' if 'a'
is the highlighted term; this is being done to show as much context as 
possible around a highlighted term
- only a single long fragment is created if it contains all query terms
(like google)
- The queue tries to gather fragments so that the union of the fragments
contains as many different query terms as possible. So it might toss a
fragment in favor of one with a higher score, if it increases the
total number of different highlighted terms.
- For performance reasons there is an early termination if the
fragments in the queue contain all query terms.
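
A minimal Java sketch of how heuristics like these could fit together. It
is not the OYE code: the class and method names (FragmentScoring, Fragment,
score, select) and all weights are assumptions made up for illustration.
It covers the sentence-start boost, the preference for distinct terms over
repeats, the centering bonus, and the early termination. It leaves out the
coverage-aware eviction and the single-long-fragment rule.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch: names and weights are illustrative assumptions.
final class FragmentScoring {
    static final double SENTENCE_BOOST = 2.0;  // assumed weight
    static final double DISTINCT_WEIGHT = 3.0; // assumed weight
    static final double HIT_WEIGHT = 1.0;      // assumed weight

    /** A candidate fragment: its tokens and whether it starts a sentence. */
    static final class Fragment {
        final List<String> tokens;
        final boolean startsSentence;

        Fragment(List<String> tokens, boolean startsSentence) {
            this.tokens = tokens;
            this.startsSentence = startsSentence;
        }
    }

    /** Scores a fragment without tf-idf; 0.0 if it has no highlighted term. */
    static double score(Fragment f, Set<String> queryTerms) {
        Set<String> distinct = new HashSet<>();
        int hits = 0;
        double centering = 0.0;
        double mid = (f.tokens.size() - 1) / 2.0;
        for (int i = 0; i < f.tokens.size(); i++) {
            String token = f.tokens.get(i);
            if (!queryTerms.contains(token)) continue;
            hits++;
            distinct.add(token);
            // 1.0 at the center of the fragment, 0.0 at either edge
            centering += mid == 0 ? 1.0 : 1.0 - Math.abs(i - mid) / mid;
        }
        if (hits == 0) return 0.0;
        // Distinct terms weigh more than repeated occurrences.
        double s = DISTINCT_WEIGHT * distinct.size() + HIT_WEIGHT * hits;
        if (f.startsSentence) {
            s += SENTENCE_BOOST;
        } else {
            s += centering / hits; // average centering of the hits
        }
        return s;
    }

    /** True if the fragments together contain every query term. */
    static boolean coversAll(Iterable<Fragment> fragments, Set<String> queryTerms) {
        Set<String> seen = new HashSet<>();
        for (Fragment f : fragments) {
            for (String t : f.tokens) {
                if (queryTerms.contains(t)) seen.add(t);
            }
        }
        return seen.containsAll(queryTerms);
    }

    /** Keeps the best fragments; stops early once all query terms are covered. */
    static List<Fragment> select(Iterable<Fragment> candidates,
                                 Set<String> queryTerms, int maxFragments) {
        Comparator<Fragment> byScore =
                Comparator.comparingDouble(f -> score(f, queryTerms));
        // Min-heap: the head is always the weakest fragment kept so far.
        PriorityQueue<Fragment> queue = new PriorityQueue<>(byScore);
        for (Fragment candidate : candidates) {
            queue.add(candidate);
            if (queue.size() > maxFragments) queue.poll(); // drop the weakest
            if (coversAll(queue, queryTerms)) break;       // early termination
        }
        List<Fragment> best = new ArrayList<>(queue);
        best.sort(byScore.reversed());
        return best;
    }
}
```

With 'a' as the only highlighted term, this sketch scores 'b c a d e'
above 'a b c d e' when neither fragment starts a sentence, matching the
example in the list above.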

Initially this highlighter also imitated Lucene's behavior to find the
highlighted positions. Last year I changed it to use SpanQueries. With
our flexible query parser (which I introduced on java-dev recently) we
have two different QueryBuilders. One creates the "normal" query, that
is executed to find the matching docs. Then a different QueryBuilder
creates SpanQueries from the same query for the highlighter.

The output of the highlighter is not formatted html, but rather an
object containing the unformatted text, together with offset
information for both fragments and highlights. These offset spans can
carry additional information, which can be used for multi-color
highlighting too. We then use an HTMLFormatter class to generate the
formatted text. An XMLFormatter that keeps the offset information
separate from the actual text is also possible (we're currently working
on such an XMLFormatter). This is useful for frontends written in e.g. Flex.
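
To illustrate the separation of text and offsets, here is a small Java
sketch. The names (OffsetExcerpt, Span, toHtml) are hypothetical and not
the OYE classes; the label on each span stands in for the "additional
information" that could drive multi-color highlighting. The sketch assumes
the spans are sorted by start offset and do not overlap. Fragment
boundaries could be carried as spans of the same kind.

```java
import java.util.List;

// Hypothetical sketch: unformatted text plus offset spans, formatted later.
final class OffsetExcerpt {

    /** A half-open character range [start, end) with an attached label. */
    static final class Span {
        final int start;
        final int end;
        final String label; // e.g. a per-term CSS class for multi-color output

        Span(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }
    }

    /** Renders HTML; assumes spans are sorted and non-overlapping. */
    static String toHtml(String text, List<Span> highlights) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Span s : highlights) {
            out.append(escape(text.substring(pos, s.start)));
            out.append("<span class=\"").append(s.label).append("\">")
               .append(escape(text.substring(s.start, s.end)))
               .append("</span>");
            pos = s.end;
        }
        out.append(escape(text.substring(pos)));
        return out.toString();
    }

    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Because the highlighter's result stays unformatted, a different formatter
(for example one emitting XML with the offsets kept apart from the text)
can consume the same text and spans without re-running the highlighter.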

The performance of our highlighter is good and so far we have been
pretty happy with the quality of the excerpts, but there is still much
room for improvement.

I'd be happy to help work on a new highlighter. I think this is a
very important component, and Lucene's core should have a very good
and flexible one.

> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token streams (general
> token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch). The idea
> was inherited from my previous project with my colleague and LUCENE-644. This approach needs
> highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams.
> This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g.
>   (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks at query boost to score fragments (currently doesn't look at idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

