lucene-dev mailing list archives

From "Marvin Humphrey (JIRA)" <>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Tue, 17 Mar 2009 20:05:50 GMT


Marvin Humphrey commented on LUCENE-1522:

>> ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
>> spans produced by their children.
> Hmm - it seems like that loses information.  Ie, for ANDQuery, you lose the 
> fact that you should try to include a match from each of the sub-clauses' spans.

A good idea.  ANDQuery's highlightSpans() method could probably be improved by
post-processing the child spans to take this into account.  That way we
wouldn't have to gum up the main Highlighter code with a bunch of conditionals
which afford special treatment to certain query types.
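
To make the post-processing idea concrete, here is a minimal sketch of one
way it might look.  Everything in it is hypothetical: the class, the Span
type, the "window" and "boost" parameters, and the rule itself (boost a span
when every other child clause has a span starting within "window" characters
of it) are assumptions for illustration, not code from any patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; none of these names come from Lucene or the patch. */
final class AndSpanSketch {

    /** A character range the highlighter should treat as important. */
    static final class Span {
        final int offset;
        final int length;
        final double weight;

        Span(int offset, int length, double weight) {
            this.offset = offset;
            this.length = length;
            this.weight = weight;
        }
    }

    /**
     * Returns the union of all child spans, sorted by offset. A span's weight
     * is multiplied by boost when every other child has a span that starts
     * within window characters of it (assumed proximity rule).
     */
    static List<Span> postProcess(List<List<Span>> childSpans, int window, double boost) {
        List<Span> result = new ArrayList<>();
        for (int c = 0; c < childSpans.size(); c++) {
            for (Span s : childSpans.get(c)) {
                boolean allNearby = true;
                for (int other = 0; other < childSpans.size() && allNearby; other++) {
                    if (other != c) {
                        allNearby = hasSpanNear(childSpans.get(other), s, window);
                    }
                }
                result.add(allNearby ? new Span(s.offset, s.length, s.weight * boost) : s);
            }
        }
        result.sort(Comparator.comparingInt((Span s) -> s.offset));
        return result;
    }

    /** True if any span in the list starts within window characters of target. */
    private static boolean hasSpanNear(List<Span> spans, Span target, int window) {
        for (Span s : spans) {
            if (Math.abs(s.offset - target.offset) <= window) {
                return true;
            }
        }
        return false;
    }
}
```

Because the adjustment happens inside the query's own span method, the
consumer of the spans still sees only weighted character ranges and needs no
knowledge of the query type.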

> What I meant was: all other things being equal, do you more strongly
> favor a fragment that has all N of the terms in a query vs another
> fragment that has fewer than N but say higher net number of occurrences.

No, the diversity of the terms in a fragment isn't factored in.  The span 
objects only tell the Highlighter that a particular range of characters 
was important; they don't say why.

However, note that IDF would prevent a bunch of hits on "the" from causing too
hot a hotspot in the heat map.  So you're likely to see fragments with high
discriminatory value.
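
A rough sketch of the heat map idea follows.  It is illustrative only: the
class, the Span type, and the fixed-size sliding window are assumptions, and
the weights stand in for whatever IDF-derived value a real implementation
would attach to each span.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from Lucene or the patch. */
final class HeatMapSketch {

    /** A character range plus a weight (standing in for an IDF-derived score). */
    static final class Span {
        final int offset;
        final int length;
        final double weight;

        Span(int offset, int length, double weight) {
            this.offset = offset;
            this.length = length;
            this.weight = weight;
        }
    }

    /** Adds each span's weight to every character position the span covers. */
    static double[] buildHeatMap(int textLength, List<Span> spans) {
        double[] heat = new double[textLength];
        for (Span s : spans) {
            int end = Math.min(textLength, s.offset + s.length);
            for (int i = Math.max(0, s.offset); i < end; i++) {
                heat[i] += s.weight;
            }
        }
        return heat;
    }

    /** Start offset of the window with the greatest total heat (leftmost on ties). */
    static int hottestWindowStart(double[] heat, int windowSize) {
        int size = Math.min(windowSize, heat.length);
        if (size <= 0) {
            return 0;
        }
        double sum = 0;
        for (int i = 0; i < size; i++) {
            sum += heat[i];
        }
        double best = sum;
        int bestStart = 0;
        for (int start = 1; start + size <= heat.length; start++) {
            // Slide the window one character to the right.
            sum += heat[start + size - 1] - heat[start - 1];
            if (sum > best) {
                best = sum;
                bestStart = start;
            }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        List<Span> spans = new ArrayList<>();
        // Four hits on a common term near the start, each with a low weight.
        for (int offset = 0; offset < 16; offset += 4) {
            spans.add(new Span(offset, 3, 0.125));
        }
        // One hit on a rare term later in the text, with a high weight.
        spans.add(new Span(40, 5, 4.0));
        double[] heat = buildHeatMap(50, spans);
        System.out.println(hottestWindowStart(heat, 15));
    }
}
```

With these weights the hottest 15-character window starts at offset 30, the
leftmost window that covers the whole rare-term span.  If every span had
weight 1.0 instead, the window at offset 0 would win on the strength of the
four common-term hits, which is the effect IDF weighting guards against.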

> Google picks more than one fragment; it seems like it picks one or two
> fragments.

I probably overstated my opposition to supplying an excerpt containing more
than one fragment.  It seems OK to me to select more than one, so long as they
all scan easily, and so long as the excerpt doesn't get long enough to force
excessive scrolling and increase the time it takes the user to scan the whole
results page.

What bothers me is that the excerpts don't scan easily right now.  I consider
that a much more important defect than the fact that the fragdoc doesn't hit 
every term (which isn't even possible for large queries), and it seemed to me 
that pursuing exhaustive term matching was likely to yield even more highly 
fragmented, visually chaotic fragdocs.  

> another highlighter
> -------------------
>                 Key: LUCENE-1522
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
> I've written this highlighter for my project to support bi-gram token streams
> (general token streams, e.g. WhitespaceTokenizer, are also supported; see the
> test code in the patch). The idea was inherited from my previous project with
> my colleague and from LUCENE-644. This approach requires highlighted fields to
> be TermVector.WITH_POSITIONS_OFFSETS, but it is fast and can support N-grams.
> This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );  // assumes an IndexSearcher named "searcher"
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses the query boost to score fragments (currently ignores idf, but supporting it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
