lucene-dev mailing list archives

From "Marvin Humphrey (JIRA)" <>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Wed, 18 Mar 2009 14:27:50 GMT


Marvin Humphrey commented on LUCENE-1522:

> I think we may need a tree-structured result returned by the
> Weight/Scorer, compactly representing the "space" of valid fragdocs
> for this one doc. And then somehow we walk that tree,
> enumerating/scoring individual "valid" fragdocs that are created from
> that tree.

Something like that.  An array of span scores is too limited; a full-fledged
class would do better.  Designing that class requires striking a balance
between what information we think is useful and what information Highlighter
can sanely reduce.  By proposing the tree structure, you're suggesting that 
Highlighter will reverse engineer boolean matching; that sounds like a lot of 
work to me.  

>> However, note that IDF would prevent a bunch of hits on "the" from causing too
>> hot a hotspot in the heat map. So you're likely to see fragments with high
>> discriminatory value.
> This still seems subjectively wrong to me. If I search for "president
> bush", probably bush is the rarer term and so you would favor showing
> me a single fragment that had bush occur twice, over a fragment that
> had a single occurrence of president and bush?

We've ended up in a false dichotomy.  Favoring high IDF terms -- or more
accurately, high scoring character position spans -- and favoring fragments 
with high term diversity are not mutually exclusive.  
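
To make the point concrete, here is a minimal sketch in which a fragment's score is the sum of the idf of its hits, optionally multiplied by a bonus for each additional distinct term.  The class name, the diversityWeight parameter, and the document frequencies are all hypothetical; the idf formula follows the shape of Lucene's classic DefaultSimilarity, and none of this is code from KS or from the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: idf-weighted fragment scoring with an optional term-diversity bonus. */
final class FragmentScoreSketch {

    /** idf in the shape of Lucene's classic DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /**
     * Sums the idf of every hit in the fragment, then scales the sum by
     * (1 + diversityWeight * (distinctTerms - 1)).  A diversityWeight of 0
     * gives pure idf scoring; larger values favor fragments with more
     * distinct query terms.  (diversityWeight is an assumed knob.)
     */
    static double score(List<String> hits, Map<String, Integer> docFreq,
                        int numDocs, double diversityWeight) {
        double sum = 0.0;
        Set<String> distinct = new HashSet<>();
        for (String term : hits) {
            sum += idf(docFreq.getOrDefault(term, 0), numDocs);
            distinct.add(term);
        }
        if (distinct.isEmpty()) {
            return 0.0;
        }
        return sum * (1.0 + diversityWeight * (distinct.size() - 1));
    }

    public static void main(String[] args) {
        // Made-up document frequencies: "president" is common, "bush" is rarer.
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("president", 100);
        docFreq.put("bush", 10);
        int numDocs = 1000;

        List<String> bushBush = Arrays.asList("bush", "bush");
        List<String> presidentBush = Arrays.asList("president", "bush");

        for (double weight : new double[] {0.0, 0.5}) {
            System.out.println("diversityWeight=" + weight
                + "  bush+bush=" + score(bushBush, docFreq, numDocs, weight)
                + "  president+bush=" + score(presidentBush, docFreq, numDocs, weight));
        }
    }
}
```

With the weight at zero the two-"bush" fragment wins on idf alone; with a modest weight the "president bush" fragment overtakes it, while idf still decides between fragments of equal diversity.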

Still, the KS highlighter probably wouldn't do what you describe.  The proximity
boosting accelerates as the spans approach each other, and maxes out if 
they're adjacent.  So "bush bush" might be preferred over "president bush", 
but "bush or bush" probably wouldn't.
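
A rough sketch of that behavior, assuming a quadratic boost curve, a 10-character window, and made-up span scores (president = 1.0, bush = 2.0).  The names and the curve are illustrative assumptions, not the actual KS code.

```java
/** Hypothetical sketch of proximity boosting between two scored spans; not the KS implementation. */
final class ProximityHeatMap {

    /** A scored character-position span; end is exclusive. */
    static final class Span {
        final int start;
        final int end;
        final double score;

        Span(int start, int end, double score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /**
     * Returns a boost in [0, 1].  Spans separated by at most one character
     * (a single space) count as adjacent and get the maximum boost of 1.0.
     * The boost is the square of the remaining closeness, so it grows faster
     * as the spans approach each other, and is 0 at or beyond the window.
     */
    static double proximityBoost(Span a, Span b, int window) {
        Span first = a.start <= b.start ? a : b;
        Span second = first == a ? b : a;
        // Characters between the spans, not counting one separator.
        int gap = Math.max(0, second.start - first.end - 1);
        if (gap >= window) {
            return 0.0;
        }
        double closeness = 1.0 - (double) gap / window;
        return closeness * closeness;
    }

    /** Base score of both spans plus a proximity bonus proportional to that base. */
    static double pairScore(Span a, Span b, int window) {
        double base = a.score + b.score;
        return base + base * proximityBoost(a, b, window);
    }

    public static void main(String[] args) {
        int window = 10; // assumed window size in characters

        // "president bush"
        double presidentBush = pairScore(new Span(0, 9, 1.0), new Span(10, 14, 2.0), window);
        // "bush bush"
        double bushBush = pairScore(new Span(0, 4, 2.0), new Span(5, 9, 2.0), window);
        // "bush or bush"
        double bushOrBush = pairScore(new Span(0, 4, 2.0), new Span(8, 12, 2.0), window);

        System.out.println("president bush: " + presidentBush);
        System.out.println("bush bush:      " + bushBush);
        System.out.println("bush or bush:   " + bushOrBush);
    }
}
```

Under these assumed numbers, the adjacent "bush bush" outscores "president bush", but inserting a single word between the two hits drops "bush or bush" below it.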

I don't think that there's anything wrong with preferring high term diversity;
the KS highlighter doesn't happen to support favoring fragments with high term
diversity now, but would be improved by adding that capability.  I just don't
think term diversity is so important that it qualifies as a "base litmus

There are other ways of choosing good fragments, and IDF is one of them.  If
you want to show why a doc matched a query, it makes sense to show the section
of the document that contributed most to the score, surrounded by a little

> Which excerpts don't scan easily right now? Google's, KS's, Lucene's
> H1 or H2?

Lucene H1.  Too many ellipses, and fragments don't prefer to start on sentence

I have to qualify the assertion that the fragments don't scan well with the caveat 
that I'm basing this on a personal impression.  However, I'm pretty confident 
about that impression.  I would be stunned if there were not studies out there
demonstrating that sentence fragments which begin at the top are easier to
consume than sentence fragments which begin in the middle.
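
As an illustration of fragments that begin at the top of a sentence, here is a small sketch that snaps a fragment's start back to the nearest sentence boundary at or before a hit, using the JDK's java.text.BreakIterator.  The class and method names are hypothetical, and this is not how H1, H2, or KS choose fragment boundaries.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical sketch: start a fragment at the sentence boundary preceding a hit. */
final class SentenceAlignedFragment {

    /** Returns the offset of the start of the sentence containing hitOffset. */
    static int sentenceStart(String text, int hitOffset) {
        if (text.isEmpty()) {
            return 0;
        }
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        int clamped = Math.max(0, Math.min(hitOffset, text.length() - 1));
        // preceding(n) returns the last boundary strictly before n, so probing
        // at clamped + 1 yields the last boundary at or before the hit.
        int start = sentences.preceding(clamped + 1);
        return start == BreakIterator.DONE ? 0 : start;
    }

    /**
     * Builds a fragment of at most maxChars characters that starts at the top
     * of the hit's sentence.  If the sentence is so long that the hit would
     * fall outside the fragment, it falls back to starting at the hit itself.
     */
    static String fragment(String text, int hitOffset, int maxChars) {
        int start = sentenceStart(text, hitOffset);
        if (hitOffset - start >= maxChars) {
            start = Math.max(0, Math.min(hitOffset, text.length()));
        }
        int end = Math.min(text.length(), start + maxChars);
        String body = text.substring(start, end).trim();
        // Mark truncation at the tail only; the head is a sentence start.
        return end < text.length() ? body + " ..." : body;
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. It supports highlighting. Fragments matter.";
        int hit = text.indexOf("highlighting");
        System.out.println(fragment(text, hit, 100));
        System.out.println(fragment(text, hit, 20));
    }
}
```

Only the tail of a fragment needs an ellipsis in this scheme, which also cuts down on the number of ellipses in the excerpt.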

> another highlighter
> -------------------
>                 Key: LUCENE-1522
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
> I've written this highlighter for my project to support bi-gram token stream (general
> token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea
> was inherited from my previous project with my colleague and LUCENE-644. This approach needs
> highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams.
> This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> // assumes an IndexSearcher named "searcher"
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g.
>   (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boost to score fragments (currently ignores idf, but supporting it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.