lucene-dev mailing list archives

From "Mark Miller (JIRA)" <>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Mon, 16 Mar 2009 22:57:50 GMT


Mark Miller commented on LUCENE-1522:

{quote}Is the reason H1 creates the full token stream (even when
TermVectors is the source) that it needs to build the MemoryIndex?

If term vectors (w/ positions, offsets) were stored, wouldn't it be
possible to make a simple index (or at least TermDocs, TermPositions)
wrapped around those TermVectors? {quote}

It creates the full token stream because it was designed to work without term vectors, and
so without offset info for the query terms: it rebuilds the stream and processes a token at
a time, and the API gives you hooks to highlight at any of these tokens. That's essentially
the bottleneck, I think - taking everything a token at a time - but the whole API is based on
that fact. With the SpanScorer version, we can get almost any info from the MemoryIndex, but
it was convenient to fit into the current highlighter API to start. I had it in my mind to
break from the API and make a large-doc highlighter that didn't need term vectors, but in my
initial testing I found the MemoryIndex and getSpans to still be too slow. I'd hoped to work
more on it, but haven't had a chance. So essentially, while more can be done with term
vectors, the improvements break the current API at a pretty deep level - no one has done the
work to solve that, I guess - which is why we have the alternate highlighters.
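To make the term-vector point concrete: with TermVector.WITH_POSITIONS_OFFSETS stored, a highlighter can mark up the stored text directly from the saved offsets instead of re-running the analyzer token by token. A minimal, Lucene-free sketch of that idea - the Map stands in for what a term vector with offsets would return, and all names here are illustrative, not part of any Lucene API:

```java
import java.util.*;

public class OffsetHighlighter {
  // text: the stored field value.
  // offsets: term -> array of [start, end) character offsets, as a term
  // vector stored WITH_POSITIONS_OFFSETS would supply them.
  // queryTerms: the terms to mark up.
  static String highlight(String text, Map<String, int[][]> offsets,
                          Set<String> queryTerms) {
    // Collect the offsets of every matching term occurrence.
    List<int[]> matches = new ArrayList<int[]>();
    for (String term : queryTerms) {
      int[][] spans = offsets.get(term);
      if (spans != null) matches.addAll(Arrays.asList(spans));
    }
    // Splice tags in back-to-front so earlier offsets stay valid.
    Collections.sort(matches, new Comparator<int[]>() {
      public int compare(int[] a, int[] b) { return b[0] - a[0]; }
    });
    StringBuilder sb = new StringBuilder(text);
    for (int[] m : matches) {
      sb.insert(m[1], "</b>");
      sb.insert(m[0], "<b>");
    }
    return sb.toString();
  }
}
```

The work here is proportional to the number of query-term hits, not the length of the document - which is why the token-at-a-time API is the bottleneck for large docs.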

> another highlighter
> -------------------
>                 Key: LUCENE-1522
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
> I've written this highlighter for my project to support bi-gram token streams (general
token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch).
The idea was inherited from my previous project with my colleague and from LUCENE-644. This
approach requires highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but it is fast
and can support N-grams. It depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token streams, but also "fixed size" N-grams (e.g.
(2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - takes query boost into account when scoring fragments (currently doesn't use idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers
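The {{q="w1 w2"~1}} example in the description can be sketched as position matching over the terms' position lists - the positions a term vector stored with positions would record. A self-contained, illustrative sketch (not the patch's actual implementation):

```java
import java.util.*;

public class SloppyPhraseSpans {
  // w1Positions/w2Positions: token positions where each phrase term occurs.
  // Returns [startPos, endPos] spans where w2 follows w1 with at most
  // `slop` intervening tokens, e.g. "w1 w2"~1 tolerates one token between.
  static List<int[]> spans(int[] w1Positions, int[] w2Positions, int slop) {
    List<int[]> result = new ArrayList<int[]>();
    for (int p1 : w1Positions) {
      for (int p2 : w2Positions) {
        int gap = p2 - p1 - 1;  // tokens between the two terms
        if (gap >= 0 && gap <= slop) {
          result.add(new int[]{p1, p2});
        }
      }
    }
    return result;
  }
}
```

For the tokens {{w1 w3 w2 w3 w1 w2}} (w1 at positions 0 and 4, w2 at 2 and 5), slop 1 yields the spans [0,2] and [4,5], matching the two highlighted regions in the example above.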

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
