lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Tue, 17 Mar 2009 09:30:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682609#action_12682609 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
> It'd be sort of like a positional-aware "explain", ie "show me the term
> occurrences that allowed the full query to accept this document".

FWIW, this is more or less how the KinoSearch highlighter now works in svn
trunk. It doesn't use a Scorer, though, but instead the KS analogue to
Lucene's "Weight" class.

The Weight is fed what is essentially a single doc index, using stored term
vectors. Weight.highlightSpans() returns an array of "span" objects, each of 
which has a start offset, a length, and a score. The Highlighter then 
processes these span objects to create a "heat map" and choose its excerpt 
points.

The idea is that by delegating responsibility for creating the scoring spans, we
make it easier to support arbitrary Query implementations with a single
Highlighter class.
{quote}

Awesome!

Do you require term vectors to be stored for highlighting (ie, you
cannot re-analyze the text instead)?

For queries that normally do not use positions at all (simple AND/OR
of terms), how does your highlightSpans() work?

For BooleanQuery, is coord factor used to favor fragment sets that
include more unique terms?

Are you guaranteed to always present a net set of fragments that
"matches" the query? (eg the example query above).

I think the base litmus test for a highlighter is: if one were to
take all fragments presented for a document (call this a "fragdoc")
and make a new document from it, would that document match the
original query?

In fact, I think the perfect highlighter would "logically" work as
follows: take a single document and enumerate every single possible
fragdoc.  Each fragdoc is allowed to have maxNumFragments fragments,
where each fragment has a min/max number of characters.  The set of
fragdocs is of course ridiculously immense.

Take this massive collection of fragdocs and build a new temporary
index, then run your Query against that index.  Many of the fragdocs
would not match the Query, so they are eliminated right off (this is
the litmus test).  Then, of the ones that do, you want the highest
scoring fragdocs.

Obviously you can't actually implement a highlighter like that, but I
think "logically" that is the optimal highlighter that we are trying
to emulate with more efficient implementations.
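
To make the "logical" highlighter concrete, here is a toy brute-force
sketch. It is not Lucene code and it simplifies heavily: the class and
method names are made up for illustration, fragments are fixed-size
token windows rather than min/max character ranges, the query is a
plain AND of terms, and the score is just a count of query-term
occurrences. It enumerates every fragdoc, drops the ones that fail the
litmus test, and keeps the highest scoring survivor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of the "enumerate every fragdoc" idea; not a Lucene API.
class FragDocSketch {

    /** Litmus test for a toy AND query: the fragdoc must contain every required term. */
    static boolean matches(List<String> fragdoc, Set<String> requiredTerms) {
        return new HashSet<String>(fragdoc).containsAll(requiredTerms);
    }

    /** Toy score (an assumption, not Lucene scoring): count of query-term occurrences. */
    static int score(List<String> fragdoc, Set<String> requiredTerms) {
        int hits = 0;
        for (String token : fragdoc) {
            if (requiredTerms.contains(token)) hits++;
        }
        return hits;
    }

    /** Returns the best-scoring fragdoc that passes the litmus test, or null if none does. */
    static List<String> bestFragdoc(String[] tokens, Set<String> requiredTerms,
                                    int fragTokens, int maxNumFragments) {
        List<List<String>> candidates = new ArrayList<List<String>>();
        enumerate(tokens, fragTokens, maxNumFragments, 0, new ArrayList<String>(), candidates);
        List<String> best = null;
        int bestScore = -1;
        for (List<String> fragdoc : candidates) {
            if (!matches(fragdoc, requiredTerms)) continue; // eliminated by the litmus test
            int s = score(fragdoc, requiredTerms);
            if (s > bestScore) { // ties keep the earliest candidate
                bestScore = s;
                best = fragdoc;
            }
        }
        return best;
    }

    /** Collects every fragdoc made of 1..remaining non-overlapping windows starting at or after 'from'. */
    private static void enumerate(String[] tokens, int fragTokens, int remaining, int from,
                                  List<String> current, List<List<String>> out) {
        if (!current.isEmpty()) out.add(new ArrayList<String>(current));
        if (remaining == 0) return;
        for (int start = from; start + fragTokens <= tokens.length; start++) {
            int sizeBefore = current.size();
            current.addAll(Arrays.asList(tokens).subList(start, start + fragTokens));
            enumerate(tokens, fragTokens, remaining - 1, start + fragTokens, current, out);
            current.subList(sizeBefore, current.size()).clear(); // backtrack
        }
    }

    public static void main(String[] args) {
        String[] doc = "w1 w3 w3 w3 w2 w3 w3 w1 w2".split(" ");
        Set<String> query = new HashSet<String>(Arrays.asList("w1", "w2"));
        System.out.println(bestFragdoc(doc, query, 2, 2));
    }
}
```

Even in this toy form the candidate count grows combinatorially with
document length and maxNumFragments, which is why a real
implementation can only approximate it.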

I think having the Query/Weight/Scorer class be the single-source for
hits, explanation & highlight spans is the right approach.  Having a
whole separate package trying to reverse-engineer where matches had
taken place between Query and Document is hard to get right.  EG
BooleanScorer2's coord factor would naturally/correctly influence the
selection.

I also think building a [reduced, just Postings] IndexReader API on top of
TermVectors ought to be a simple way to get great performance here.
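
As a rough illustration of what a postings-only view over one
document's term vector could look like, here is a self-contained
sketch. Everything in it is hypothetical (class name, constructor,
methods) and it does not use the actual Lucene term vector classes; it
assumes the vector is available as parallel arrays of terms and sorted
positions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-document postings view; not a Lucene API.
class TermVectorPostingsSketch {
    private final Map<String, int[]> positionsByTerm = new HashMap<String, int[]>();

    /** terms[i] occurs at positions[i], which is assumed to be sorted ascending. */
    TermVectorPostingsSketch(String[] terms, int[][] positions) {
        for (int i = 0; i < terms.length; i++) {
            positionsByTerm.put(terms[i], positions[i]);
        }
    }

    /** Positions of a term in this document, or an empty array if the term is absent. */
    int[] positions(String term) {
        int[] p = positionsByTerm.get(term);
        return p == null ? new int[0] : p;
    }

    /** Position of the first place where 'first' is immediately followed by 'second', or -1. */
    int firstExactPhrase(String first, String second) {
        int[] a = positions(first);
        int[] b = positions(second);
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            int target = a[i] + 1;
            if (b[j] == target) return a[i];
            if (b[j] < target) j++; else i++;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Term vector for the text "w1 w3 w2 w3 w1 w2".
        TermVectorPostingsSketch tv = new TermVectorPostingsSketch(
            new String[] {"w1", "w2", "w3"},
            new int[][] {{0, 4}, {2, 5}, {1, 3}});
        System.out.println(tv.firstExactPhrase("w1", "w2"));
    }
}
```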


> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general
> token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea
> was inherited from my previous project with my colleague and LUCENE-644. This approach needs
> highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams.
> This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boost to score fragments (currently ignores idf, but supporting it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



