lucene-dev mailing list archives

From "Marvin Humphrey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Tue, 17 Mar 2009 13:32:51 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682672#action_12682672 ]

Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

> Do you require term vectors to be stored, for highlighting (cannot
> re-analyze the text)?

Yes, but that's not fundamental to the design.  You just have to hand the
Weight some sort of single-doc index that includes sufficient data to
determine what parts of the text contributed to the hit and how much they
contributed.  The Weight needn't care whether that single-doc index was
created on the fly or stored at index time.

> For queries that normally do not use positions at all (simple AND/OR
> of terms), how does your highlightSpans() work?

ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
spans produced by their children.
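
A minimal Java sketch of that union, for illustration only. The Span,
SpanSource, and UnionSpans names are hypothetical stand-ins, not KinoSearch or
Lucene classes, and sorting the result by offset is an assumption made for the
convenience of later stages.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical span: a character range plus the weight it contributed to the hit.
final class Span {
    final int offset;
    final int length;
    final float weight;

    Span(int offset, int length, float weight) {
        this.offset = offset;
        this.length = length;
        this.weight = weight;
    }
}

// Hypothetical stand-in for a compiled query node that can report its spans.
interface SpanSource {
    List<Span> highlightSpans();
}

// Boolean-style parent: it has no positional logic of its own, so it simply
// gathers whatever spans its children produced.
final class UnionSpans implements SpanSource {
    private final List<SpanSource> children;

    UnionSpans(List<SpanSource> children) {
        this.children = children;
    }

    @Override
    public List<Span> highlightSpans() {
        List<Span> all = new ArrayList<>();
        for (SpanSource child : children) {
            all.addAll(child.highlightSpans());
        }
        // Assumption: downstream code prefers spans ordered by position.
        all.sort(Comparator.comparingInt(s -> s.offset));
        return all;
    }
}
```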

> For BooleanQuery, is coord factor used to favor fragment sets that
> include more unique terms?

No; I don't think that would be fine grained enough to help.

There's a HeatMap class that performs additional weighting.  Spans that
cluster together tightly (i.e. that could fit together within the excerpt) are
boosted.
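
One way such a proximity boost might look, sketched in Java. This is not the
actual HeatMap implementation: the ProximityBoostSketch class, the pairwise
rule, and the NEIGHBOR_SHARE constant are all assumptions chosen only to show
the idea that spans able to share one excerpt reinforce each other.

```java
// Hypothetical sketch: spans are given as parallel arrays of start offsets,
// end offsets, and base scores.
final class ProximityBoostSketch {
    // Assumed fraction of a neighbor's score that is added as a bonus.
    static final float NEIGHBOR_SHARE = 0.5f;

    static float[] boost(int[] starts, int[] ends, float[] scores, int excerptLength) {
        float[] boosted = scores.clone();
        for (int i = 0; i < starts.length; i++) {
            for (int j = i + 1; j < starts.length; j++) {
                // Total extent covered if both spans appear in the same excerpt.
                int extent = Math.max(ends[i], ends[j]) - Math.min(starts[i], starts[j]);
                if (extent <= excerptLength) {
                    // The pair fits in one excerpt, so each boosts the other.
                    boosted[i] += NEIGHBOR_SHARE * scores[j];
                    boosted[j] += NEIGHBOR_SHARE * scores[i];
                }
            }
        }
        return boosted;
    }
}
```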

> Are you guaranteed to always present a net set of fragments that
> "matches" the query? (eg the example query above).

No.  The KS version supplies a single fragment.  It naturally prefers
fragments with rarer terms, because the span scores are multiplied by the
Weight's weighting factor (which includes IDF).  

Once that fragment is selected, the KS highlighter worries a lot about
trimming to sensible sentence boundaries.
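
A naive Java sketch of the "start at the top of a sentence" idea. The
SentenceSnapSketch class, the maxBacktrack parameter, and the rule that a
sentence starts after '.', '!' or '?' plus whitespace are all assumptions made
for illustration; the real KS trimming logic is not shown in this message and
is presumably more careful.

```java
final class SentenceSnapSketch {
    // Moves a raw excerpt start back to the nearest preceding sentence start,
    // looking at most maxBacktrack characters behind it.
    static int snapToSentenceStart(String text, int start, int maxBacktrack) {
        int floor = Math.max(0, start - maxBacktrack);
        for (int i = start; i > floor; i--) {
            // Assumed rule: position i starts a sentence when it follows
            // whitespace that itself follows a sentence terminator.
            if (i >= 2
                    && Character.isWhitespace(text.charAt(i - 1))
                    && isTerminator(text.charAt(i - 2))) {
                return i;
            }
        }
        // The beginning of the text counts as a sentence start; otherwise give
        // up and keep the raw offset.
        return floor == 0 ? 0 : start;
    }

    private static boolean isTerminator(char c) {
        return c == '.' || c == '!' || c == '?';
    }
}
```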

In my own subjective judgment, supplying a single maximally coherent fragment
which prefers clusters of rare terms results in an excerpt which "scans" as
quickly as possible, conveying the gist of the content with minimal "visual
effort".  I used Google's excerpting as a model.

> I think the base litmus test for a highlighter is: if one were to
> take all fragments presented for a document (call this a "fragdoc")
> and make a new document from it, would that document match the
> original query?

Without the aid of formal studies to guide us, this is a subjective call.
FWIW, I disagree.  In my view, visual scanning speed and coherence
are more important than completeness.  

I'm not a big fan of the multi-fragment approach, because I think it takes too
much effort to grok each individual entry.  Furthermore, the fact that the
fragments don't start on sentence boundaries (whenever feasible) adds to the
visual effort needed to orient yourself.

Search results contain a lot of junk.  The user needs to be able to parse the
results page as quickly as possible and refine their search query as needed.
Noisy excerpts, with lots of ellipses and few sentences that can be "swallowed
whole", impede that.  Trees vs. Forest.

Again, that's my own aesthetic judgment, but I'll wager that there are studies
out there showing that fragments which start at the top of a sentence are
easier to consume, and I think that's important.

> In fact, I think the perfect highlighter would "logically" work as
> follows: take a single document and enumerate every single possible
> fragdoc. 

KS uses a sliding window rather than chunking up the text into fragdocs of
fixed length.
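
A sliding window over scored spans could be sketched in Java as follows. The
SlidingWindowSketch class is hypothetical, and two choices in it are
assumptions rather than KS behavior: candidate windows begin at span starts,
and only spans lying entirely inside a window count toward its score.

```java
final class SlidingWindowSketch {
    // Spans are parallel arrays sorted by start offset. Returns the start
    // offset of the highest-scoring window, or 0 when there are no spans.
    static int bestWindowStart(int[] starts, int[] ends, float[] scores, int window) {
        int best = 0;
        float bestSum = -1f;
        for (int i = 0; i < starts.length; i++) {
            int limit = starts[i] + window;
            float sum = 0f;
            // Sum every span that fits completely inside [starts[i], limit].
            for (int j = i; j < starts.length && starts[j] < limit; j++) {
                if (ends[j] <= limit) {
                    sum += scores[j];
                }
            }
            if (sum > bestSum) {
                bestSum = sum;
                best = starts[i];
            }
        }
        return best;
    }
}
```

Because the window can begin at any span, a cluster that straddles what would
have been a fixed chunk boundary is still scored as a unit.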

> Having a whole separate package trying to reverse-engineer where matches had
> taken place between Query and Document is hard to get right.

Exactly.

PS: Obviously, refinements of the highlighting algo will help Lucy, too. I
don't suppose you want to continue this on the Lucy dev list so that Lucy
banks some community credit for this discussion.  :\

> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token streams (general
> token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch).
> The idea was inherited from my previous project with my colleague and from LUCENE-644. This
> approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but it is fast and
> can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token streams, but also "fixed size" N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks at query boost to score fragments (currently ignores idf, but supporting it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers


