lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1522) another highlighter
Date Tue, 17 Mar 2009 15:04:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682689#action_12682689 ]

Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
> Do you require term vectors to be stored, for highlighting (cannot
> re-analyze the text)?

Yes, but that's not fundamental to the design. You just have to hand the
Weight some sort of single-doc index that includes sufficient data to
determine what parts of the text contributed to the hit and how much they
contributed. The Weight needn't care whether that single-doc index was
created on the fly or stored at index time.
{quote}

OK.

{quote}
> For queries that normally do not use positions at all (simple AND/OR
> of terms), how does your highlightSpans() work?

ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
spans produced by their children.
{quote}

Hmm -- it seems like that loses information.  I.e., for ANDQuery, you
lose the fact that you should try to include a match from each of the
sub-clauses' spans.

{quote}
> For BooleanQuery, is coord factor used to favor fragment sets that
> include more unique terms?

No; I don't think that would be fine grained enough to help.
{quote}

What I meant was: all other things being equal, do you more strongly
favor a fragment that has all N of the terms in a query over another
fragment that has fewer than N but, say, a higher net number of
occurrences?
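
To make the question concrete, here is a minimal Java sketch of a coord-like preference: the fragment that matches more distinct query terms wins, and the total number of occurrences only breaks ties. The class and method names (CoordPreferenceSketch, prefer) are hypothetical and are not part of Lucene or KS; fragments are assumed to be already tokenized.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not an existing Lucene or KS class.
class CoordPreferenceSketch {
    // Number of distinct query terms that occur in the fragment.
    static int distinctMatches(List<String> fragmentTokens, Set<String> queryTerms) {
        Set<String> seen = new HashSet<>(fragmentTokens);
        seen.retainAll(queryTerms);
        return seen.size();
    }

    // Total number of query-term occurrences in the fragment.
    static int totalMatches(List<String> fragmentTokens, Set<String> queryTerms) {
        int n = 0;
        for (String token : fragmentTokens) {
            if (queryTerms.contains(token)) n++;
        }
        return n;
    }

    // True when fragment a should be preferred over fragment b:
    // distinct terms decide first, raw occurrence count breaks ties.
    static boolean prefer(List<String> a, List<String> b, Set<String> queryTerms) {
        int da = distinctMatches(a, queryTerms);
        int db = distinctMatches(b, queryTerms);
        if (da != db) return da > db;
        return totalMatches(a, queryTerms) > totalMatches(b, queryTerms);
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("python", "greenspan"));
        List<String> allTerms = Arrays.asList("python", "and", "greenspan");
        List<String> oneTermOften = Arrays.asList("python", "python", "python");
        System.out.println(prefer(allTerms, oneTermOften, query));
    }
}
```

A multiplicative coord factor, as in BooleanQuery scoring, would be a softer variant of the same idea; the strict ordering here is just the simplest way to show "all other things being equal".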

{quote}
There's a HeatMap class that performs additional weighting. Spans that
cluster together tightly (i.e. that could fit together within the excerpt) are
boosted.
{quote}

That sounds great.
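
A rough Java sketch of what such proximity boosting could look like. This is not the KS HeatMap implementation: the linear decay, the pairwise bonus, and all names (ProximityBoostSketch, boost, excerptLen) are assumptions made for illustration only.

```java
// Hypothetical sketch of proximity boosting; not the KS HeatMap class.
class ProximityBoostSketch {
    // Spans are given as parallel arrays and assumed sorted by start offset.
    // excerptLen must be positive.
    static double[] boost(int[] starts, int[] ends, double[] scores, int excerptLen) {
        double[] boosted = scores.clone();
        for (int i = 0; i < starts.length; i++) {
            for (int j = i + 1; j < starts.length; j++) {
                // Skip pairs that could not fit together in one excerpt.
                if (ends[j] - starts[i] > excerptLen) continue;
                // Assumed decay: the closer the spans, the larger the bonus.
                int gap = Math.max(0, starts[j] - ends[i]);
                double factor = 1.0 - (double) gap / excerptLen;
                boosted[i] += factor * scores[j];
                boosted[j] += factor * scores[i];
            }
        }
        return boosted;
    }

    public static void main(String[] args) {
        int[] starts = {0, 10, 500};
        int[] ends = {5, 15, 505};
        double[] scores = {1.0, 1.0, 1.0};
        double[] boosted = boost(starts, ends, scores, 100);
        for (double b : boosted) System.out.println(b);
    }
}
```

In this sketch the two spans near the start of the text reinforce each other, while the isolated span at offset 500 keeps its original score.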

{quote}
> Are you guaranteed to always present a net set of fragments that
> "matches" the query? (eg the example query above).

No. The KS version supplies a single fragment. It naturally prefers
fragments with rarer terms, because the span scores are multiplied by the
Weight's weighting factor (which includes IDF).
{quote}

Hmm OK.

{quote}
Once that fragment is selected, the KS highlighter worries a lot about
trimming to sensible sentence boundaries.
{quote}

I totally agree: easy/fast consumability is very important, so
choosing entire sentences, or at least anchoring the start or maybe
end on a sentence boundary, is important.  Lucene's H1 doesn't do this
out of the box today, I think (though you could plug in your own fragmenter).
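
As an illustration of anchoring a fragment's start on a sentence boundary, here is a small Java sketch that uses only the JDK's java.text.BreakIterator. It is not taken from the KS highlighter or from Lucene's H1; the class and method names (SentenceAnchorSketch, snapStartToSentence) are hypothetical, and English text is assumed.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch; uses only the JDK, not any Lucene fragmenter API.
class SentenceAnchorSketch {
    // Moves a match offset back to the start of the sentence that contains it.
    static int snapStartToSentence(String text, int matchStart) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        if (sentences.isBoundary(matchStart)) {
            return matchStart; // already at the top of a sentence
        }
        int boundary = sentences.preceding(matchStart);
        return boundary == BreakIterator.DONE ? 0 : boundary;
    }

    public static void main(String[] args) {
        String text = "First sentence here. Second sentence mentions lucene. Third one.";
        int hit = text.indexOf("lucene");
        int start = snapStartToSentence(text, hit);
        System.out.println(text.substring(start));
    }
}
```

A real fragmenter would also have to cap the fragment length, since a very long sentence may not fit in the excerpt.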

{quote}
In my own subjective judgment, supplying a single maximally coherent fragment
which prefers clusters of rare terms results in an excerpt which "scans" as
quickly as possible, conveying the gist of the content with minimal "visual
effort". I used Google's excerpting as a model.
{quote}

Google sometimes picks more than one fragment; it seems to pick one
or two fragments.

I'm torn on whether IDF should really come into play though...

{quote}
> I think the base litmus test for a highlighter is: if one were to
> take all fragments presented for a document (call this a "fragdoc")
> and make a new document from it, would that document match the
> original query?

Without the aid of formal studies to guide us, this is a subjective call.
FWIW, I disagree. In my view, visual scanning speed and coherence
are more important than completeness.

I'm not a big fan of the multi-fragment approach, because I think it takes too
much effort to grok each individual entry. Furthermore, the fact that the
fragments don't start on sentence boundaries (whenever feasible) adds to the
visual effort needed to orient yourself.

Search results contain a lot of junk. The user needs to be able to parse the
results page as quickly as possible and refine their search query as needed.
Noisy excerpts, with lots of ellipses and few sentences that can be "swallowed
whole", impede that. Trees vs. Forest.

Again, that's my own aesthetic judgment, but I'll wager that there are studies
out there showing that fragments which start at the top of a sentence are
easier to consume, and I think that's important.
{quote}

I agree, it's not cut and dried here; this is all quite subjective.

I think one case that's tricky is two terms that do not tend to
co-occur in proximity.  E.g., search for python greenspan on Google, and
most of the fragdocs consist of two fragments, one for each term.  I.e.,
Google is trying to include all the terms in the fragdoc (my "coord
factor" question above).
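
The "fragdoc" litmus test quoted above can be sketched in Java for the simplest case, a pure conjunction of terms: glue the fragments together and check that every required term still occurs. This is a simplification under stated assumptions (lowercasing plus a split on non-word characters stands in for real analysis, and phrase or optional clauses are ignored); FragdocLitmusSketch and fragdocMatchesAll are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; handles only an AND of plain terms.
class FragdocLitmusSketch {
    // True when the fragdoc (all fragments together) contains every required term.
    static boolean fragdocMatchesAll(List<String> fragments, Set<String> requiredTerms) {
        Set<String> missing = new HashSet<>(requiredTerms);
        for (String fragment : fragments) {
            // Crude stand-in for an analyzer: lowercase and split on non-word chars.
            for (String token : fragment.toLowerCase(Locale.ROOT).split("\\W+")) {
                missing.remove(token);
            }
        }
        return missing.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("python", "greenspan"));
        List<String> twoFragments = Arrays.asList("Python is a language", "said Alan Greenspan");
        List<String> oneFragment = Arrays.asList("Python is a language");
        System.out.println(fragdocMatchesAll(twoFragments, query));
        System.out.println(fragdocMatchesAll(oneFragment, query));
    }
}
```

For the python greenspan example, a single fragment around one term fails this check, while a two-fragment fragdoc with one fragment per term passes it.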

{quote}
> In fact, I think the perfect highlighter would "logically" work as
> follows: take a single document and enumerate every single possible
> fragdoc.

KS uses a sliding window rather than chunking up the text into fragdocs of
fixed length.
{quote}

Or, the allowed length of each fragment could span a specified min/max
range.

And I like the sliding window approach instead of the pre-fragment
approach.
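
A minimal Java sketch of the sliding-window idea, assuming hit offsets and per-hit weights are already known (for example from term vectors). It is not the KS algorithm: it only tries windows that start at a hit and uses a single maximum length, so a min/max range would need an extra loop over lengths. SlidingWindowSketch and bestWindowStart are hypothetical names.

```java
// Hypothetical sketch of window selection; not the KS implementation.
class SlidingWindowSketch {
    // hits[i] = {startOffset, endOffset}, sorted by startOffset;
    // weights[i] is the score of hit i. Returns the best window start offset.
    static int bestWindowStart(int[][] hits, double[] weights, int windowLen) {
        int bestStart = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < hits.length; i++) {
            int start = hits[i][0];
            int limit = start + windowLen;
            double score = 0.0;
            // Sum the weights of all hits that lie completely inside the window.
            for (int j = i; j < hits.length && hits[j][0] < limit; j++) {
                if (hits[j][1] <= limit) score += weights[j];
            }
            // Strict comparison keeps the earliest window on ties.
            if (score > bestScore) {
                bestScore = score;
                bestStart = start;
            }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        int[][] hits = {{0, 5}, {200, 206}, {210, 219}, {230, 236}};
        double[] weights = {1.0, 1.0, 1.0, 1.0};
        System.out.println(bestWindowStart(hits, weights, 100));
    }
}
```

Because the window is positioned by the hits rather than by fixed chunks, a cluster of hits is never split across two pre-cut fragments.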

(Note: a fragdoc is one or more fragments stuck together, ie, the
entire excerpt.)


> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support a bi-gram token stream (a general
> token stream, e.g. WhitespaceTokenizer, is also supported; see the test code in the patch). The idea
> was inherited from my previous project with my colleague and LUCENE-644. This approach needs
> highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams.
> This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks at query boost to score fragments (currently doesn't look at idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers


