lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Only highlight terms that caused a search hit/match
Date Sat, 15 Feb 2014 10:53:39 GMT
Unfortunately, all of Lucene's highlighters are "approximate" in this
regard: there is no guarantee that the shown snippets, if they were a
single little document, would have matched the query.

Even the newest highlighter, PostingsHighlighter, doesn't look at
positions, so e.g. a PhraseQuery highlight could be "wrong", though
"typically" the snippets with all terms from the phrase will score
higher and be more likely to be picked in practice.
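
To make the PhraseQuery case concrete, here is a minimal sketch in
plain Java. It is a toy model and not Lucene API: the class name
PhrasePositionSketch and all of its methods are hypothetical, and it
assumes simple whitespace tokenization. It contrasts marking every
occurrence of a phrase's terms with marking only the positions where
the phrase actually occurs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.StringJoiner;
import java.util.TreeSet;

// Hypothetical toy model (not Lucene API): term-only vs. position-aware
// marking of a phrase over whitespace-separated tokens.
final class PhrasePositionSketch {

    // Start positions at which the phrase occurs as consecutive tokens.
    static List<Integer> phraseStarts(String[] tokens, String[] phrase) {
        List<Integer> starts = new ArrayList<>();
        if (phrase.length == 0) return starts;
        for (int i = 0; i + phrase.length <= tokens.length; i++) {
            boolean ok = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!tokens[i + j].equals(phrase[j])) { ok = false; break; }
            }
            if (ok) starts.add(i);
        }
        return starts;
    }

    // Approximate: marks every token equal to any term of the phrase.
    static Set<Integer> termOnly(String[] tokens, String[] phrase) {
        Set<Integer> marked = new TreeSet<>();
        for (int i = 0; i < tokens.length; i++) {
            for (String p : phrase) {
                if (tokens[i].equals(p)) marked.add(i);
            }
        }
        return marked;
    }

    // Position-aware: marks only tokens inside a real phrase occurrence.
    static Set<Integer> positional(String[] tokens, String[] phrase) {
        Set<Integer> marked = new TreeSet<>();
        for (int start : phraseStarts(tokens, phrase)) {
            for (int j = 0; j < phrase.length; j++) marked.add(start + j);
        }
        return marked;
    }

    // Wraps the marked token positions in <b>...</b>.
    static String render(String[] tokens, Set<Integer> marked) {
        StringJoiner out = new StringJoiner(" ");
        for (int i = 0; i < tokens.length; i++) {
            out.add(marked.contains(i) ? "<b>" + tokens[i] + "</b>" : tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tokens = "y z x y z a".split(" ");
        String[] phrase = {"x", "y", "z"};
        // <b>y</b> <b>z</b> <b>x</b> <b>y</b> <b>z</b> a
        System.out.println(render(tokens, termOnly(tokens, phrase)));
        // y z <b>x</b> <b>y</b> <b>z</b> a
        System.out.println(render(tokens, positional(tokens, phrase)));
    }
}
```

The leading "y z" is marked in the term-only output even though it is
not part of any occurrence of the phrase, which is the kind of "wrong"
highlight meant above.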

Net/net I think a "precise highlighter" would be a nice addition to
Lucene, but it is a challenge because you need to turn every leaf
query into a positional query, even queries like TermQuery that
normally don't touch positions, and then you need to follow the query
tree while you highlight so that in your first example a OR (b AND z),
having picked a snippet or two for a, you then also go and pick a
snippet or two for the b AND z clause, and then present them both
together.
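
As a rough illustration of following the query tree, the sketch below
evaluates a small boolean tree against one document and keeps only the
terms from clauses that really matched. It is plain Java, not Lucene
code: PreciseTermsSketch, Node, term, and, or, and highlight are all
hypothetical names, tokens are split on whitespace, and positions are
ignored entirely, so it covers only the boolean part of the problem.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical toy model (not Lucene API): each node reports the terms that
// contributed to its match, or null when the node did not match at all.
final class PreciseTermsSketch {

    interface Node {
        Set<String> matched(Set<String> tokens);
    }

    static Node term(String text) {
        return tokens -> {
            if (!tokens.contains(text)) return null;
            Set<String> hit = new LinkedHashSet<>();
            hit.add(text);
            return hit;
        };
    }

    // AND: every clause must match; a single miss prunes the whole branch.
    static Node and(Node... clauses) {
        return tokens -> {
            Set<String> all = new LinkedHashSet<>();
            for (Node clause : clauses) {
                Set<String> m = clause.matched(tokens);
                if (m == null) return null;
                all.addAll(m);
            }
            return all;
        };
    }

    // OR: collect terms only from the clauses that matched.
    static Node or(Node... clauses) {
        return tokens -> {
            Set<String> any = new LinkedHashSet<>();
            boolean matched = false;
            for (Node clause : clauses) {
                Set<String> m = clause.matched(tokens);
                if (m != null) {
                    matched = true;
                    any.addAll(m);
                }
            }
            return matched ? any : null;
        };
    }

    // Wraps only the terms that survived the tree walk in <b>...</b>.
    static String highlight(Node query, String text) {
        String[] tokens = text.split("\\s+");
        Set<String> hits = query.matched(new LinkedHashSet<>(Arrays.asList(tokens)));
        StringJoiner out = new StringJoiner(" ");
        for (String t : tokens) {
            out.add(hits != null && hits.contains(t) ? "<b>" + t + "</b>" : t);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node query = or(term("a"), and(term("b"), term("z")));
        // <b>a</b> b c
        System.out.println(highlight(query, "a b c"));
    }
}
```

Because b AND z fails as a whole, b is never reported, which matches
the expected result for the first example in the quoted message. A real
implementation would also have to carry positions and offsets up the
tree, which is where the difficulty lies.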

It's a hard problem but it would make a great addition.


Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 14, 2014 at 7:05 PM, Steve Davids <sdavids@gmail.com> wrote:
> Hello,
>
> I have recently been given a requirement to improve document
> highlights within our system. Unfortunately, the current
> functionality makes more of a best guess at which terms to highlight,
> rather than highlighting the terms that actually produced the match.
> A couple of examples of issues that were found:
>
> Nested boolean clause with a term that doesn't exist ANDed with a
> term that does highlights the ignored term in the query:
> Text: a b c
> Logical Query: a OR (b AND z)
> Result: <b>a</b> <b>b</b> c
> Expected: <b>a</b> b c
>
> Nested span query doesn't maintain the proper positions and offsets:
> Text: y z x y z a
> Logical Query: ("x y z", a) span near 10
> Result: <b>y</b> <b>z</b> <b>x</b> <b>y</b> <b>z</b> <b>a</b>
> Expected: y z <b>x</b> <b>y</b> <b>z</b> <b>a</b>
>
> I am currently using the Highlighter with a QueryScorer and a
> SimpleSpanFragmenter. While looking through the code, it looks like
> the entire query structure is dropped in the
> WeightedSpanTermExtractor by just grabbing any positive TermQuery and
> flattening them all into a simple Map, which is then passed on to
> highlight all of those terms. I believe this oversimplification of
> term extraction is the crux of the issue and needs to be modified in
> order to produce more "exact" highlights.
>
> I was brainstorming with a colleague and thought perhaps we can spin
> up a MemoryIndex to index that one document and start performing a
> depth-first search of all queries within the overall Lucene query
> graph. At that point we can start querying the MemoryIndex for leaf
> queries and start walking back up the tree, pruning branches that
> don't result in a search hit which results in a map of actual matched
> query terms. This approach seems pretty painful but will hopefully
> produce better matches. I would like to see what the experts on the
> mailing list would have to say about this approach or is there a
> better way to retrieve the query terms & positions that produced the
> match? Or perhaps there is a different Highlighter implementation
> that should be used, though our user queries are extremely complex
> with a lot of nested queries of various types.
>
> Thanks,
>
> -Steve


