lucene-java-user mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Thinking about better highlighting
Date Thu, 25 Aug 2005 07:35:55 GMT
On Thursday 25 August 2005 04:47, Fred Toth wrote:
> All,
> 
> First, my thanks to those who've contributed to the current
> best practices for highlighting. We use your code!
> 
> However, after reviewing recent discussions about highlighting,
> and struggling with our own highlighting issues, I'm wondering if
> there's a better way.
> 
> Others have certainly thought more about this (but I've thought
> about it a lot).
> 
> Isn't it true that the fundamental problem is that all of the highlighting
> approaches are struggling with trying to recreate what the lucene core
> has already done at search time?

Because storing those results takes memory, and most of these results
would not be needed later on. In Lucene only the score is kept during
search, and then only when it is high enough.
One could extend the search core to keep only the highlighting info of
these higher scoring docs, but that would slow down searching.
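
To illustrate the point about keeping only the scores that are high
enough, here is a minimal plain-Java sketch of top-k retention with a
bounded min-heap. It is not Lucene code: the names TopScoresSketch, Hit
and topK are hypothetical, and the parallel arrays only stand in for
the hits a search would produce.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustration only; these names are hypothetical and not part of Lucene. */
final class TopScoresSketch {

    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Keeps at most k hits; a hit is dropped as soon as k better ones are known. */
    static List<Hit> topK(int[] docs, float[] scores, int k) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore); // lowest score on top
        for (int i = 0; i < docs.length; i++) {
            if (heap.size() < k) {
                heap.add(new Hit(docs[i], scores[i]));
            } else if (k > 0 && scores[i] > heap.peek().score) {
                heap.poll(); // evict the weakest hit kept so far
                heap.add(new Hit(docs[i], scores[i]));
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(byScore.reversed()); // best hit first
        return result;
    }
}
```

Nothing but the document number and score is held per hit here. Keeping
match positions as well would mean collecting them for every candidate
before it is known whether that candidate survives.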

> 
> My simplest example is a phrase query, "brown fox". Why should we
> have to attempt to simulate what lucene does in the highlighting code?
> There are several attempts out there to solve this using various approaches,
> span queries, custom hacks, etc., but all suffer from the same problem.
> Namely, it's a lot of difficult work to correctly find the same terms in the
> highlighting code that lucene has already found moments before. So we end up
> highlighting "brown" and "fox" wherever they occur, not just the phrase.
> 
> I read with interest the recent discussion of using span queries to search
> a single document to determine phrases, taking into account slop factor, etc.
...

Getting PhraseQuery to work as a SpanQuery, and as efficiently as it works
now, will not be straightforward, but it might be possible.
The NearSpansOrdered posted here might be a good starting point:
http://issues.apache.org/bugzilla/show_bug.cgi?id=35823

One approach could be to redo the search, but limited to
the documents to be highlighted, and only gathering the highlight
positions from the Spans during that redone search.
This can be fast when the search was just done and most index info
needed is still cached by the operating system, and has the
advantage that the highlights will be the same as the ones used to
compute the document scores.
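
The idea of redoing the match only for the documents to be highlighted
can be sketched in plain Java. This is an illustration under loud
assumptions: it does not use Lucene's Spans API, it handles only an
exact phrase (slop 0), and the class PhraseHighlightSketch, its methods,
and the token lists standing in for indexed documents are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only; none of these names are Lucene APIs. */
final class PhraseHighlightSketch {

    /** Token positions where all phrase terms occur consecutively (slop 0). */
    static List<Integer> phraseStarts(List<String> tokens, List<String> phrase) {
        List<Integer> starts = new ArrayList<>();
        if (phrase.isEmpty()) {
            return starts;
        }
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                starts.add(i);
            }
        }
        return starts;
    }

    /** Positions of any phrase term: what term-by-term highlighting would mark. */
    static List<Integer> termPositions(List<String> tokens, List<String> phrase) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (phrase.contains(tokens.get(i))) {
                positions.add(i);
            }
        }
        return positions;
    }

    /** Repeats the phrase match, but only for the documents to be highlighted. */
    static Map<Integer, List<Integer>> highlightPositions(
            Map<Integer, List<String>> docs, List<Integer> docIds, List<String> phrase) {
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        for (Integer id : docIds) {
            List<String> tokens = docs.get(id);
            if (tokens != null) {
                result.put(id, phraseStarts(tokens, phrase));
            }
        }
        return result;
    }
}
```

For the tokens "the brown dog chased the quick brown fox" and the phrase
"brown fox", phraseStarts returns only position 6, while termPositions
returns 1, 6 and 7, which is the over-highlighting described in the
quoted message.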

Regards,
Paul Elschot

