lucene-java-user mailing list archives

From Fred Toth <>
Subject Thinking about better highlighting
Date Thu, 25 Aug 2005 02:47:02 GMT

First, my thanks to those who've contributed to the current
best practices for highlighting. We use your code!

However, after reviewing recent discussions about highlighting,
and struggling with our own highlighting issues, I'm wondering if
there's a better way.

Others have certainly thought more about this (but I've thought
about it a lot).

Isn't it true that the fundamental problem is that all of the highlighting
approaches are struggling with trying to recreate what the lucene core
has already done at search time?

My simplest example is a phrase query, "brown fox". Why should we
have to attempt to simulate what lucene does in the highlighting code?
There are several attempts out there to solve this using various approaches
(span queries, custom hacks, etc.), but all suffer from the same problem.
Namely, it's a lot of difficult work to correctly find the same terms in the
highlighting code that lucene has already found moments before. So we end up
highlighting "brown" and "fox" wherever they occur, not just the phrase.
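
To make the "brown fox" problem concrete, here is a minimal sketch in plain
Java. It does not use any Lucene classes. The class name, the method names,
the whitespace tokenization, and the <b> markers are all assumptions made up
for illustration, and the phrase matcher only handles exact adjacency (no
slop, no analyzers).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not Lucene code.
public class PhraseHighlightSketch {

    // Term-based highlighting: marks every token equal to any query term,
    // regardless of whether it is part of the phrase.
    static String highlightTerms(String text, List<String> terms) {
        String[] tokens = text.split(" ");
        String[] out = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            boolean isTerm = terms.contains(tokens[i].toLowerCase(Locale.ROOT));
            out[i] = isTerm ? "<b>" + tokens[i] + "</b>" : tokens[i];
        }
        return String.join(" ", out);
    }

    // Phrase-aware highlighting: marks tokens only where the phrase terms
    // appear consecutively and in order.
    static String highlightPhrase(String text, List<String> phrase) {
        String[] tokens = text.split(" ");
        boolean[] hit = new boolean[tokens.length];
        for (int i = 0; i + phrase.size() <= tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < phrase.size(); j++) {
                if (!tokens[i + j].toLowerCase(Locale.ROOT).equals(phrase.get(j))) {
                    match = false;
                    break;
                }
            }
            if (match) {
                for (int j = 0; j < phrase.size(); j++) {
                    hit[i + j] = true;
                }
            }
        }
        String[] out = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = hit[i] ? "<b>" + tokens[i] + "</b>" : tokens[i];
        }
        return String.join(" ", out);
    }
}
```

For the text "the brown dog chased the quick brown fox" and the terms
"brown" and "fox", highlightTerms marks both occurrences of "brown", while
highlightPhrase marks only the final "brown fox". The second method is exactly
the kind of re-matching I'd rather not have to write outside the core.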

I read with interest the recent discussion of using span queries to search
a single document to determine phrases, taking into account slop factor, etc.
Again, why all this work when lucene has already found specific terms
in the document that matched a specific query (however complex, involving
phrases or prefixes or slop factors, etc.)?

I'm not familiar with core lucene, but is it possible that lucene could return
additional information about specific terms matched? This could be optional
so that those not using it would incur no performance hits. The matched terms
could contain offsets or other information stored at indexing time.
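
As a rough sketch of what I have in mind: if a hit came back with a list of
matched terms and their character offsets, highlighting would reduce to
wrapping those spans. Everything below is hypothetical. TermMatch, its fields,
and the highlight method are names I made up for illustration, and nothing
here is an existing Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not Lucene code.
public class OffsetHighlightSketch {

    // Assumed shape of the per-hit match information: the matched term and
    // its character offsets (start inclusive, end exclusive) in the stored text.
    static final class TermMatch {
        final String term;
        final int startOffset;
        final int endOffset;

        TermMatch(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Wraps each reported span in <b>...</b>. No query re-evaluation happens
    // here; the method trusts the offsets it is given.
    static String highlight(String text, List<TermMatch> matches) {
        List<TermMatch> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingInt(m -> m.startOffset));
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (TermMatch m : sorted) {
            if (m.startOffset < pos) {
                continue; // skip spans that overlap one already emitted
            }
            out.append(text, pos, m.startOffset);
            out.append("<b>").append(text, m.startOffset, m.endOffset).append("</b>");
            pos = m.endOffset;
        }
        out.append(text, pos, text.length());
        return out.toString();
    }
}
```

With the text "the quick brown fox" and matches for "brown" at offsets 10 to 15
and "fox" at 16 to 19, highlight wraps just those two spans. The point is that
the highlighter no longer needs to know anything about the query.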

No matter how complex a lucene query, when core lucene returns a hit, it must
have determined exactly which terms satisfy the query, right? So why do we
have to redo the work, often incorrectly?

Again, I freely admit my ignorance of lucene core. This may be architecturally
impractical for all I know.

But, maybe not! Have others considered this? Please discuss.

Verity has solved this, by the way. It correctly highlights phrases without
highlighting additional occurrences of the phrase terms.


