From Fred Toth
Subject Re: Thinking about better highlighting
Date Thu, 25 Aug 2005 15:22:05 GMT
Based on this discussion, I've gone back and re-read everything
in LIA on SpanQuery, etc.

Isn't this just another manifestation of the same problem? How
do I reliably, correctly convert an arbitrary lucene query into its
equivalent SpanQuery?

Here's one, for example:

+text:"jurassic barnea" +author:zofer +year:[1987 TO 1987]

As you can imagine, ideally I would like to display the document with
the phrase highlighted, the author name highlighted and the year
highlighted.

Am I correct that there is no simple mechanism to get from the
above "standard" lucene query to a SpanQuery that can give me
the offsets of the terms that actually matched? I'm forced to
pick apart the query, essentially reparse it with a different methodology
to get "close" to what lucene has already done?

Even if I could reliably convert a standard phrase query to a
SpanQuery, that's just the tip of the iceberg, right? What about prefix
queries, complex booleans, etc. Is this a slippery slope?

Isn't it true that Lucene has already identified (somewhere) exactly which
occurrences of "jurassic" and "barnea" caused the phrase match?

I like the idea of reindexing and requerying the matched document at
highlight time, but I'm still lost on how to convert everything to SpanQuery.

Or am I missing something here (always a distinct possibility)?
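
To make the phrase case concrete, here is a minimal sketch in plain Java of what I mean by "which occurrences caused the match". It does not use Lucene at all: the PhraseOffsets class, its whitespace tokenizer and the case-insensitive comparison are assumptions made purely for illustration, and a real analyzer would tokenize differently.

```java
import java.util.ArrayList;
import java.util.List;

public class PhraseOffsets {
    /**
     * Returns {start, end} character offsets (end exclusive) for every place
     * where the phrase terms occur as consecutive whitespace-separated tokens.
     */
    public static List<int[]> find(String text, String... phrase) {
        List<int[]> matches = new ArrayList<>();
        if (phrase.length == 0) {
            return matches;
        }
        // Tokenize on whitespace, remembering each token's character offsets.
        List<int[]> tokens = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) tokens.add(new int[] {start, i});
        }
        // Slide over the tokens and keep only windows that match the whole phrase.
        for (int t = 0; t + phrase.length <= tokens.size(); t++) {
            boolean ok = true;
            for (int p = 0; p < phrase.length && ok; p++) {
                int[] tok = tokens.get(t + p);
                ok = text.substring(tok[0], tok[1]).equalsIgnoreCase(phrase[p]);
            }
            if (ok) {
                int[] first = tokens.get(t);
                int[] last = tokens.get(t + phrase.length - 1);
                matches.add(new int[] {first[0], last[1]});
            }
        }
        return matches;
    }
}
```

On the text "jurassic park and jurassic barnea by barnea", only the adjacent pair is reported. The stray "jurassic" and the stray "barnea" are left alone, which is exactly what term-based highlighting gets wrong.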



At 05:11 AM 8/25/2005, you wrote:
>Unfortunately I've not had the time to address the
>phrase highlighting issues in the current highlighter
>but I think I've an idea as to how best to fix it:
>I would suggest rewriting the highlighter to use Spans
>not Terms to find the relevant sections in a text.
>Most of the code required for such a solution is
>around in one form or another but needs bringing
>together:
>* The SpansExtractor class here:
>This can be used to get Spans for a given query and
>IndexReader to show where all query hits for a
>document lie.
>* The contrib section includes a MemoryIndex that can
>provide a fast IndexReader for a single document
>(faster than using RAMDirectory).
>* The LuceneInAction code example for SpanQueries
>includes a rudimentary highlighter that uses Spans to
>control where markup is introduced given a collection
>of Spans (this does not attempt to summarise long docs).
>The overall approach using this code would be to index
>each doc to be highlighted in MemoryIndex, run
>SpansExtractor using the (rewritten) user query and
>the MemoryIndex's IndexReader, give the resulting
>spans to an adapted LIA highlighter/summariser.
>Some issues with this:
>1) The contrib section would now have inter-project
>dependencies (highlighter -> MemIndex) which would
>need to be catered for in the Ant build process.
>2) We may need to think about how we factor in IDF
>weighting of individual terms to the summarising
>process so that the more important terms influence the
>selection of highlights.
>Does this sound reasonable?
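
The last step of the approach quoted above, turning a collection of spans into markup, can be sketched in plain Java. This is not the LIA highlighter and it makes no Lucene calls. The SpanMarkup class and the <b> tags are hypothetical, and the spans are assumed to have already been mapped to character offsets (Lucene's Spans report token positions, so that mapping would have to happen first).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SpanMarkup {
    /**
     * Wraps each {start, end} character range (end exclusive) in <b>...</b>.
     * Spans are processed in order of start offset; any part of a span that
     * overlaps text already emitted is clipped away.
     */
    public static String highlight(String text, List<int[]> spans) {
        List<int[]> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt(s -> s[0]));

        StringBuilder out = new StringBuilder();
        int pos = 0; // first character of text not yet copied to the output
        for (int[] s : sorted) {
            int start = Math.max(s[0], pos);
            int end = Math.min(s[1], text.length());
            if (end <= start) {
                continue; // empty, out of range, or fully covered already
            }
            out.append(text, pos, start)
               .append("<b>").append(text, start, end).append("</b>");
            pos = end;
        }
        return out.append(text.substring(pos)).toString();
    }
}
```

Given "jurassic barnea by zofer" and the spans {0, 15} and {19, 24}, this produces "<b>jurassic barnea</b> by <b>zofer</b>". Summarising long documents and weighting spans by IDF, as raised in issue 2, are left out of the sketch.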
