Subject Re: highlighting phrases
Date Wed, 01 Sep 2004 07:48:11 GMT
Adding support for phrases could be tricky.
So far I have deliberately avoided reimplementing specialized highlighting logic for each
of the different query types, e.g. understanding the nuances of the "slop factor" in phrase
queries. I may be wrong, but adding specialized support for different query types just feels
like the start of a slippery slope.

If people are keen to add such support though, here are some pointers to bear in mind...

Remember that the highlighter is also designed to summarize docs by selecting the best fragments.
One decision to be made up front is whether a special "Fragmenter" implementation is required
that uses the query to influence the way it breaks the doc into fragments, i.e. it ensures that
matching words in phrase queries or span queries remain in the same fragment.
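
To make the idea concrete, here is a rough standalone sketch of a query-aware
fragmenter. It does not implement the real Fragmenter interface; the class name,
method names and the "protected span" representation are all made up for
illustration. It starts a new fragment once a target size is reached, but refuses
to break inside a phrase or span match.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: not the Lucene Fragmenter interface.
class PhraseAwareFragmenterSketch {

    /**
     * Returns the character offsets at which fragments start.
     *
     * tokenStarts    start offset of each token, in document order
     * protectedSpans {start, end} character offsets (end exclusive) of phrase
     *                or span matches that must stay in one fragment
     * targetSize     preferred fragment length in characters
     */
    static List<Integer> fragmentStarts(int[] tokenStarts, int[][] protectedSpans, int targetSize) {
        List<Integer> starts = new ArrayList<>();
        if (tokenStarts.length == 0) {
            return starts;
        }
        int currentStart = tokenStarts[0];
        starts.add(currentStart);
        for (int i = 1; i < tokenStarts.length; i++) {
            int offset = tokenStarts[i];
            boolean sizeReached = offset - currentStart >= targetSize;
            // Only break at a token that does not sit inside a protected match.
            if (sizeReached && !insideSpan(offset, protectedSpans)) {
                starts.add(offset);
                currentStart = offset;
            }
        }
        return starts;
    }

    // A break at `offset` splits a span if the span starts before it and ends after it.
    static boolean insideSpan(int offset, int[][] spans) {
        for (int[] span : spans) {
            if (span[0] < offset && offset < span[1]) {
                return true;
            }
        }
        return false;
    }
}
```

For the text "the quick brown fox jumps over the lazy dog" with a target size of 12,
an unprotected run breaks at offsets 0, 16 and 31, which splits "brown fox". With
the phrase protected as {10, 19} the breaks move to 0, 20 and 35. The cost is that
fragments can grow past the target size, which also affects how they are scored.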

If phrase matches are allowed to span fragments, thought needs to be given to how the fragments
are scored.

Do phrases/spans get marked up with one tag, e.g. <B>My Phrase</B>, or many, e.g. <B>My</B>
<B>Phrase</B>?
I expect "many" is the answer, given the possibility of other query terms appearing intermingled
in a phrase with a high slop factor or in a span.
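
The "many tags" option might look something like the sketch below. It is only an
illustration: the class and method names are invented, and it assumes the term
offsets arrive sorted and non-overlapping, which a real Formatter could not take
for granted.

```java
// Hypothetical sketch only: wraps each matched term in its own tag.
class PerTermMarkupSketch {

    /**
     * text        the original fragment text
     * termOffsets {start, end} character offsets (end exclusive) of each
     *             matched term, assumed sorted and non-overlapping
     */
    static String markUp(String text, int[][] termOffsets) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] term : termOffsets) {
            // Copy the unhighlighted text between the previous term and this one.
            out.append(text, last, term[0]);
            out.append("<B>").append(text, term[0], term[1]).append("</B>");
            last = term[1];
        }
        // Copy whatever follows the final term.
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

Calling markUp("My Phrase", new int[][]{{0, 2}, {3, 9}}) gives
"<B>My</B> <B>Phrase</B>". Because every term is tagged on its own, words that sit
between the phrase terms are simply left alone, or tagged separately if they match
another part of the query.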

The positions of the terms in the phrases will need to be known by the Formatter implementation
before it attempts to mark up the text. This could/should be done using position info in the
Lucene index rather than requiring a separate analyzer pass over the original text.
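
As a very simplified sketch of what working from positions could involve: given the
token positions of each phrase term in one document, find where the terms occur
consecutively. This deliberately ignores slop (it only handles a slop of 0), and it
takes the positions as plain arrays, so how they are read from the index is left
out. The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: exact (slop 0) phrase matching over token positions.
class PhrasePositionSketch {

    /**
     * positions[i] holds the sorted token positions of the i-th phrase term
     * within a single document. Returns the position of the first term for
     * every place where all terms occur consecutively and in order.
     */
    static List<Integer> exactPhraseStarts(int[][] positions) {
        List<Integer> starts = new ArrayList<>();
        if (positions.length == 0) {
            return starts;
        }
        for (int p : positions[0]) {
            boolean match = true;
            // The i-th term must appear exactly i positions after the first term.
            for (int i = 1; i < positions.length && match; i++) {
                match = Arrays.binarySearch(positions[i], p + i) >= 0;
            }
            if (match) {
                starts.add(p);
            }
        }
        return starts;
    }
}
```

For the text "my phrase is my favourite phrase", the term "my" is at positions 0 and 3
and "phrase" at 1 and 5, so only the occurrence starting at position 0 counts as a
phrase match. Supporting a non-zero slop factor would need a looser test than the
exact "p + i" check used here.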

Most of this should be achievable using specialized implementations of Formatter, Fragmenter
and Scorer, so the main Highlighter code should remain untouched.

These are just some of the "gotchas" off the top of my head. I'm sure there will be several
more issues waiting to be revealed...
Hope this helps anyway.

