lucene-java-user mailing list archives

From markharw00d <>
Subject Re: Highlighting text for queries with huge numbers of terms
Date Fri, 17 Feb 2006 08:03:12 GMT
Hi Daniel/Chris,

> Unfortunately, the contrib/highlighter code in source control fails to 
> meet our needs in two ways:
>   1. We don't just want fragments, we want *all* of the text, with
>      highlights in the appropriate places (although we do offer a means
>      to display just the fragments as well), and

Pass a "NullFragmenter" to the highlighter (via setTextFragmenter) to 
turn off fragmenting.

>   2. We don't deal with HTML, just plain text on a Swing text component.
>      In other words we don't have to "format" or modify the text at all,
>      just tell the Swing component which bits need to be highlighted.

Swing supports HTML and will do the highlight for you.

If you don't like that approach and really do just want to know the 
positions, plug in your own "Formatter" class which, instead of marking 
up the text, silently records the hit position information provided to 
it in the "TokenGroup" class and then returns the original string without 
adding any markup. TokenGroup handles the issue of identifying runs of 
overlapping tokens for you.
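
A rough sketch of that recording approach follows. To keep it compilable without Lucene on the classpath, it uses two hypothetical stand-in types (TokenGroupStub and FormatterStub) that only mirror the shape of the contrib "TokenGroup" and "Formatter" as described above; they are not the Lucene classes, and the field names are assumptions. In real code you would implement the contrib Formatter interface instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the contrib TokenGroup: one run of overlapping
// tokens, its character offsets, and a combined score for the run.
class TokenGroupStub {
    final int startOffset;
    final int endOffset;
    final float totalScore; // assumption: > 0 means the run contains a query hit

    TokenGroupStub(int startOffset, int endOffset, float totalScore) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.totalScore = totalScore;
    }
}

// Hypothetical stand-in for the contrib Formatter callback.
interface FormatterStub {
    String highlightTerm(String originalText, TokenGroupStub group);
}

// Records hit offsets and hands the text back untouched.
class RecordingFormatter implements FormatterStub {
    private final List<int[]> spans = new ArrayList<>();

    @Override
    public String highlightTerm(String originalText, TokenGroupStub group) {
        if (group.totalScore > 0) {
            // Remember where the hit is instead of wrapping it in markup.
            spans.add(new int[] {group.startOffset, group.endOffset});
        }
        return originalText;
    }

    List<int[]> getSpans() {
        return spans;
    }

    public static void main(String[] args) {
        RecordingFormatter formatter = new RecordingFormatter();
        formatter.highlightTerm("Lucene", new TokenGroupStub(4, 10, 1.0f));
        formatter.highlightTerm("checks", new TokenGroupStub(23, 29, 0.0f));
        for (int[] span : formatter.getSpans()) {
            System.out.println("highlight " + span[0] + " to " + span[1]);
        }
    }
}
```

The recorded spans could then be handed to the Swing text component, for example through getHighlighter().addHighlight(start, end, painter), so the text itself is never modified.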

Hoss, your pseudocode looked like a solution for identifying phrase 
queries.  Lack of proper support for phrase queries is a known issue 
with the current highlighter but I thought the primary issue in question 
here was speed? The approach taken by the current highlighter is to 
maintain a HashSet of all unique query terms and check each token in the 
text's token stream for a hit on this set. As your code suggests, this 
could be made faster if there were multiple queries all of which were 
PhraseQueries (with no slop factor!) because you would only need to 
check each phrase's "first terms" initially. Not sure this helps for 
non-phrase queries. Also, I don't think hitting the index to work out 
what terms were a hit for the doc in question in order to shorten the 
list of terms to highlight is likely to speed things up. If anything, 
the extra disk IO is likely to slow it down.
With regard to the question of overlapping tokens: the highlighter is 
robust in the face of marking these up.
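
For illustration, the HashSet check described above can be sketched in plain Java. This is not the contrib/highlighter code: the class name TermSetMatcher, the regex tokenizer, and the lower-casing step are all assumptions standing in for a real Analyzer and token stream. The point is only that each token costs one set lookup, however many query terms there are.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of "check every token against a set of query terms".
class TermSetMatcher {
    // Crude tokenizer assumption: a token is a run of letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Returns {start, end} character offsets of every token contained in queryTerms. */
    static List<int[]> findHits(String text, Set<String> queryTerms) {
        List<int[]> hits = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Assumes the query terms were lower-cased the same way.
            String term = m.group().toLowerCase(Locale.ROOT);
            if (queryTerms.contains(term)) {
                hits.add(new int[] {m.start(), m.end()});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("lucene", "highlighter"));
        String text = "The Lucene highlighter checks each token.";
        for (int[] hit : findHits(text, terms)) {
            System.out.println(hit[0] + "-" + hit[1] + ": " + text.substring(hit[0], hit[1]));
        }
    }
}
```

Because the lookup is a hash probe per token, the running time is driven by the length of the text rather than by the size of the term set, which is why trimming the term list via the index is unlikely to pay for its extra IO.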

