lucene-java-user mailing list archives

From Michael Sokolov
Subject Re: Highlighting html pages
Date Wed, 24 Oct 2012 03:04:09 GMT
If you use HTMLStripCharFilter, it extracts only the text, leaving the tags 
out, while keeping track of the original character offsets so that 
highlighting works properly.  It should do exactly what you want out of the box...
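
To illustrate the idea of offset correction, here is a minimal, Lucene-free sketch in plain Java. It is not how HTMLStripCharFilter is implemented, and the class name OffsetHighlightSketch, the method highlight, and the "hit" CSS class are all made up for this example. It strips tags while recording where each kept character sat in the original page, matches terms against the stripped text, and then inserts <span>s at the original offsets so the whole document comes back intact. It assumes simple, well-formed markup: no '>' inside attribute values, and no special handling of comments, scripts, or entities.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
class OffsetHighlightSketch {

    // Returns the full HTML with matching terms wrapped in <span class="hit">.
    // Terms are expected in lower case.
    static String highlight(String html, Set<String> terms) {
        // Step 1: strip tags, remembering the original offset of each kept char.
        StringBuilder text = new StringBuilder();
        int[] origOffset = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                origOffset[text.length()] = i;
                text.append(c);
            }
        }

        // Step 2: tokenize the stripped text and map matches back to the original.
        StringBuilder out = new StringBuilder();
        int copied = 0; // how much of the original HTML has been emitted so far
        int n = text.length();
        int j = 0;
        while (j < n) {
            if (!Character.isLetterOrDigit(text.charAt(j))) {
                j++;
                continue;
            }
            int start = j;
            j++;
            // A removed tag leaves a gap in the original offsets; treat it as a
            // word break so a <span> never wraps around markup.
            while (j < n && Character.isLetterOrDigit(text.charAt(j))
                    && origOffset[j] == origOffset[j - 1] + 1) {
                j++;
            }
            String word = text.substring(start, j);
            if (terms.contains(word.toLowerCase(Locale.ROOT))) {
                int origStart = origOffset[start];
                int origEnd = origOffset[j - 1] + 1;
                out.append(html, copied, origStart);
                out.append("<span class=\"hit\">");
                out.append(html, origStart, origEnd);
                out.append("</span>");
                copied = origEnd;
            }
        }
        // Step 3: emit whatever follows the last match, unchanged.
        out.append(html, copied, html.length());
        return out.toString();
    }
}
```

Calling OffsetHighlightSketch.highlight(page, terms) with the lower-cased query terms returns the entire page, and a word that only appears inside a tag (for example in a class attribute) is left alone because it never makes it into the stripped text.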

On 10/23/2012 8:00 PM, Scott Smith wrote:
> I need to take an html page that I retrieve from my lucene search and
> highlight all of the terms that are part of the search.  I need to skip
> over any html tags since I don't want any words in tags which happen to
> match the search to be highlighted.
> Note that I don't want sections of the document.  I need to highlight all
> terms in the document (with a <span> or something similar) and get back
> the entire document (with the new <span>s) so it can be displayed in its
> entirety with the search terms highlighted.
> Last time I did this (in the days of 1.4.2 - so a while ago), I had to
> write a custom tokenizer that skipped over the html tokens so that I
> didn't accidentally highlight them.  I'm hoping that there is an easier
> way to do this now.
> Suggestions?
