lucene-java-user mailing list archives

From: Michael Sokolov
Subject: Re: Highlighting html pages
Date: Mon, 05 Nov 2012 13:21:00 GMT
HTMLStripCharFilter runs first, before any tokenizer. It strips all the tags and leaves
your text intact. If you have angle brackets in the text (i.e. not tags), they are left as is.
All your other analysis code should work just the same as if the text came from a plain text
file. Which tokenizer you use is up to you and has nothing to do with the CharFilter.
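
To make the "strip first, tokenize second, highlight in the original" idea concrete, here is
a minimal plain-Java sketch. It is not Lucene's implementation and uses no Lucene classes:
the class HtmlHighlightSketch, its methods, and the "hit" CSS class are all hypothetical
names made up for illustration. It assumes tags never contain '>' inside attribute values,
it does not decode entities, and a simple letter/digit splitter stands in for whatever
tokenizer you choose.

```java
import java.util.Locale;

// Illustrative only: a hand-rolled stand-in for "char filter + tokenizer + highlighter".
final class HtmlHighlightSketch {

    // True if the '<' at index i looks like the start of a tag ("<a", "</a", "<!--")
    // and a closing '>' exists somewhere after it. A bare '<' as in "1 < 2" is not a tag.
    static boolean looksLikeTag(String html, int i) {
        if (i + 1 >= html.length()) return false;
        char next = html.charAt(i + 1);
        return (Character.isLetter(next) || next == '/' || next == '!')
                && html.indexOf('>', i) >= 0;
    }

    // Returns the whole document with every occurrence of term (case-insensitive,
    // whole tokens only) wrapped in a span. Text inside tags is never touched.
    static String highlight(String html, String term) {
        // Pass 1: strip tags, remembering where each kept character came from.
        StringBuilder text = new StringBuilder();
        int[] origin = new int[html.length()]; // origin[k] = index in html of text.charAt(k)
        for (int i = 0; i < html.length(); i++) {
            origin[text.length()] = i;
            if (html.charAt(i) == '<' && looksLikeTag(html, i)) {
                text.append(' ');            // a tag becomes one space, so words do not merge
                i = html.indexOf('>', i);    // the loop increment then steps past '>'
            } else {
                text.append(html.charAt(i));
            }
        }

        // Pass 2: tokenize the stripped text and map matching tokens back to the HTML.
        String wanted = term.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        int copied = 0; // how much of the original HTML has been written so far
        int pos = 0;
        while (pos < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(pos))) { pos++; continue; }
            int start = pos;
            while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos))) pos++;
            String token = text.substring(start, pos).toLowerCase(Locale.ROOT);
            if (token.equals(wanted)) {
                int from = origin[start];       // token start in the original HTML
                int to = origin[pos - 1] + 1;   // token end (exclusive) in the original HTML
                out.append(html, copied, from)
                   .append("<span class=\"hit\">")
                   .append(html, from, to)
                   .append("</span>");
                copied = to;
            }
        }
        out.append(html, copied, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p class=\"lucene\">Lucene is fast; 1 < 2, and lucene scales.</p>";
        System.out.println(highlight(html, "lucene"));
    }
}
```

With the sample input in main, both occurrences of the word in the text get wrapped, the
class="lucene" attribute inside the tag is left alone, and the bare "<" in "1 < 2" passes
through unchanged. The offset array plays the role that offset correction plays in a real
char filter: tokens are found in the stripped text, but highlighting is applied at the
matching positions in the original document.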


On 11/1/2012 9:16 PM, Scott Smith wrote:

> I was trying to play with this. Am I correct in assuming that this isn't going to work
> with the StandardTokenizer (since it appears to strip angle brackets, among other things)?
> Does HTMLStripCharFilter expect a WhitespaceTokenizer or a CharacterTokenizer or ??
> If I want to get rid of punctuation (commas, periods, semicolons, etc.) after the HTML
> stripping, is there a filter? Essentially, I want to get it back to what StandardTokenizer
> would give me after I've stripped the HTML.
> Suggestions?
> Scott
> -----Original Message-----
> From: Michael Sokolov []
> Sent: Tuesday, October 23, 2012 9:04 PM
> To:
> Cc: Scott Smith
> Subject: Re: Highlighting html pages
> If you use HTMLStripCharFilter, it extracts the text only, leaving tags out, and remembering
> the word positions so that highlighting works properly. Should do exactly what you want out
> of the box...
> On 10/23/2012 8:00 PM, Scott Smith wrote:
>> I need to take an html page that I retrieve from my lucene search and highlight
>> all of the terms that are part of the search. I need to skip over any html tags since I don't
>> want any words in tags which happen to match the search to be highlighted.
>> Note that I don't want sections of the document. I need to highlight all terms in
>> the document (with a <span> or something similar) and get back the entire document (with
>> the new <span>s) so it can be displayed in its entirety with the search terms highlighted.
>> Last time I did this (in the days of 1.4.2 - so a while ago), I had to write a custom
>> tokenizer that skipped over the html tokens so that I didn't accidentally highlight them.
>> I'm hoping that there is an easier way to do this now.
>> Suggestions?
