lucene-java-user mailing list archives

From Michael Sokolov
Subject Re: Highlighting html pages
Date Tue, 06 Nov 2012 12:18:45 GMT
On 11/6/2012 3:29 AM, Steve Rowe wrote:
> Hi Scott,
> HTMLStripCharFilter doesn't require that its input be valid HTML - there is no assumption of balanced tags.
> Also, highlighted sections could span tags, e.g. if you highlight "this phrase", and the original HTML looks like:
> 	… this<span>phrase</span> …
> the highlighting code would have to know to put multiple tags to avoid non-wellformedness, maybe something like:
> 	… <b>this</b><span><b>phrase</b></span> …
> If you do develop a solution here, it would be great if you could share it with the community.
> Also, I think it would be useful to have an XML-specific stripping char filter - it's on my long term to-do list :).
Steve: see I have 
updates for this, but since no committers took it up, I haven't bothered 
to keep the issue up to date with my latest code.

I would also love to see a tag-balancer for highlighting phrases. Our 
current solution is to use the old highlighter (not 
FastVectorHighlighter), which seems to tag each word in a phrase 
independently, rather than as an entire phrase.
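
For illustration, here is a minimal sketch of what such a tag balancer might do. It is not part of Lucene: the class and method names are hypothetical, and it assumes the highlight range is given as character offsets into the original HTML (the kind of offsets an offset-correcting char filter would report). It also assumes a tag is simply everything between '<' and the next '>', so comments, scripts, and '>' inside attribute values are not handled. The idea is to close the highlight tag before every markup tag and reopen it on the next text character that is still in range.

```java
// Hypothetical helper, not a Lucene API: wraps each text run inside
// [start, end) in pre/post so that no original tag ends up inside a highlight.
class TagBalancingHighlighter {

    static String highlight(String html, int start, int end, String pre, String post) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false; // currently between '<' and '>'
        boolean open = false;  // a highlight tag is currently open

        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            boolean inRange = i >= start && i < end;

            if (c == '<') {
                // Close the highlight before any markup tag begins.
                if (open) {
                    out.append(post);
                    open = false;
                }
                inTag = true;
                out.append(c);
            } else if (inTag) {
                // Copy tag content through untouched.
                out.append(c);
                if (c == '>') {
                    inTag = false;
                }
            } else {
                // Plain text: open or close the highlight at range boundaries.
                if (inRange && !open) {
                    out.append(pre);
                    open = true;
                } else if (!inRange && open) {
                    out.append(post);
                    open = false;
                }
                out.append(c);
            }
        }
        if (open) {
            out.append(post);
        }
        return out.toString();
    }
}
```

With the input "a this<span>phrase</span> b" and the range 2 to 18, this yields "a <b>this</b><span><b>phrase</b></span> b", which has the shape Steve describes above.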

