From Michael Sokolov <soko...@ifactory.com>
Subject Re: Highlighting html pages
Date Tue, 06 Nov 2012 12:18:45 GMT
On 11/6/2012 3:29 AM, Steve Rowe wrote:
> Hi Scott,
>
> HTMLStripCharFilter doesn't require that its input be valid HTML - there is no assumption of balanced tags.
>
> Also, highlighted sections could span tags, e.g. if you highlight "this phrase", and the original HTML looks like:
>
> 	… this<span>phrase</span> …
>
> the highlighting code would have to know to put multiple tags to avoid non-wellformedness, maybe something like:
>
> 	… <b>this</b><span><b>phrase</b></span> …
>
> If you do develop a solution here, it would be great if you could share it with the community.
>
> Also, I think it would be useful to have an XML-specific stripping char filter - it's on my long term to-do list :).
>
Steve: see https://issues.apache.org/jira/browse/SOLR-2597. I have 
updates for this, but since no committers took it up, I haven't bothered 
to keep the issue up to date with my latest code.

I would also love to see a tag-balancer for highlighting phrases. Our 
current solution is to use the old highlighter (not 
FastVectorHighlighter), which seems to tag each word in a phrase 
independently, rather than as an entire phrase.
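
To make the tag-balancing idea concrete, here is a minimal sketch in plain Java. It is not part of Lucene or Solr; the class name TagBalancingHighlighter and the method highlight are hypothetical. It assumes the match offsets point into the original HTML, that neither offset falls inside a tag, and that '<' and '>' appear only as tag delimiters (no comments, CDATA, or script content).

```java
public class TagBalancingHighlighter {

    /**
     * Wraps the text inside html[start, end) in <b>...</b>, closing the
     * highlight before every tag and reopening it after, so no <b> element
     * ever crosses a tag boundary.
     *
     * Assumptions (hypothetical helper, not a Lucene API): start and end are
     * offsets into the original HTML, neither falls inside a tag, and '<' / '>'
     * occur only as tag delimiters.
     */
    public static String highlight(String html, int start, int end) {
        StringBuilder out = new StringBuilder(html.substring(0, start));
        boolean inTag = false;     // currently between '<' and '>'
        boolean bOpen = false;     // a <b> is currently open in the output

        for (int i = start; i < end; i++) {
            char c = html.charAt(i);
            if (c == '<') {
                // Close the highlight before the tag begins.
                if (bOpen) {
                    out.append("</b>");
                    bOpen = false;
                }
                inTag = true;
                out.append(c);
            } else if (c == '>') {
                inTag = false;
                out.append(c);
            } else if (inTag) {
                // Tag names and attributes are copied through untouched.
                out.append(c);
            } else {
                // Plain text: (re)open the highlight if needed.
                if (!bOpen) {
                    out.append("<b>");
                    bOpen = true;
                }
                out.append(c);
            }
        }
        if (bOpen) {
            out.append("</b>");
        }
        return out.append(html.substring(end)).toString();
    }
}
```

With the input this<span>phrase</span> and offsets 0 and 16, this sketch produces <b>this</b><span><b>phrase</b></span>, i.e. one highlight element per text run rather than one spanning the whole phrase.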

-Mike
