lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 33389] - [PATCH] Highlighter: Delegate output escaping to Formatter
Date Thu, 03 Feb 2005 18:08:43 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=33389>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=33389





------- Additional Comments From nicko@apache.org  2005-02-03 19:08 -------
(In reply to comment #3)
> Not only do we break the Formatter interface, we also break the
> behaviour for those upgrading. 

If you need to keep the Formatter interface stable then we can do this by adding
another interface that extends Formatter. The Highlighter can adapt its
behaviour depending on the actual instance supplied.

Yes, the patch does cause a behaviour change. The SimpleHTMLFormatter,
GradientFormatter, and SpanGradientFormatter will HTML encode their output where
previously they did not. If that behaviour is not acceptable then they can be
changed to implement the encodeText method so that it just passes back the
originalText. This would give exactly the same behaviour as the current
version.
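
The two options above (a sub-interface of Formatter, and an encodeText that
passes the text straight back) can be sketched roughly as follows. This is only
an illustration under stated assumptions: the Formatter declared here is a
simplified stand-in (the real method also receives a TokenGroup), and the names
EncodingFormatter, PassThroughFormatter, HtmlEncodingFormatter and
FormatterAdapterSketch are hypothetical. Only encodeText and originalText come
from the patch discussion.

```java
// Simplified stand-in for the Highlighter's Formatter interface (assumption:
// the TokenGroup argument is left out to keep the sketch self-contained).
interface Formatter {
    String highlightTerm(String originalText);
}

// Hypothetical sub-interface: existing Formatter implementations keep compiling.
interface EncodingFormatter extends Formatter {
    String encodeText(String originalText);
}

// Keeps the current behaviour: text is passed back unchanged.
class PassThroughFormatter implements EncodingFormatter {
    public String highlightTerm(String originalText) {
        return "<B>" + originalText + "</B>";
    }

    public String encodeText(String originalText) {
        return originalText;
    }
}

// Encodes the four basic HTML metacharacters; '&' must be replaced first.
class HtmlEncodingFormatter implements EncodingFormatter {
    public String highlightTerm(String originalText) {
        return "<B>" + encodeText(originalText) + "</B>";
    }

    public String encodeText(String originalText) {
        return originalText.replace("&", "&amp;")
                           .replace("<", "&lt;")
                           .replace(">", "&gt;")
                           .replace("\"", "&quot;");
    }
}

class FormatterAdapterSketch {
    // The caller adapts to the actual instance supplied: encoding is
    // delegated only when the formatter implements the extended interface.
    static String encode(Formatter formatter, String text) {
        if (formatter instanceof EncodingFormatter) {
            return ((EncodingFormatter) formatter).encodeText(text);
        }
        return text; // plain Formatter: leave the text untouched
    }
}
```

With this shape an application that already escapes its content could supply
the pass-through variant and avoid double escaping.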


> If people have applications which already escape
> the content (eg before or after calling highlighter), when they upgrade to the
> proposed new version their content will be double-escaped. 

I could not find a reliable way of escaping the text if the original search text
contained HTML content. If the text is HTML escaped before the TokenStream is
built then the highlighter is parsing a potentially different stream to the
original search. Also, the TokenStream and the text must be kept in sync,
otherwise the TokenGroup indexes are off. The text cannot be escaped after
highlighting because the original contains the same mark-up used to highlight
the matching tokens. It would be possible to get the SimpleHTMLFormatter to use
two specific GUIDs to wrap the highlight tokens, as it is unlikely that these
would also appear in the original text. These could then be replaced with the
correct tags after HTML escaping.
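
A rough sketch of that GUID idea, to make the order of operations concrete:
wrap matches in two random markers, HTML escape the whole string, then swap the
markers for the real tags. Everything here is an assumption for illustration:
the class and method names are hypothetical, java.util.UUID stands in for
"GUID", and <B>...</B> is assumed as the highlight mark-up.

```java
import java.util.UUID;

class PlaceholderHighlightSketch {
    // Two random markers that are very unlikely to appear in the original text.
    // They contain only hex digits and hyphens, so HTML escaping leaves them intact.
    static final String OPEN = UUID.randomUUID().toString();
    static final String CLOSE = UUID.randomUUID().toString();

    // Step 1: the formatter wraps a matching token in markers, not tags.
    static String wrap(String token) {
        return OPEN + token + CLOSE;
    }

    // Minimal escaping of the basic HTML metacharacters only.
    static String htmlEncode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Steps 2 and 3: escape everything, then replace the markers with real tags.
    static String finish(String highlighted) {
        return htmlEncode(highlighted)
                .replace(OPEN, "<B>")
                .replace(CLOSE, "</B>");
    }
}
```

Mark-up already present in the original text is escaped, while the tags
introduced for highlighting survive because they are only inserted after the
escaping step.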


> They will also incur
> the additional performance overhead for encoding where it may not be required
> (Note: could initialize htmlEncode's StringBuffer to at least
> plainText.length()).

That would certainly help.
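
For illustration, pre-sizing the buffer might look like the sketch below. This
is not the htmlEncode from the patch, just a minimal version under the
assumption that it builds its result in a StringBuffer; the class name is
hypothetical.

```java
class PresizedEncodeSketch {
    static String htmlEncode(String plainText) {
        if (plainText == null || plainText.length() == 0) {
            return "";
        }
        // The encoded form is never shorter than the input, so starting at
        // plainText.length() avoids the early buffer re-allocations.
        StringBuffer result = new StringBuffer(plainText.length());
        for (int i = 0; i < plainText.length(); i++) {
            char ch = plainText.charAt(i);
            switch (ch) {
                case '"': result.append("&quot;"); break;
                case '&': result.append("&amp;"); break;
                case '<': result.append("&lt;"); break;
                case '>': result.append("&gt;"); break;
                default:  result.append(ch);
            }
        }
        return result.toString();
    }
}
```

When nothing needs escaping the buffer never has to grow at all.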

If we are going to break the Formatter interface, how about changing the methods
to take a java.io.Writer or java.io.PrintWriter? This would reduce the number of
temporary strings created.
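
A Writer-based variant could look something like the following sketch. The
interface and class names are hypothetical and the signatures are only one
possible shape, not a concrete proposal for the patch; the TokenGroup argument
is again left out for brevity.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical interface: output goes straight to the Writer instead of
// being returned as an intermediate String.
interface WriterFormatter {
    void highlightTerm(Writer out, String originalText) throws IOException;
    void encodeText(Writer out, String originalText) throws IOException;
}

class SimpleWriterFormatter implements WriterFormatter {
    public void highlightTerm(Writer out, String originalText) throws IOException {
        out.write("<B>");
        encodeText(out, originalText);
        out.write("</B>");
    }

    public void encodeText(Writer out, String originalText) throws IOException {
        for (int i = 0; i < originalText.length(); i++) {
            char ch = originalText.charAt(i);
            switch (ch) {
                case '"': out.write("&quot;"); break;
                case '&': out.write("&amp;"); break;
                case '<': out.write("&lt;"); break;
                case '>': out.write("&gt;"); break;
                default:  out.write(ch);
            }
        }
    }
}

class WriterFormatterDemo {
    // Renders "before [term] after" into one shared Writer.
    static String render(WriterFormatter f, String before, String term, String after) {
        StringWriter out = new StringWriter();
        try {
            f.encodeText(out, before);
            f.highlightTerm(out, term);
            f.encodeText(out, after);
        } catch (IOException e) {
            throw new RuntimeException(e); // not expected with a StringWriter
        }
        return out.toString();
    }
}
```

The trade-off is that every caller then has to deal with IOException, even
when writing to an in-memory buffer.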


> Will the htmlEncode function you've defined work for more exotic
> character sets eg CJK?

No, it is a 'brain dead' implementation.

Commons Lang has an implementation that could be used: escapeHtml(String) in 
http://svn.apache.org/viewcvs.cgi/jakarta/commons/proper/lang/trunk/src/java/org/apache/commons/lang/StringEscapeUtils.java?rev=137958&view=markup

Do you really want to include something like that in with the Highlighter? Is
the Highlighter supposed to do HTML highlighting out of the box, or should it
just be a toolkit that makes it easy to do so?


Nicko

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

