lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 30621] - HTML Parser doesn't decode character references in attributes
Date Tue, 07 Sep 2004 09:15:28 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=30621>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

------- Additional Comments From Dave.Sparks@teamware.co.uk  2004-09-07 09:15 -------
I didn't attach my patch because, although it fixes the problem for us, it's a
kludge (IMO).  The text inside an attribute value is being parsed twice: the
grammar definition treats it as a simple string, and my patch then parses that
string to resolve the character references in <img alt="..."> and <meta
content="...">; other attribute values are left alone.  I suspect that character
references should be resolved in other attribute values such as <meta
name="..."> even though it should never be necessary to use a character
reference here.  The HTML definition isn't entirely clear - perhaps the SGML
standard is clearer.
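
For illustration only, the second pass described above might look roughly like
the Java sketch below.  This is not the unattached patch: the class name
AttributeEntityDecoder, its method names and the small entity table are all
hypothetical, and only a handful of named references are covered.  It resolves
named, decimal and hexadecimal character references in a string, and applies
that only to <img alt="..."> and <meta content="...">, leaving other attribute
values alone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: a post-processing pass over an
// attribute value that the grammar has already returned as a plain string.
final class AttributeEntityDecoder {
    // Illustrative subset of named references; a full table is assumed elsewhere.
    private static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", '&');
        NAMED.put("lt", '<');
        NAMED.put("gt", '>');
        NAMED.put("quot", '"');
        NAMED.put("nbsp", '\u00A0');
        NAMED.put("eacute", '\u00E9');
    }

    // Decode only the two attributes the comment mentions; pass others through.
    static String decodeAttribute(String tag, String attr, String value) {
        boolean textual = ("img".equalsIgnoreCase(tag) && "alt".equalsIgnoreCase(attr))
                || ("meta".equalsIgnoreCase(tag) && "content".equalsIgnoreCase(attr));
        return textual ? decode(value) : value;
    }

    // Replace &name;, &#NNN; and &#xHHH;. Unknown or malformed references are kept verbatim.
    static String decode(String value) {
        StringBuilder out = new StringBuilder(value.length());
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            int semi = (c == '&') ? value.indexOf(';', i + 1) : -1;
            if (semi > i + 1) {
                String decoded = resolve(value.substring(i + 1, semi));
                if (decoded != null) {
                    out.append(decoded);
                    i = semi + 1;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // Returns the replacement text for a reference body (without '&' and ';'), or null.
    private static String resolve(String body) {
        if (body.charAt(0) == '#') {
            try {
                boolean hex = body.length() > 1
                        && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1), 10);
                return Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : null;
            } catch (NumberFormatException e) {
                return null; // e.g. "&#;" or "&#xZZ;"
            }
        }
        Character named = NAMED.get(body);
        return named == null ? null : String.valueOf(named);
    }
}
```

A caller would pass the tag name, attribute name and raw value, for example
AttributeEntityDecoder.decodeAttribute("img", "alt", rawValue).  Handling this
in the grammar itself, as argued below, would make such a second pass
unnecessary.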

Since the HTML parser is an example, it shouldn't include kludges like this
(again, IMO).  The grammar describing an attribute value ought to be correct. 
Since I needed a quick fix, the kludge is sufficient for me.  No one else has
complained (yet), so I don't see any need to rush a poor solution into the
released product.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

