commons-issues mailing list archives

From "Adam Hooper (JIRA)" <>
Subject [jira] [Commented] (LANG-955) StringEscapeUtils.escapeXml doesn't remove invalid characters
Date Thu, 30 Jan 2014 20:24:10 GMT


Adam Hooper commented on LANG-955:

I agree with the observation, of course.

I differ in that I think most/all users of `escapeXml` think it's doing something that it
isn't actually doing. What's the point of a method called `escapeXml` that you can only use
once you've already mangled your string such that it only contains valid characters? `escapeXml`
as it is today only solves the easy part of a hard problem.

As a user, I was burned by this method. I imagine most or all users of `escapeXml` are unaware
of this subtlety and would prefer an `escapeXml` that does not output invalid XML.

What about the following:

* `escapeXml`: outputs valid XML 1.0 (which is also valid XML 1.1)
* `escapeXml11`: outputs valid XML 1.1 (which may not be valid XML 1.0)
* `escapeXmlEntities`: current functionality (which, I opine, isn't as useful as the other
  two -- why use this instead of `escapeXml` or `escapeXml11`?)
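
To make the first bullet concrete, here is a rough sketch of what an `escapeXml` that "outputs valid XML 1.0" could look like. It is not a patch against Commons Lang: the class name `XmlEscapeSketch` and the methods `isXml10Char` and `escapeXml10` are made up for illustration, and silently dropping disallowed code points is just one possible policy. The ranges come from the `Char` production in the XML 1.0 specification.

```java
// Hypothetical sketch, not the Commons Lang implementation.
final class XmlEscapeSketch {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops code points XML 1.0 cannot represent, then escapes the five predefined entities. */
    static String escapeXml10(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isXml10Char(cp)) {
                continue; // assumption: drop silently (also catches unpaired surrogates)
            }
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

With something like this, `XmlEscapeSketch.escapeXml10("\u0004")` returns an empty string instead of passing the control character through.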

It would be nice to let users opt for an exception to be thrown when a string is going to
be mangled, too. But maybe that extra feature adds more confusion than benefit....
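
For the opt-in exception idea, a strict check could run before escaping and refuse any string that would otherwise be mangled. Again only a sketch: `StrictXmlEscapeSketch` and `requireXml10Text` are hypothetical names, and `IllegalArgumentException` is an arbitrary choice of exception type.

```java
// Hypothetical sketch of a "fail instead of mangle" check; not part of Commons Lang.
final class StrictXmlEscapeSketch {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Throws if the input contains any code point that XML 1.0 cannot represent. */
    static void requireXml10Text(String input) {
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!isXml10Char(cp)) {
                throw new IllegalArgumentException(String.format(
                        "Code point U+%04X at index %d is not allowed in XML 1.0", cp, i));
            }
            i += Character.charCount(cp);
        }
    }
}
```

A caller who wants the strict behaviour would call `requireXml10Text(s)` first and only then escape `s`; everyone else skips the check and gets the lenient result.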

> StringEscapeUtils.escapeXml doesn't remove invalid characters
> -------------------------------------------------------------
>                 Key: LANG-955
>                 URL:
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*
>    Affects Versions: 3.1
>         Environment: Ubuntu 13.10
>            Reporter: Adam Hooper
>              Labels: xml
>             Fix For: Patch Needed
> escapeXml lets non-text characters pass through into XML files:
> {code}
> scala> org.apache.commons.lang3.StringEscapeUtils.escapeXml("\u0004").codePointAt(0)
> res4: Int = 4
> {code}
> I would expect the result to be an exception -- either from StringEscapeUtils (refusing
> to encode it) or, preferably, from String.codePointAt, complaining that the string is empty.
> \u0004 is not a valid character in XML 1.0, and there is no way to represent it in an XML
> document -- not even by escaping it.
> Wikipedia summarizes the characters that are not allowed in XML -- even after escaping: The reason for disallowing them: XML
> is a text interchange format, and control characters are not text.
> If StringEscapeUtils.escapeXml allows invalid XML characters through -- whether escaped
> or not -- it generates invalid XML. Valid XML parsers will refuse to read such files.
