commons-user mailing list archives

From "Gary Gregory" <ggreg...@seagullsoftware.com>
Subject RE: [Lang] escapeXML() -> Not escaping low characters
Date Tue, 18 Apr 2006 17:51:02 GMT
Here is an excerpt from the XML 1.1 spec (http://www.w3.org/TR/xml11/):

--
The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their
literal form, except when used as markup delimiters, or within a comment, a processing instruction,
or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric
character references or the strings "&amp;" and "&lt;" respectively. The right angle
bracket (>) MAY be represented using the string "&gt;", and MUST, for compatibility,
be escaped using either "&gt;" or a character reference when it appears in the string
"]]>" in content, when that string is not marking the end of a CDATA section.

In the content of elements, character data is any string of characters which does not contain
the start-delimiter of any markup or the CDATA-section-close delimiter, "]]>". In a CDATA
section, character data is any string of characters not including the CDATA-section-close
delimiter.

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote
character (') MAY be represented as "&apos;", and the double-quote character (") as "&quot;".
--

Here is how I read this for our use:

The escapeXml method (IMO) is meant to produce the contents of XML elements and attributes.
In order to produce valid XML content for an attribute or an element, the & and < characters
must be escaped. For compatibility the > character must also be escaped when part of "]]>".
The tricky part is what to do with single and double quote characters. When the content is
for an XML element, you do not need to do anything. When the content is for an XML attribute you
need to know if the attribute is delimited with a single or double quote in order to only
escape what is needed. I would not want to produce an overly escaped string.

So, "low" or "high" characters should not be escaped.

All of this leads me to think that we should deprecate escapeXml and create: escapeXmlElementContent(String)
and escapeXmlAttributeContents(String, char) where the char denotes which quote character
to escape.
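
A minimal sketch of what the two proposed methods might look like, following the reading of the spec above. The class name XmlEscapeSketch is hypothetical and nothing here exists in commons-lang. It also assumes that the "]]>" rule applies only to element content and that the quote argument is either ' or ".

```java
/** Hypothetical sketch only; not part of commons-lang. */
final class XmlEscapeSketch {

    private XmlEscapeSketch() {
    }

    /**
     * Escapes text for use as XML element content: always & and <,
     * and > only where it completes the sequence "]]>".
     * Quotes and all other characters pass through unchanged.
     */
    static String escapeXmlElementContent(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '&') {
                out.append("&amp;");
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == '>' && i >= 2
                    && text.charAt(i - 1) == ']' && text.charAt(i - 2) == ']') {
                out.append("&gt;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Escapes text for use as an XML attribute value delimited by the
     * given quote character: always & and <, plus only that quote.
     */
    static String escapeXmlAttributeContents(String text, char quote) {
        if (quote != '"' && quote != '\'') {
            throw new IllegalArgumentException("quote must be ' or \"");
        }
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '&') {
                out.append("&amp;");
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == quote) {
                // Escape only the delimiter actually in use.
                out.append(quote == '"' ? "&quot;" : "&apos;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, a value written inside double quotes keeps its apostrophes as they are, and a value written inside single quotes keeps its double quotes, so the output is never escaped more than the delimiter requires.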

Gary

> -----Original Message-----
> From: Henri Yandell [mailto:flamefew@gmail.com]
> Sent: Tuesday, April 18, 2006 9:55 AM
> To: Jakarta Commons Users List
> Subject: Re: [Lang] escapeXML() -> Not escaping low characters
> 
> On 3/31/06, David López Muñoz <dlm@tid.es> wrote:
> > Hello,
> >
> > I'm trying to escape some text to be XML-valid and I'm using
> > StringEscapeUtils.escapeXml().
> >
> > I found a problem with low characters such as #18. They don't seem to be
> > escaped, and therefore they are mixed together with other characters as if
> > they were normal characters such as 'a', '1' etc.
> >
> > Am I doing something wrong? I'm using commons-lang 2.1. Is it a known bug already
> > solved in newer versions?
> 
> Sorry for lack of reply. Definitely not fixed yet, and thanks for
> reporting it in bugzilla. There's another bug that complains that high
> characters ARE getting escaped - so definitely something that's up for
> debate :)
> 
> Would all low-chars want to be escaped? I suspect that people wouldn't
> want newlines suddenly being escaped and turning the xml into a single
> line. Anyone got any idea if the XML spec even talks about low-chars?
> 
> Hen
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-user-help@jakarta.apache.org
> 



