commons-dev mailing list archives

From Henri Yandell <flame...@gmail.com>
Subject Re: [LANG] Wanted - spec lawyer.
Date Tue, 07 Jul 2009 07:37:51 GMT
On Tue, Jun 30, 2009 at 7:16 AM, John Bollinger<thinman42@yahoo.com> wrote:
>
>
>
> Jörg Schaible wrote:
>> As pointed out http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets and
>> http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets define the valid
>> characters for XML 1.0 and 1.1.
>>
>> However, the escape functionality is actually different. If you transport
>> XML (or HTML) in a UTF-8 encoded text file or one encoded by ASCII-7 is a
>> big difference. In the former you don't have to encode anything, while you
>> have to encode anything above 0x7f in the latter case. And this applies to
>> XML, HTML or Java source files at equal level.
>>
>> The character set definition of the two XML versions is a vertical condition
>> set. An attempt to encode a character outside the XML definition is
>> actually a situation that cannot be handled and should raise an exception
>> (like every XML parser will do anyway).
>>
>> Therefore the question is, whether (Un)EscapeUtils should actually be an
>> instance initialized with the target character encoding. And that raises
>> the question how close we're actually at reimplementing
>> java.nio.Charset.encode.
>
> As I understand it, the basic idea of StringEscapeUtils.escapeXml() is to
> convert arbitrary character data from memory (a String) into a character
> sequence that has the same meaning when it appears literally in XML
> character data.  This is a conversion from character data to character
> data, so character encoding is not directly relevant for this use (and
> this is a fundamental difference from Charset.encode()).  The characters
> that must be escaped for this purpose are well defined by the XML
> specifications.
>
> The appearance of an encoding attribute in the xml declaration
> notwithstanding, the character encoding of an XML document is a
> property of a representation of the document, not a property of the
> document itself.  There is therefore a *separate*, albeit related,
> consideration of escaping characters that cannot be expressed in a
> particular character encoding, so as to be able to encode the document to
> a byte sequence without data loss. This is a useful thing to do, and it is
> compatible with the main objective, but I think it would be well to avoid
> conflating the two as an indivisible task.  They can be performed in one
> pass by one method, but they are logically distinct behaviors.
>
> If StringEscapeUtils wants to support the second use, then it needs a way
> for the user to tell it which additional characters to escape.  One
> possibility would be to pass it a Charset which the user intends to apply
> (later) to encode the characters.  StringEscapeUtils could then escape
> those input characters for which Charset.canEncode() returns false.
>
> Yet another separate question has arisen as to how to handle input
> characters which cannot appear in any way in a well formed XML (1.0 / 1.1)
> document, even as character references (e.g. U+0000).  I'm not so certain
> that StringEscapeUtils needs to be concerned about that, and it would
> simplify things immensely if it considered that out of scope.  Among other
> effects, I believe that would moot the distinction between XML 1.0 and
> XML 1.1 (and future versions) for this class.  In addition, I strongly
> suspect that there are multiple production applications that (mis)use XML
> in a way that would be broken if character references to characters
> outside the XML character set were flagged as application errors; it would
> be considerate for StringEscapeUtils to be compatible with such (mis)use.
>

Thanks Jörg and John.

I agree with John: I don't think charsets are a blocker here, or
something that needs to be an API feature. It's a use case to be aware
of, though, and it might explain the differing requests.
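
For reference, the charset-driven escaping John describes could be
sketched roughly as below. This is only an illustration, not anything in
Commons Lang: the class and method names are hypothetical, and it relies
on CharsetEncoder.canEncode(CharSequence) because the per-character check
lives on the encoder (Charset.canEncode() takes no argument and only
reports whether the charset supports encoding at all).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch: escape only what the target charset cannot represent.
final class CharsetAwareEscaper {
    private CharsetAwareEscaper() {}

    /**
     * Replaces every code point that cannot be encoded in the target
     * charset with a decimal numeric character reference.
     * Assumes the charset supports encoding (newEncoder() throws otherwise).
     */
    static String escapeUnencodable(String input, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            int len = Character.charCount(cp);
            // Test the whole code point so surrogate pairs are kept together.
            CharSequence piece = input.subSequence(i, i + len);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += len;
        }
        return out.toString();
    }
}
```

With US-ASCII as the target this behaves like the current "escape
anything above 0x7f" rule; with UTF-8 it leaves well-formed text alone.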

The general aim, I think, should be to get a default behaviour that
blends spec-right with what-I-expect-right. The framework approach then
makes it easy for users who would prefer different behaviour to put
their own methods together. We could even include the exception
throwing as its own translator without making it a default.

So... starting with XML.

The simplest claim is that the following should be escaped:

& < > ' "

Currently we do that, and we also escape anything above 0x7f. We don't
escape any control characters (for example, those under 0x20).

Is there any reason to complicate things further? Can we keep
escapeXml on the expected 5 characters, and let users who want to
escape more add in more translators or write their own?
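
As a minimal sketch of what "only the expected 5 characters" would mean
in practice (the class name is hypothetical and this is not the Commons
Lang implementation, just the behaviour proposed above):

```java
// Hypothetical sketch of the proposed default: escape & < > ' " and nothing else.
final class FiveCharXmlEscaper {
    private FiveCharXmlEscaper() {}

    /** Escapes the five XML special characters; every other char passes through unchanged. */
    static String escapeXml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '\'': out.append("&apos;"); break;
                case '"':  out.append("&quot;"); break;
                default:   out.append(c);        // includes > 0x7f and control chars
            }
        }
        return out.toString();
    }
}
```

Under this sketch, characters above 0x7f and control characters are left
as they are; escaping those would be the job of additional translators.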

On the subject of more translators....

There's the existing > 0x7f NumericEntityEscaper. I suspect it might
be worth defining that as a 'constant'. It also seems worth defining a
NumericEntityEscaper (or some such) for the values less than 0x20
except the special ones of newline etc.
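
A rough sketch of the two escapers as predefined 'constants' might look
like the following. The names are hypothetical, and treating tab, line
feed and carriage return as "the special ones" is my assumption. Note
also that references such as &#1; are only legal in XML 1.1, not XML 1.0.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch: numeric escaping driven by a code point predicate.
final class NumericEscapeSketch {
    private NumericEscapeSketch() {}

    /** Matches everything above 7-bit ASCII. */
    static final IntPredicate ABOVE_ASCII = cp -> cp > 0x7f;

    /** Matches control characters below 0x20, except tab, LF and CR (assumed exceptions). */
    static final IntPredicate CONTROL_EXCEPT_WHITESPACE =
            cp -> cp < 0x20 && cp != 0x09 && cp != 0x0A && cp != 0x0D;

    /** Replaces each matching code point with a decimal numeric character reference. */
    static String escape(String input, IntPredicate shouldEscape) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (shouldEscape.test(cp)) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return out.toString();
    }
}
```

Because each escaper is just a predicate here, users could chain calls or
combine predicates with IntPredicate.or() to build their own variant.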

Are there particular exception-throwing translators that would be worth
defining? An ExceptionTranslator that you can put on the front of the
chain to error if any illegal chars are found? Maybe a few of these to
match the various bits of BNF in the spec?
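
To make the idea concrete, here is a sketch of one such check based on
the Char production of XML 1.0 (#x9 | #xA | #xD | [#x20-#xD7FF] |
[#xE000-#xFFFD] | [#x10000-#x10FFFF]). The class and method names are
hypothetical, and the choice of IllegalArgumentException is only an
assumption about what such a translator might throw.

```java
// Hypothetical sketch: fail fast on characters outside the XML 1.0 Char production.
final class Xml10CharChecker {
    private Xml10CharChecker() {}

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x09 || cp == 0x0A || cp == 0x0D
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input unchanged, or throws if any code point is not a legal XML 1.0 char. */
    static String requireXml10Chars(String input) {
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!isXml10Char(cp)) {
                // Unpaired surrogates also land here, since they fall in 0xD800-0xDFFF.
                throw new IllegalArgumentException(
                        "Illegal XML 1.0 character U+" + Integer.toHexString(cp).toUpperCase()
                        + " at index " + i);
            }
            i += Character.charCount(cp);
        }
        return input;
    }
}
```

Placed at the front of a chain, a check like this would reject U+0000
before any escaping happens, while staying opt-in rather than the default.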

Hen


