axis-java-dev mailing list archives

From Cédric Chabanois <CChaban...@natsystem.fr>
Subject RE: bug #24896 : I don't understand what we are doing in AbstractXMLEncoder
Date Thu, 27 Nov 2003 13:50:39 GMT
I still don't understand why we use UTF-8 or UTF-16 there ...

Concerning what we need to escape, this is described at 
http://www.w3.org/TR/REC-xml#syntax

Valid characters are at http://www.w3.org/TR/REC-xml#charsets

UTF-8 aside, I think we did the right thing.
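
To make the escaping rules concrete, here is a minimal sketch that escapes the markup characters directly on the String, with no byte array in between. The class and method names (XmlEscapeSketch, escape) are hypothetical and this is not the Axis implementation; it only covers the four characters discussed in this thread.

```java
final class XmlEscapeSketch {

    // Hypothetical helper, not the Axis code: replaces the four markup
    // characters discussed in this thread and copies everything else as-is.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c); // includes accented and CJK characters
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & \"c\""));
    }
}
```

Since the result is still a String, the charset only matters later, when the document is actually written to the wire.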

However I think that
"private static final byte[] AMP = "&amp;".getBytes();"
is not valid, because getBytes() with no argument uses the platform's
default charset.

It should probably be "AMP = "&amp;".getBytes("UTF-8");"
for UTF-8 and "AMP = "&amp;".getBytes("UTF-16");" for UTF-16
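
A small sketch of what an explicit charset gives for this constant. The class and field names are made up for illustration, and StandardCharsets (Java 7+) is used instead of charset name strings. One thing to watch: in Java, encoding with the plain "UTF-16" charset writes a big-endian byte-order mark in front of the data, so UTF-16BE (or UTF-16LE) is probably what an encoder appending fragments would want. That last point is my assumption about the use case, not something taken from the Axis code.

```java
import java.nio.charset.StandardCharsets;

final class EntityBytesSketch {

    // Explicit charsets: the result no longer depends on the platform default.
    static final byte[] AMP_UTF8 = "&amp;".getBytes(StandardCharsets.UTF_8);
    static final byte[] AMP_UTF16BE = "&amp;".getBytes(StandardCharsets.UTF_16BE);

    // Plain UTF-16 prepends a byte-order mark (0xFE 0xFF) when encoding.
    static final byte[] AMP_UTF16_WITH_BOM = "&amp;".getBytes(StandardCharsets.UTF_16);

    public static void main(String[] args) {
        System.out.println("UTF-8:    " + AMP_UTF8.length + " bytes");
        System.out.println("UTF-16BE: " + AMP_UTF16BE.length + " bytes");
        System.out.println("UTF-16:   " + AMP_UTF16_WITH_BOM.length + " bytes");
    }
}
```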

Concerning the tests that failed after my patch, I understand why they
failed.

In EncodingTest.testUTF8,
    assertEquals(GERMAN_UMLAUTS,
        new String(encodedUmlauts.getBytes(), XMLEncoderFactory.ENCODING_UTF_8));
should be
    assertEquals(GERMAN_UMLAUTS,
        new String(encodedUmlauts.getBytes(XMLEncoderFactory.ENCODING_UTF_8),
            XMLEncoderFactory.ENCODING_UTF_8));
or (simpler)
    assertEquals(GERMAN_UMLAUTS, encodedUmlauts);

However, it does not test much ...
It just tests that the string given to encoder.encode (which does not need
to be escaped) has not been modified by it.
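
To illustrate why the original assertion in testUTF8 fails on some platforms, here is a minimal sketch of the round trip. The value of GERMAN_UMLAUTS below is an assumed stand-in (I have not copied it from the test), ISO-8859-1 plays the role of a non-UTF-8 platform default, and the roundTrip helper is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripSketch {

    // Assumed stand-in for the constant used in EncodingTest.
    static final String GERMAN_UMLAUTS = "\u00e4\u00f6\u00fc";

    // Encode with one charset, then always decode as UTF-8,
    // which is what the assertion in the test effectively does.
    static String roundTrip(String s, Charset encodeWith) {
        return new String(s.getBytes(encodeWith), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same charset on both sides: the string survives.
        System.out.println(GERMAN_UMLAUTS.equals(
                roundTrip(GERMAN_UMLAUTS, StandardCharsets.UTF_8)));
        // Mismatched charsets: the bytes are not valid UTF-8, so decoding
        // yields replacement characters and the comparison fails.
        System.out.println(GERMAN_UMLAUTS.equals(
                roundTrip(GERMAN_UMLAUTS, StandardCharsets.ISO_8859_1)));
    }
}
```

On a machine whose default charset happens to be UTF-8, the original assertion would pass, which would explain why the problem does not show up everywhere.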

Cédric


> -----Original Message-----
> From: Davanum Srinivas [mailto:dims@yahoo.com]
> Sent: Wednesday, 26 November 2003 17:01
> To: axis-dev@ws.apache.org
> Subject: Re: bug #24896 : I don't understand what we are doing in
> AbstractXMLEncoder
> 
> 
> See http://nagoya.apache.org/bugzilla/show_bug.cgi?id=19327 
> for more info.
> 
> --- Cédric_Chabanois <CChabanois@natsystem.fr> wrote:
> > Hi all,
> > 
> > My correction for bug #24896 worked, i.e. the XML sent is in UTF-8
> > format (before, French accents, Chinese characters ... were not
> > transmitted correctly), but I don't really understand what we are
> > doing in AbstractXMLEncoder and UTF8Encoder:
> > The encode method takes a Java String.
> > This string is converted to a byte array in UTF-8 (using
> > String.getBytes("UTF-8")) and
> > & becomes "&amp;"
> > " becomes "&quot;"
> > < becomes "&lt;"
> > > becomes "&gt;"
> > all other characters are encoded using UTF-8 (appendEncoded method in
> > UTF8Encoder).
> > 
> > Then the bytes are converted back to a string (using the UTF-8 charset
> > since my patch, and using the platform's default charset before my
> > patch: the bytes were not valid for the default charset).
> > 
> > I wonder why we use a UTF-8 byte array there just to reconvert it to a
> > string afterwards, since all we do is convert some characters
> > (& -> &amp; ...).
> > 
> > There is probably something I missed somewhere ...
> > 
> > Cédric
> 
> 
> =====
> Davanum Srinivas - http://webservices.apache.org/~dims/
> 
