Mailing-List: contact axis-dev-help@ws.apache.org; run by ezmlm
Precedence: bulk
Reply-To: axis-dev@ws.apache.org
Message-ID: <64510FFDEBCAD511B8CB00065B055DE3093B34@galaxy.natsys.fr>
From: Cédric Chabanois <CChabanois@natsystem.fr>
To: "'axis-dev@ws.apache.org'" <axis-dev@ws.apache.org>
Subject: RE: bug #24896 : I don't understand what we are doing in AbstractXMLEncoder
Date: Thu, 27 Nov 2003 14:50:39 +0100
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

I still don't understand why we use UTF-8 or UTF-16 there ...

Concerning what we need to escape, this is described at
http://www.w3.org/TR/REC-xml#syntax

Valid characters are at http://www.w3.org/TR/REC-xml#charsets

UTF-8 aside, I think we did the right thing.

However, I think that
"private static final byte[] AMP = "&amp;".getBytes();"
is not valid: getBytes() with no argument uses the platform's default
charset.

It should probably be "AMP = "&amp;".getBytes("UTF-8");"
for UTF-8 and "AMP = "&amp;".getBytes("UTF-16");" for UTF-16.
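
A minimal sketch of the difference, assuming a modern JDK
(java.nio.charset.StandardCharsets is used here for brevity; on older JDKs
the equivalent is getBytes("UTF-8")). The class and field names are
hypothetical and are not the ones used in Axis. It also shows one caveat:
in Java, encoding with "UTF-16" prepends a byte-order mark, so a constant
that is spliced into a larger buffer may need UTF-16BE or UTF-16LE instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the Axis AbstractXMLEncoder code.
final class AmpBytes {
    // Explicit charset: the bytes no longer depend on the platform default.
    static final byte[] AMP_UTF8 = "&amp;".getBytes(StandardCharsets.UTF_8);

    // UTF-16BE produces exactly two bytes per character, with no BOM.
    static final byte[] AMP_UTF16BE = "&amp;".getBytes(StandardCharsets.UTF_16BE);

    // Plain "UTF-16" writes a big-endian byte-order mark first.
    static final byte[] AMP_UTF16_WITH_BOM = "&amp;".getBytes(StandardCharsets.UTF_16);

    public static void main(String[] args) {
        System.out.println(AMP_UTF8.length);            // 5
        System.out.println(AMP_UTF16BE.length);         // 10
        System.out.println(AMP_UTF16_WITH_BOM.length);  // 12 (2-byte BOM + 10)
    }
}
```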

Concerning the tests that failed after my patch, I understand why they
failed.

In EncodingTest.testUTF8
assertEquals(GERMAN_UMLAUTS, new String(encodedUmlauts.getBytes(),
XMLEncoderFactory.ENCODING_UTF_8));
should be
assertEquals(GERMAN_UMLAUTS, new
String(encodedUmlauts.getBytes(XMLEncoderFactory.ENCODING_UTF_8),
XMLEncoderFactory.ENCODING_UTF_8));
or (simpler)
assertEquals(GERMAN_UMLAUTS, encodedUmlauts);

However, it does not test much ...
It just tests that the string given to encoder.encode (which does not
need to be escaped) has not been modified by it.
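
The failure in testUTF8 can be reproduced in isolation. This is only a
sketch: the sample string and the class name are hypothetical (the real
GERMAN_UMLAUTS constant lives in EncodingTest), and ISO-8859-1 stands in for
a platform default charset that is not UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the charset mismatch in the old assertion.
final class RoundTripDemo {
    // Sample text with umlauts (a-umlaut, o-umlaut, u-umlaut).
    static final String UMLAUTS = "\u00e4\u00f6\u00fc";

    // Encode with one charset, then decode with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // Same charset on both sides: the string survives.
        System.out.println(UMLAUTS.equals(
            roundTrip(UMLAUTS, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));      // true

        // Bytes produced with a non-UTF-8 charset but decoded as UTF-8:
        // the umlauts are lost, which is what the old assertion ran into.
        System.out.println(UMLAUTS.equals(
            roundTrip(UMLAUTS, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8))); // false
    }
}
```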

Cédric


> -----Original Message-----
> From: Davanum Srinivas [mailto:dims@yahoo.com]
> Sent: Wednesday, 26 November 2003 17:01
> To: axis-dev@ws.apache.org
> Subject: Re: bug #24896 : I don't understand what we are doing in
> AbstractXMLEncoder
>
>
> See http://nagoya.apache.org/bugzilla/show_bug.cgi?id=19327
> for more info.
>
> --- Cédric Chabanois <CChabanois@natsystem.fr> wrote:
> > Hi all,
> >
> > My correction for bug #24896 worked, i.e. the XML sent is now in UTF-8
> > format (before, French accents, Chinese characters ... were not
> > transmitted correctly), but I don't really understand what we are
> > doing in AbstractXMLEncoder and UTF8Encoder:
> > The encode method takes a Java String.
> > This string is converted to a byte array in UTF-8 (using
> > String.getBytes("UTF-8")) and
> > & becomes "&amp;"
> > " becomes "&quot;"
> > < becomes "&lt;"
> > > becomes "&gt;"
> > All other characters are encoded using UTF-8 (appendEncoded method in
> > UTF8Encoder).
> >
> > Then the bytes are converted back to a string (using the UTF-8 charset
> > since my patch, and using the platform's default charset before my
> > patch: the bytes were not valid for the default charset).
> >
> > I wonder why we use a UTF-8 byte array there just to convert it back
> > to a string afterwards, since all we do is convert some characters
> > (& -> &amp; ...).
> >
> > There is probably something I missed somewhere ...
> >
> > Cédric
>
>
> =====
> Davanum Srinivas - http://webservices.apache.org/~dims/
>
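
For reference, the question raised in the quoted message (why go through a
UTF-8 byte array at all when only four characters are replaced) can be
illustrated by escaping directly on the String. This is only a sketch under
that assumption: the class and method names are hypothetical, it is not the
Axis implementation, and it does not check for characters that are invalid
in XML.

```java
// Hypothetical illustration; not the Axis AbstractXMLEncoder API.
final class XmlEscapeSketch {
    // Replaces the four characters listed in the quoted message and
    // copies every other character through unchanged.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & \"c\"")); // a &lt; b &amp; &quot;c&quot;
    }
}
```

With this approach the charset only matters at the point where the resulting
string is finally written out as bytes.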