Mailing-List: contact dev-help@tomcat.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Tomcat Developers List" <dev@tomcat.apache.org>
From: Konstantin Preißer <kpreisser@apache.org>
To: "'Tomcat Developers List'" <dev@tomcat.apache.org>
References: <000c01ceb9fe$cd29bf20$677d3d60$@apache.org>
 <5242FAC1.8010107@apache.org>
In-Reply-To: <5242FAC1.8010107@apache.org>
Subject: RE: International characters in source files and SVN commit messages
 (was: RE:r1525975)
Date: Wed, 25 Sep 2013 17:36:45 +0200
Message-ID: <000d01ceba05$0c4786a0$24d693e0$@apache.org>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Thread-Index: AQJisy93KmLU7GHW+umQz0TFJSkieALc8+LkmJe0iiA=
Content-Language: de

Hi Mark,

thanks for the reply.

> -----Original Message-----
> From: Mark Thomas [mailto:markt@apache.org]
> Sent: Wednesday, September 25, 2013 5:01 PM

> > One way I can
> > think would be to XML-encode such characters ("ß" as "&#xDF;").
> > However, personally I would rather not do this, but write such
> > characters directly ("ß"), so that the source is better readable (and
> > encodings like UTF-8 guarantee that the characters are interpreted
> > the same on each system, independently from the system language or
> > geographic location).
>
> I don't like the idea of using XML encoding at all.

Just to avoid a misunderstanding: by "XML encoding", do you mean numeric
character references like &#nnn;?


> > Could it be possible to change SVN Commit E-Mail system so that it
> > may interpret diffs as UTF-8 instead of ISO-8859-1 (assuming all
> > files which contain bytes > 0x7F are encoded as UTF-8)? (Or, that it
> > tries to decode it as UTF-8, and if it fails, decode it as ISO-8859-1
> > ?)
>
> This is a question for infra. If UTF-8 fails then ISO-8859-1 is going to
> fail as well.

I mean guessing the character encoding by first trying to decode the file
as UTF-8 and, if that fails, assuming it was encoded as
ISO-8859-1/Windows-1252. Some programs seem to use this approach to decide
whether a file without BOM bytes was encoded as UTF-8 or as "ANSI".
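
As a rough sketch of what I mean (Python is used only for illustration;
the function name decode_with_fallback is made up, and I don't know how
the commit mailer is actually implemented):

```python
def decode_with_fallback(data: bytes) -> tuple:
    """Return (text, encoding_used) for the given raw bytes.

    Hypothetical helper: try strict UTF-8 first and fall back to
    Windows-1252 only if the bytes are not valid UTF-8.
    """
    try:
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Windows-1252 leaves a few byte values undefined (e.g. 0x81),
        # so replace those instead of failing a second time.
        return data.decode("cp1252", errors="replace"), "cp1252"
```

Pure ASCII input and valid UTF-8 input both take the first branch; only
input that is not valid UTF-8 reaches the fallback.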

For example, consider a file that contains only ASCII characters (bytes
<= 0x7F), stored as one byte per character. As UTF-8 is ASCII-compatible,
you get the same result whether you interpret it as UTF-8 or as
ISO-8859-1.

However, if a file contains "äöü" (German umlaut characters) encoded as
ISO-8859-1 (bytes: E4 F6 FC), then UTF-8 decoding will fail: a byte that
starts with 11xxxxxx (binary) begins a multi-byte sequence and must be
followed by bytes that start with 10xxxxxx, which is not the case here.
Decoding as ISO-8859-1 will succeed.
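
The byte patterns can be checked with a few lines of Python (only an
illustration; is_continuation is a made-up helper name):

```python
data = bytes([0xE4, 0xF6, 0xFC])  # "äöü" encoded as ISO-8859-1


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (UTF-8 continuation byte)."""
    return (byte & 0xC0) == 0x80


# Binary form of each byte: all three start with 11, none with 10.
bits = [format(b, "08b") for b in data]

try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE4 announces a three-byte sequence, but 0xF6 is not a continuation.
    utf8_ok = False

# ISO-8859-1 maps every byte value to a character, so this cannot fail.
latin1_text = data.decode("iso-8859-1")
```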

This way of guessing the encoding (UTF-8 vs. ISO-8859-1/Windows-1252)
seems to be used by programs like Notepad++ when opening text files
without a BOM, and by TortoiseSVN when displaying file changes. It seems
to work well if your files are either UTF-8 or ISO-8859-1/Windows-1252
(or another local encoding). Of course, it will not always work, e.g. if
a text file that is encoded as ISO-8859-1/Windows-1252 actually contains
text like "ÃŸ", because those bytes are also valid UTF-8. (Personally,
for my projects I use UTF-8 for everything :) )
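
To illustrate that failure case (again only a Python sketch; Windows-1252
is used here because "Ÿ" has no code point in strict ISO-8859-1):

```python
original = "\u00c3\u0178"        # the two characters "Ã" and "Ÿ"
raw = original.encode("cp1252")  # stored on disk as the bytes C3 9F

# The same two bytes happen to be a valid UTF-8 sequence, so a
# "try UTF-8 first" heuristic picks UTF-8 and shows a different character.
guessed = raw.decode("utf-8")    # a single "ß" (U+00DF)
```

So the heuristic silently turns the two-character text into one "ß"; no
decoding error is raised that could trigger the fallback.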


I was asking because I saw some i18n files like
"LocalStrings_ja.properties" that encode non-ASCII characters as
"\uXXXX", and I'd like to know whether it is okay to put a character like
"ß" directly into the XML file, without encoding it as a numeric
character reference, while the Commit E-Mails don't use UTF-8. If you are
okay with this, then I don't mind changing the encoding for the SVN
Commit E-Mails.
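
For comparison, these are the representations of "ß" discussed in this
thread, computed in Python purely as an illustration (the variable names
are made up):

```python
ch = "\u00df"  # the character "ß"

# Escape form as used in .properties files: backslash, "u", four hex digits.
properties_escape = "\\u{:04X}".format(ord(ch))

# XML numeric character reference (hexadecimal form).
xml_reference = "&#x{:X};".format(ord(ch))

# Raw bytes when the character is written directly into the file.
utf8_bytes = ch.encode("utf-8")         # two bytes: C3 9F
latin1_bytes = ch.encode("iso-8859-1")  # one byte: DF
```

The first two forms are pure ASCII, so they survive any of the encodings
mentioned above; the direct form depends on the file (and the mail) being
read with the right encoding.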

Thanks!

Konstantin

