tomcat-dev mailing list archives

From Konstantin Preißer <>
Subject RE: International characters in source files and SVN commit messages (was: RE:r1525975)
Date Wed, 25 Sep 2013 15:36:45 GMT
Hi Mark,

thanks for the reply.

> -----Original Message-----
> From: Mark Thomas []
> Sent: Wednesday, September 25, 2013 5:01 PM

> > One way I can
> > think would be to XML-encode such characters ("ß" as "&#xDF;").
> > However, personally I would rather not do this, but write such
> > characters directly ("ß"), so that the source is better readable (and
> > encodings like UTF-8 guarantee that the characters are interpreted
> > the same on each system, independently from the system language or
> > geographic location).
> I don't like the idea of using XML encoding at all.

Just to avoid a misunderstanding: by "XML encoding", do you mean numeric character references
like &#nnn;?

> > Could it be possible to change SVN Commit E-Mail system so that it
> > may interpret diffs as UTF-8 instead of ISO-8859-1 (assuming all
> > files which contain bytes > 0x7F are encoded as UTF-8)? (Or, that it
> > tries to decode it as UTF-8, and if it fails, decode it as ISO-8859-1
> > ?)
> This is a question for infra. If UTF-8 fails then ISO-8859-1 is going to
> fail as well.

I mean guessing the character encoding by first trying to decode the data as UTF-8 and, if that
fails, assuming the file was encoded as ISO-8859-1/Windows-1252. Some programs seem to use this
approach to decide whether a file is encoded as UTF-8 or as ANSI when it doesn't have BOM bytes.
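
A minimal sketch of this fallback in Python (the function name guess_decode is made up for
illustration; I don't know how the SVN commit e-mail system or the programs mentioned actually
implement it):

```python
def guess_decode(data):
    """Return (text, encoding_name) using a UTF-8-first heuristic."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte 0x00-0xFF to a character, so this cannot fail.
        return data.decode("iso-8859-1"), "iso-8859-1"


print(guess_decode(b"plain ASCII"))
print(guess_decode(bytes([0xE4, 0xF6, 0xFC])))
print(guess_decode("äöü".encode("utf-8")))
```

The first and third calls should report UTF-8; the second should fall back to ISO-8859-1.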

For example, consider a file that contains only ASCII characters (<= 0x7F) stored as one byte
per character. As UTF-8 is ASCII-compatible, you will get the same result whether you interpret
it as UTF-8 or as ISO-8859-1.

However, if you have a file that contains "äöü" (German umlaut characters) as ISO-8859-1
(Bytes: E4 F6 FC), then UTF-8 decoding will fail because the bytes after the one which starts
with 11xxxxxx (binary) don't start with 10xxxxxx; decoding as ISO-8859-1, however, will succeed.
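
The bit patterns for this example can be checked with a short Python sketch (the helper
is_continuation is a hypothetical name, used only for this illustration):

```python
data = bytes([0xE4, 0xF6, 0xFC])  # "äöü" encoded as ISO-8859-1


def is_continuation(byte):
    # UTF-8 continuation bytes have the form 10xxxxxx.
    return (byte & 0xC0) == 0x80


# Binary form of each byte, to compare against the 11xxxxxx / 10xxxxxx patterns.
bits = [format(b, "08b") for b in data]

try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

latin1_text = data.decode("iso-8859-1")

print(bits, [is_continuation(b) for b in data], utf8_ok, latin1_text)
```

E4 is 11100100, a lead byte, but F6 (11110110) is not a continuation byte, so strict UTF-8
decoding is rejected while ISO-8859-1 decoding returns the umlauts.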

This approach to guessing the encoding (UTF-8 vs. ISO-8859-1/Windows-1252) seems to be used by
programs like Notepad++ when opening text files without a BOM, and by TortoiseSVN when displaying
file changes, and it seems to work well if your files are either UTF-8 or ISO-8859-1/Windows-1252
(or other local encodings). Of course, this will not always work, e.g. if a text file
that is encoded with ISO-8859-1/Windows-1252 actually contains text like "ÃŸ", whose bytes also
form a valid UTF-8 sequence. (Personally, for my projects I use UTF-8 for everything :) )
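
To illustrate that limitation, here is a small Python sketch. It assumes the problematic text is
the two-character sequence "Ã" followed by "Ÿ" stored as Windows-1252 (my choice of example;
any single-byte text whose bytes happen to form valid UTF-8 behaves the same way):

```python
# Assumed example: the author really meant the characters U+00C3 and U+0178.
intended = "\u00c3\u0178"

# Windows-1252 stores them as the two bytes C3 9F.
data = intended.encode("cp1252")

# Those bytes are also the valid UTF-8 encoding of U+00DF (sharp s),
# so a UTF-8-first heuristic silently picks the wrong interpretation.
guessed = data.decode("utf-8")

print(data.hex(), repr(intended), repr(guessed))
```

In practice such byte sequences are rare in real single-byte text, which is probably why the
heuristic works well most of the time.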

I was asking because I saw some i18n files like "" that encode non-ASCII
characters with "\uXXXX", and I'd like to know whether it is okay to put characters like "ß"
directly into the XML file, without encoding them as numeric character references, while the
Commit E-Mails don't use UTF-8. If you are okay with this, then I don't mind changing the encoding
for the SVN Commit E-Mails.
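
For comparison, the two ASCII-only notations discussed here can be produced and read back with a
few lines of Python (a sketch for illustration only; the variable names are made up, and the
"\uXXXX" form shown covers only characters in the Basic Multilingual Plane):

```python
import html

ch = "\u00df"  # the sharp s character

# "\uXXXX" escape as used in the i18n files.
java_escape = "\\u{:04X}".format(ord(ch))

# XML numeric character reference in hexadecimal form.
xml_ref = "&#x{:X};".format(ord(ch))

# Decoding each notation back to the original character.
from_java = java_escape.encode("ascii").decode("unicode_escape")
from_xml = html.unescape(xml_ref)

print(java_escape, xml_ref, from_java == ch, from_xml == ch)
```

Both forms survive any ASCII-compatible transport, at the cost of being harder to read in the
source than the character itself.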


