tomcat-dev mailing list archives

From Konstantin Preißer
Subject RE: International characters in source files and SVN commit messages (was: RE:r1525975)
Date Thu, 26 Sep 2013 22:59:03 GMT
Hi Konstantin,

> -----Original Message-----
> From: Konstantin Kolinko
> Sent: Friday, September 27, 2013 12:30 AM
> To: Tomcat Developers List
> Subject: Re: International characters in source files and SVN commit
> messages (was: RE:r1525975)
> Regarding whoweare.xml file,  you need to add explicit encoding to the
> top of the file (like it is done in
> tc7.0.x/trunk/webapps/docs/changelog.xml).  Without that I consider
> those files as ISO-8859-1, like the rest of our sources.

Note that for XML files, if the "encoding" attribute in the XML declaration is
missing, the encoding is determined by the file's BOM bytes.
If it has none, then the encoding is "UTF-8" [1]. So XML files which have
neither an "encoding" attribute nor BOM bytes are UTF-8.
As such, "whoweare.xml" is already in UTF-8 (but personally I prefer to
explicitly declare the UTF-8 encoding in XML files).
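
A minimal sketch of that rule in Python, just to illustrate the order of the
checks. The function name guess_xml_encoding is made up for this mail, and it
is a simplification (it ignores UTF-32 and declarations that are themselves
written in UTF-16), not what any real XML parser does internally:

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its "encoding" attribute, if one is present.
_DECL = re.compile(
    rb'<\?xml\s[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    # A UTF-16 BOM (little- or big-endian) settles the question.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16"
    # Skip a UTF-8 BOM so the declaration can still be inspected.
    body = data[3:] if data.startswith(b"\xef\xbb\xbf") else data
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii")
    # No explicit encoding and no UTF-16 BOM: the default is UTF-8.
    return "UTF-8"

print(guess_xml_encoding(b'<?xml version="1.0"?><document/>'))
print(guess_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><document/>'))
```

The first call falls through to the UTF-8 default, the second one reports the
declared ISO-8859-1.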

> In the past there were several cases when accented characters in
> Tomcat's changelog files were corrupted during editing (due to a
> conversion done in someone's editor). It was seen in commit message.
> Last time it happened two or three years ago.
> As of now, several xml files in Tomcat (those changelogs) are
> officially UTF-8, and I am OK with people using accented characters
> for new text there until something breaks.
> (Personally, I will probably still use numeric entities, as I do not
> have those characters on my keyboard.)
> AFAIK, TortoiseSVN diff viewer has some logic to autodetect the use of
> UTF-8.

Yes, I guess this is "if it doesn't have a BOM, try to decode as UTF-8; if
it fails, decode as ansi/iso-8859-1" which I mentioned in another mail.
E.g., when a diff contains the text "aÃŸa" in ISO-8859-1 (bytes which are
also valid UTF-8), it will display it as "aßa" (UTF-8), but when it contains
"aÃŸaßa" in ISO-8859-1 (bytes which are not valid UTF-8), then it displays
that text as it is. This also seems to be used e.g. by Notepad++.
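
To make the heuristic concrete, here is a rough Python sketch of "BOM, else
try UTF-8, else fall back to ISO-8859-1". This is only my guess at the logic;
I have not looked at what TortoiseSVN or Notepad++ actually do, and the name
decode_guess is hypothetical:

```python
from typing import Tuple

UTF8_BOM = b"\xef\xbb\xbf"

def decode_guess(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding label) using a BOM / UTF-8 / ISO-8859-1 guess."""
    # A UTF-8 BOM is taken as an explicit marker; drop it before decoding.
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):].decode("utf-8", errors="replace"), "utf-8"
    try:
        # Strict decoding fails on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return data.decode("iso-8859-1"), "iso-8859-1"

print(decode_guess(b"a\xc3\x9fa"))          # ('aßa', 'utf-8')
print(decode_guess(b"a\xc3\x9fa\xdfa")[1])  # iso-8859-1
```

In the second call the lone byte 0xDF is followed by "a", which is not a valid
UTF-8 continuation byte, so the whole text is treated as ISO-8859-1.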

I think such logic could also be used by the commit mailer to decide whether
the text is UTF-8 or ISO-8859-1 for better readability, but I have no strong
preference for it.

Kind regards,
Konstantin Preißer

