tomcat-dev mailing list archives

From Konstantin Preißer <kpreis...@apache.org>
Subject International characters in source files and SVN commit messages (was: RE:r1525975)
Date Wed, 25 Sep 2013 14:52:02 GMT
Hi all,

> -----Original Message-----
> From: kpreisser@apache.org [mailto:kpreisser@apache.org]
> Sent: Tuesday, September 24, 2013 9:11 PM

> --- tomcat/site/trunk/xdocs/whoweare.xml (original)
> +++ tomcat/site/trunk/xdocs/whoweare.xml Tue Sep 24 19:10:44 2013
> @@ -100,6 +100,9 @@ A complete list of all the Apache Commit
>  <p><b>Costin Manolache</b> (costin at apache.org)<br/></p>
>  <!--Your bio goes here-->
> 
> +<p><b>Konstantin Preißer</b> (kpreisser at apache.org)<br/></p>

When editing whoweare.xml, I wrote the "ß" character (sharp s), which is now displayed
as "ÃŸ" in the commit message, because the source XML file is encoded in UTF-8 (the default
encoding for XML files).

As far as I understand, SVN needs to treat changes in text files at the byte level, not at the
character level, to be independent of character encodings. That is why, for example, ".patch" files
don't have a character encoding: they describe changes at the byte level.

However, when the commit e-mail is sent, the bytes need to be converted to characters, and
it seems the SVN commit diff is interpreted as ISO-8859-1 (or Windows-1252). Therefore, the
UTF-8 bytes 0xC3 0x9F are displayed as "ÃŸ" instead of "ß".
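
To illustrate what I think is happening, here is a small Python sketch (not the actual mailer code, just the same two bytes decoded under different charsets). The variable names are my own. Under Windows-1252 the byte 0x9F maps to "Ÿ", while under strict ISO-8859-1 it is an invisible control character:

```python
# Illustration only: encode "ß" as UTF-8, then decode the bytes with other charsets.
utf8_bytes = "ß".encode("utf-8")

as_utf8 = utf8_bytes.decode("utf-8")            # the intended character
as_cp1252 = utf8_bytes.decode("windows-1252")   # two characters: "Ã" and "Ÿ"
as_latin1 = utf8_bytes.decode("iso-8859-1")     # "Ã" plus the control character U+009F

print([hex(b) for b in utf8_bytes])
print(as_utf8)
print(as_cp1252)
print(repr(as_latin1))
```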

What would be the preferred way to handle such issues? One way I can think of would be to XML-encode
such characters ("ß" as "&#xDF;"). However, personally I would rather not do this and instead
write such characters directly ("ß"), so that the source is more readable (and encodings
like UTF-8 guarantee that the characters are interpreted the same way on every system, independently
of the system language or geographic location).
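
For reference, the XML-encoding alternative could be sketched like this in Python. The helper name to_char_refs is hypothetical, and it only rewrites non-ASCII characters (it does not escape "&" or "<"). An XML parser reads both spellings as the same text; only the readability of the source file differs:

```python
import xml.etree.ElementTree as ET


def to_char_refs(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal XML character reference."""
    return "".join(c if ord(c) < 0x80 else f"&#x{ord(c):X};" for c in text)


# The escaped form is pure ASCII, so no charset guess can garble it.
escaped = to_char_refs("Konstantin Preißer")

# Parsing the escaped markup yields the original character again.
parsed = ET.fromstring(f"<b>{escaped}</b>").text

print(escaped)
print(parsed)
```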

Would it be possible to change the SVN commit e-mail system so that it interprets diffs as
UTF-8 instead of ISO-8859-1 (assuming all files which contain bytes > 0x7F are encoded
as UTF-8)? (Or so that it tries to decode the diff as UTF-8 and, if that fails, decodes it as
ISO-8859-1?)
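
The fallback idea could look roughly like the following Python sketch. I don't know how the commit mailer is actually implemented, so decode_diff is a hypothetical function and not part of any SVN tool. It relies on the fact that ISO-8859-1 text containing bytes > 0x7F is usually not valid UTF-8, so the strict UTF-8 attempt fails and the fallback is used:

```python
def decode_diff(data: bytes) -> str:
    """Decode diff bytes as UTF-8, falling back to ISO-8859-1 if they are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this step cannot fail.
        return data.decode("iso-8859-1")


from_utf8_file = decode_diff(b"Prei\xc3\x9fer")   # "ß" stored as UTF-8
from_latin1_file = decode_diff(b"Prei\xdfer")     # "ß" stored as ISO-8859-1

print(from_utf8_file)
print(from_latin1_file)
```

The heuristic is not perfect: a byte sequence that happens to be valid UTF-8 would be decoded as UTF-8 even if the file was really ISO-8859-1.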

For example, when I use TortoiseSVN to view the unified diff of r1525975, it prints the
"ß" character, so it seems to interpret the diff as UTF-8.

Can you give me a hint?

Thanks!

Kind regards,
Konstantin Preißer


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org

