From: Konstantin Preißer
To: "'Tomcat Developers List'"
Subject: RE: International characters in source files and SVN commit messages (was: RE:r1525975)
Date: Wed, 25 Sep 2013 17:36:45 +0200

Hi Mark,

thanks for the reply.

> -----Original Message-----
> From: Mark Thomas [mailto:markt@apache.org]
> Sent: Wednesday, September 25, 2013 5:01 PM
>
> > One way I can
> > think would be to XML-encode such characters ("ß" as "&#223;").
> > However, personally I would rather not do this, but write such
> > characters directly ("ß"), so that the source is better readable (and
> > encodings like UTF-8 guarantee that the characters are interpreted
> > the same on each system, independently from the system language or
> > geographic location).
>
> I don't like the idea of using XML encoding at all.

Just to avoid a misunderstanding: by "XML encoding", do you mean numeric
character references like &#nnn;?

> > Could it be possible to change SVN Commit E-Mail system so that it
> > may interpret diffs as UTF-8 instead of ISO-8859-1 (assuming all
> > files which contain bytes > 0x7F are encoded as UTF-8)? (Or, that it
> > tries to decode it as UTF-8, and if it fails, decode it as ISO-8859-1
> > ?)
>
> This is a question for infra. If UTF-8 fails then ISO-8859-1 is going to
> fail as well.

I mean guessing the character encoding by first decoding the file as
UTF-8 and, if that fails, assuming it was encoded as
ISO-8859-1/Windows-1252. Some programs seem to use this approach to
decide whether a file without BOM bytes was encoded as UTF-8 or as ANSI.

For example, consider a file that contains only ASCII characters
(<= 0x7F), stored as one byte per character. As UTF-8 is
ASCII-compatible, you will get the same result whether you interpret it
as UTF-8 or as ISO-8859-1.

However, if you have a file that contains "äöü" (German umlaut
characters) encoded as ISO-8859-1 (bytes: E4 F6 FC), then UTF-8 decoding
will fail, because the bytes following a lead byte that starts with
11xxxxxx (binary) don't start with 10xxxxxx; decoding as ISO-8859-1,
however, will succeed.

This approach to guessing the encoding (UTF-8 vs.
ISO-8859-1/Windows-1252) seems to be used by programs like Notepad++
when opening text files without a BOM, and by TortoiseSVN when
displaying file changes, and it seems to work well if your files are
encoded as either UTF-8 or ISO-8859-1/Windows-1252 (or another local
encoding). Of course, this will not always work, e.g. if a text file
encoded as ISO-8859-1/Windows-1252 actually contains text like "ÃŸ".
(Personally, I use UTF-8 for everything in my projects :) )

I was asking because I saw some i18n files like
"LocalStrings_ja.properties" that encode non-ASCII characters with
"\uXXXX", and I'd like to know if it is okay to put a "ß" character in
the XML file without encoding it as a numeric character reference, while
the Commit E-Mails don't use UTF-8. If you are okay with this, then I
don't mind changing the encoding for the SVN Commit E-Mails.

Thanks!

Konstantin
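
For illustration, here is a minimal Java sketch of the "try UTF-8 first, then fall back to ISO-8859-1" guess described above. The class and method names (`EncodingGuess`, `decodeWithFallback`) are hypothetical; this is not how the SVN commit mailer, Notepad++ or TortoiseSVN actually implement it, only one way the idea could be expressed with the standard `java.nio.charset` API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for illustration only.
public class EncodingGuess {

    /**
     * Decodes the bytes as strict UTF-8. If the input is not well-formed
     * UTF-8, falls back to ISO-8859-1, which accepts every byte value.
     */
    public static String decodeWithFallback(byte[] data) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
            return new String(data, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // "äöü" stored as ISO-8859-1: not valid UTF-8, so the fallback is used.
        byte[] latin1 = {(byte) 0xE4, (byte) 0xF6, (byte) 0xFC};
        // "äöü" stored as UTF-8 (written with \u escapes so this source
        // file itself stays pure ASCII).
        byte[] utf8 = "\u00e4\u00f6\u00fc".getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeWithFallback(latin1).equals(decodeWithFallback(utf8)));
    }
}
```

The limitation mentioned above still applies: the two bytes C3 9F are valid UTF-8, so a Windows-1252 file that really contains "ÃŸ" would be decoded as "ß".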