apr-dev mailing list archives

From Роман Донченко <DXDra...@yandex.ru>
Subject Misbehaviour of apr_os_locale_encoding on Windows
Date Mon, 12 Apr 2010 14:48:39 GMT

On Windows, apr_os_locale_encoding returns the code page of the default  
user locale (to be precise, it uses the current thread locale, but that  
starts as the default user locale), and, well, I have a problem with that.

The problem is that this code page is essentially meaningless. See [1] for
a discussion of what the various default locales mean, and note that the
code page used by non-Unicode applications is the one from the default
*system* locale. That is the code page I think apr_os_locale_encoding
should return. Why? Well, consider this example (Unicode-capable reader
required).

Let's set our user locale to English (Canada) (code page = 1252), our  
system locale to Russian (code page = 1251), and try to use Subversion:

F:\Temp>svnadmin create testrepo

F:\Temp>svn co file:///F:/Temp/testrepo testwc
Checked out revision 0.

F:\Temp>echo. > testwc/test.txt

F:\Temp>svn add testwc\test.txt
A         testwc\test.txt

F:\Temp>svn ci testwc -m "В лесу родилась ёлочка."
Adding         testwc\test.txt
Transmitting file data .
Committed revision 1.

F:\Temp>svn log testwc\test.txt
r1 | ?iiai | 2010-04-12 17:58:02 +0400 (Mon, 12 Apr 2010) | 1 line

A eano ?iaeeanu ?ei?ea.

What happened here? My log message was initially passed to svn in CP1251,
because that's the code page of the system locale. svn, however,
interpreted it as CP1252, which led it to believe that the message was
actually "Â ëåñó ðîäèëàñü ¸ëî÷êà.". This is obviously broken. svn then
converted the message to CP866, the console output code page, which is
normally the right course of action, but here it additionally obfuscated
the message by dropping the accents and replacing characters that CP866
cannot represent with "?". The username was mangled in the same way.
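
The first step of that mangling can be reproduced without Windows at all.
The following is only a minimal sketch, not code from APR or Subversion:
the helper names (cp1251_to_unicode, cp1252_to_unicode, put_utf8,
decode_to_utf8) and the LOG_MESSAGE_CP1251 constant are made up for
illustration, and the two decoders cover only the byte ranges this one
message needs. It decodes the same bytes once as CP1251 and once as CP1252
and emits UTF-8 either way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The log message from the example above, as CP1251 bytes (hypothetical constant). */
const char LOG_MESSAGE_CP1251[] =
    "\xC2 \xEB\xE5\xF1\xF3 \xF0\xEE\xE4\xE8\xEB\xE0\xF1\xFC \xB8\xEB\xEE\xF7\xEA\xE0.";

/* Partial CP1251 decoder: ASCII, the 0xC0..0xFF Cyrillic run, and the two "yo" letters. */
uint32_t cp1251_to_unicode(unsigned char b)
{
    if (b < 0x80) return b;
    if (b >= 0xC0) return 0x0410u + (b - 0xC0u);  /* U+0410..U+044F */
    if (b == 0xA8) return 0x0401u;                /* capital yo */
    if (b == 0xB8) return 0x0451u;                /* small yo */
    return 0xFFFDu;                               /* not covered by this sketch */
}

/* Partial CP1252 decoder: bytes 0xA0..0xFF map to the same Unicode code points. */
uint32_t cp1252_to_unicode(unsigned char b)
{
    if (b < 0x80 || b >= 0xA0) return b;
    return 0xFFFDu;                               /* 0x80..0x9F not covered here */
}

/* Write one code point below U+10000 as UTF-8; returns the number of bytes written. */
size_t put_utf8(uint32_t cp, char *out)
{
    if (cp < 0x80) { out[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

typedef uint32_t (*decoder_fn)(unsigned char);

/* Decode the NUL-terminated byte string `in` with `decode` and store the result
 * as NUL-terminated UTF-8 in `out`, which must hold 3 * strlen(in) + 1 bytes. */
void decode_to_utf8(const char *in, decoder_fn decode, char *out)
{
    size_t n = 0;
    const unsigned char *p;

    assert(in != NULL && out != NULL);
    for (p = (const unsigned char *)in; *p != '\0'; ++p)
        n += put_utf8(decode(*p), out + n);
    out[n] = '\0';
}
```

Calling decode_to_utf8(LOG_MESSAGE_CP1251, cp1251_to_unicode, buf) yields the
original Russian text, while passing cp1252_to_unicode instead yields the
accented Latin string quoted above. The bytes never change; only the code
page used to interpret them does.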

Now, I cheated a little, because Subversion doesn't actually use
apr_os_locale_encoding in this instance, but its internal mechanism for
determining the code page is the same, and I believe it showcases the
undesired behaviour well. apr_os_locale_encoding needs to return an
encoding that can be used to interoperate with the OS and other
applications, and that's the system locale's code page.

The proposed fix is trivial:

Index: misc/win32/charset.c
--- misc/win32/charset.c	(revision 933252)
+++ misc/win32/charset.c	(working copy)
@@ -30,11 +30,7 @@
 #ifdef _UNICODE
     int i;
 #endif
-#if defined(_WIN32_WCE)
-    LCID locale = GetUserDefaultLCID();
-#else
-    LCID locale = GetThreadLocale();
-#endif
+    LCID locale = GetSystemDefaultLCID();
     int len = GetLocaleInfo(locale, LOCALE_IDEFAULTANSICODEPAGE, NULL, 0);
     char *cp = apr_palloc(pool, (len * sizeof(TCHAR)) + 2);
     if (0 < GetLocaleInfo(locale, LOCALE_IDEFAULTANSICODEPAGE, (TCHAR*) (cp + 2), len))


[1] http://blogs.msdn.com/michkap/archive/2005/02/01/364707.aspx
