stdcxx-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Stdcxx Wiki] Update of "LocaleLookup" by TravisVitek
Date Tue, 11 Mar 2008 01:06:26 GMT

The following page has been changed by TravisVitek:
http://wiki.apache.org/stdcxx/LocaleLookup

------------------------------------------------------------------------------
  
  Once we have this list of expressions, we would enumerate all of the installed locales,
and then search through them looking for locale names that match one of those regular expressions.
The actual matching would be done using rw_fnmatch().
  
+ [[Anchor(Part1)]]
+ = Part 1 (STDCXX-714) =
+ 
+ The first thing that we needed was to write the function for doing name matching and add
it to the test suite. Martin has already added an implementation of [http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/fnmatch.cpp
rw_fnmatch](), so that is done.
+ 
+ The second thing that we needed was a function to do brace expansion. After much discussion,
it was decided that the csh brace expansion rules made the most sense. Travis provided an
implementation of two functions for doing brace expansion. The function [http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/braceexp.cpp
rw_brace_expand]() does a simple brace expansion on the input string. There is no special
treatment for whitespace, but escapes are properly handled. The function [http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/braceexp.cpp
rw_shell_expand]() does whitespace tokenization and collapse, and then does brace expansion
on each token, much like the behavior you would see from the csh shell.
+ 
+ Just for illustration, consider the following string.
+ 
+ {{{
+    a {1,2} b
+ }}}
+ 
+ If you passed this to rw_brace_expand, the result would be
+ 
+ {{{
+    a 1 b a 2 b
+ }}}
+ 
+ If you passed this to rw_shell_expand, the result would be
+ 
+ {{{
+    a 1 2 b
+ }}}
+ 
+ In most cases you would want to use rw_shell_expand(). '''Perhaps ''rw_brace_expand'' should
become an implementation function and the header/source/test should be renamed to shellexp.h/shellexp.cpp/0.shellexp.cpp'''
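
To make the difference between the two behaviors concrete, here is a minimal sketch. It is not the code in braceexp.cpp: the names {{{brace_expand}}} and {{{shell_expand}}} are hypothetical, the results come back as vectors rather than a single string, and escapes are not handled at all.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Expand the first {x,y,...} group in `s`, then recurse on each result.
// Simplifications (assumptions of this sketch): no escape handling, and a
// string with an unbalanced '{' is returned unchanged.
std::vector<std::string> brace_expand(const std::string& s)
{
    const std::size_t open = s.find('{');
    if (open == std::string::npos)
        return std::vector<std::string>(1, s);

    // Collect the comma-separated alternatives of the outermost group.
    std::vector<std::string> alts;
    std::string cur;
    int depth = 0;
    std::size_t close = std::string::npos;
    for (std::size_t i = open; i < s.size(); ++i) {
        const char c = s[i];
        if (c == '{') {
            if (depth++ > 0) cur += c;      // keep nested braces for later
        } else if (c == '}') {
            if (--depth == 0) { alts.push_back(cur); close = i; break; }
            cur += c;
        } else if (c == ',' && depth == 1) {
            alts.push_back(cur);            // separator of the outer group
            cur.clear();
        } else {
            cur += c;
        }
    }
    if (close == std::string::npos)
        return std::vector<std::string>(1, s);

    const std::string prefix = s.substr(0, open);
    const std::string suffix = s.substr(close + 1);
    std::vector<std::string> out;
    for (const std::string& alt : alts)
        for (const std::string& e : brace_expand(prefix + alt + suffix))
            out.push_back(e);
    return out;
}

// Split on whitespace first, then brace-expand every token on its own.
std::vector<std::string> shell_expand(const std::string& s)
{
    std::vector<std::string> out;
    std::istringstream in(s);
    std::string token;
    while (in >> token)
        for (const std::string& e : brace_expand(token))
            out.push_back(e);
    return out;
}
```

With the input {{{a {1,2} b}}}, the first function treats the whole string as one word and yields two results, while the second yields the four tokens {{{a}}}, {{{1}}}, {{{2}}} and {{{b}}}.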

+ 
+ [[Anchor(Part2)]]
+ = Part 2 (STDCXX-715) =
+ 
+ Every platform has its own list of available locales. For example, Windows systems use {{{English}}}
as a language name, but most *nix systems use the canonical {{{en}}} or in some cases {{{EN}}}.
This problem exists for all fields of the locale name.
+ 
+ To deal with this, we need to provide a mapping between the native names and the canonical
names that we plan to use in the query string. The plan is to provide one file with a list
of all native locale names and the canonical names that they map to for all platforms. For
efficiency, it would be nice if this table included other useful information,
such as {{{MB_CUR_LEN}}} (the maximum multibyte character length, which C exposes as {{{MB_CUR_MAX}}}), for each of those locales.
+ 
+ I've collected all of the locale data on each of the platforms that are available to me.
During this process, I've noticed a few issues with the name mapping.
+ 
+ One issue is that a single native locale name may map to a different canonical locale name
on different platforms. For example, {{{es_BO}}} maps to {{{es_BO.ISO-8859-15}}} on AIX, but
it maps to {{{es_BO.ISO-8859-1}}} on Linux and SunOS. Consider that our mapping file would
look something like this...
+ 
+ {{{
+   es_BO.ISO-8859-1     es_BO es_BO.ISO8859-1 es_BO.iso88591
+   es_BO.ISO-8859-15    es_BO es_BO.8859-15 ES_BO
+ }}}
+ 
+ If we look up the canonical name {{{es_BO.ISO-8859-1}}} we will see three possible native locale
names. If we then look through the list of installed locales on AIX, we will find {{{es_BO}}}, but it
would be wrong to return that locale, because on AIX {{{es_BO}}} actually uses ISO-8859-15, not
ISO-8859-1.
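
The ambiguity can be reproduced with a small sketch. The file format is the one shown above (canonical name first, native names after it). The names {{{parse_mapping}}} and {{{candidates}}}, and the list of installed locales used to exercise them, are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// canonical name -> native names listed for it (hypothetical in-memory form)
typedef std::map<std::string, std::vector<std::string> > LocaleMap;

// Parse lines of the form "canonical native1 native2 ...".
LocaleMap parse_mapping(const std::string& text)
{
    LocaleMap m;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string canonical, native;
        if (!(fields >> canonical))
            continue;                       // skip blank lines
        while (fields >> native)
            m[canonical].push_back(native);
    }
    return m;
}

// Return the installed names that the mapping lists under `canonical`.
// This is the naive lookup: it cannot tell which codeset a bare name
// such as "es_BO" really uses on the current platform.
std::vector<std::string> candidates(const LocaleMap& m,
                                    const std::string& canonical,
                                    const std::vector<std::string>& installed)
{
    std::vector<std::string> out;
    const LocaleMap::const_iterator it = m.find(canonical);
    if (it == m.end())
        return out;
    for (const std::string& name : it->second)
        if (std::find(installed.begin(), installed.end(), name) != installed.end())
            out.push_back(name);
    return out;
}
```

If the installed list contains only {{{es_BO}}} and {{{es_BO.8859-15}}}, a lookup of {{{es_BO.ISO-8859-1}}} still returns {{{es_BO}}}, which is exactly the false match described above.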
+ 
+ So one solution for this might be to get the codeset name and store it in the mapping. This
assumes that it is safe to request a locale with an explicit codeset even though the list
of installed locales didn't specify the codeset.
+ 
+ Another issue is that the data associated with each of the canonical locales, like {{{MB_CUR_LEN}}},
can differ from platform to platform. The {{{ar_DZ.UTF-8}}} locale uses a 6-byte codeset on Linux,
but a 4-byte codeset on other platforms.
+ 
+ I think the solution for this would be to not store the MB_CUR_LEN value in the file, but
capture it and append it to the canonical locale name when we enumerate the installed locales.
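
A rough sketch of the capture step follows. It assumes that the value the text calls MB_CUR_LEN is what the standard C macro {{{MB_CUR_MAX}}} reports for the current {{{LC_CTYPE}}} locale. The function name {{{annotate_locale}}} is hypothetical, and for brevity the value is appended to whatever name is passed in rather than to a mapped canonical name.

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <sstream>
#include <string>

// Return "<name>.<MB_CUR_MAX>" for a locale that can be set, or an empty
// string if setlocale() rejects the name. The previous LC_CTYPE locale is
// restored before returning.
std::string annotate_locale(const std::string& name)
{
    // Copy the current setting: the buffer returned by setlocale() may be
    // overwritten by the next call.
    const char* const cur = std::setlocale(LC_CTYPE, 0);
    const std::string saved = cur ? cur : "C";

    std::string result;
    if (std::setlocale(LC_CTYPE, name.c_str())) {
        std::ostringstream out;
        out << name << '.' << MB_CUR_MAX;   // value for the locale just set
        result = out.str();
    }

    std::setlocale(LC_CTYPE, saved.c_str());
    return result;
}
```

For the {{{C}}} locale this produces {{{C.1}}}. Because the value is read at enumeration time, nothing platform-specific has to live in the mapping file.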
+ 
+ [[Anchor(Part3)]]
+ = Part 3 (STDCXX-716) =
+ 
+ The proposed interface to all of this is a single public function named rw_query_locales().
The signature would be...
+ 
+ {{{
+   char* rw_query_locales(const char* query, size_t count);
+ }}}
+ 
+ The {{{query}}} parameter will be the query string. The {{{count}}} parameter is the maximum
number of locales to return. This allows you to easily limit the number of locales tested.
+ 
+ The expected format of the query string is similar to what is described above, except that
the requested MB_CUR_LEN value will be expected to be part of the query string. The accepted
MB_CUR_LEN value would be separated from the canonical locale name expression with a period.
An example query string...
+ 
+ {{{
+    "zh_*.*.{5..3} *_FR.*.1"
+ }}}
+ 
+ This would match all 5, 4 and 3 byte encodings of the Chinese language in any country, then
all 1 byte encodings for any language spoken in France.
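
The matching step could look roughly like the sketch below. It starts from the patterns I would expect once the query above has been shell-expanded ({{{zh_*.*.5}}}, {{{zh_*.*.4}}}, {{{zh_*.*.3}}}, {{{*_FR.*.1}}}), and it uses POSIX {{{fnmatch()}}} as a stand-in for rw_fnmatch(). The name {{{query_locales}}}, its vector-based signature, and the sample locale names are all hypothetical; this is not the proposed rw_query_locales().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fnmatch.h>    // POSIX fnmatch(), used here in place of rw_fnmatch()
#include <string>
#include <vector>

// Return at most `count` annotated locale names ("lang_CC.codeset.N") that
// match one of the patterns. Results are grouped in pattern order, so names
// matching an earlier pattern come first, and no name is listed twice.
std::vector<std::string>
query_locales(const std::vector<std::string>& patterns,
              const std::vector<std::string>& annotated,
              std::size_t count)
{
    std::vector<std::string> out;
    for (const std::string& pat : patterns) {
        for (const std::string& name : annotated) {
            if (out.size() == count)
                return out;                 // limit reached
            const bool seen =
                std::find(out.begin(), out.end(), name) != out.end();
            if (!seen && fnmatch(pat.c_str(), name.c_str(), 0) == 0)
                out.push_back(name);
        }
    }
    return out;
}
```

Keeping the results in pattern order preserves the "Chinese first, then France" ordering described above, and {{{count}}} simply cuts the list short.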
+ 
+ '''Perhaps we should consider adding an additional parameter to prepend the C/POSIX locales
as there is no way to match them using the canonical locale name matching rules we've laid
out above.'''
  
  [[Anchor(References)]]
  = References =
