From: Apache Wiki
To: commits@stdcxx.apache.org
Reply-To: dev@stdcxx.apache.org
Date: Wed, 06 Feb 2008 21:06:30 -0000
Message-ID: <20080206210630.7767.21454@eos.apache.org>
Subject: [Stdcxx Wiki] Update of "LocaleLookup" by TravisVitek

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Stdcxx Wiki" for change notification.

The following page has been changed by TravisVitek:
http://wiki.apache.org/stdcxx/LocaleLookup

------------------------------------------------------------------------------
  [[Anchor(Definitions)]]
  = Definitions =

- canonical language code: The <language> field is two lowercase characters that represent the language as defined by [#References ISO-639].
+ '''canonical language code''': The {{{<language>}}} field is two lowercase characters that represent the language as defined by [#References ISO-639].

- canonical country code: The <country> field is two uppercase letters that represent the country as defined by [#References ISO-3166].
+ '''canonical country code''': The {{{<country>}}} field is two uppercase letters that represent the country as defined by [#References ISO-3166].

- canonical codeset code: The <codeset> field is a string describing the encoding character set. For our purposes, the codeset is the preferred MIME name of the codeset as defined by [#References IANA].
+ '''canonical codeset code''': The {{{<codeset>}}} field is a string describing the encoding character set. For our purposes, the codeset is the preferred MIME name of the codeset as defined by [#References IANA].

- canonical locale name: A complete locale name in the format <language>_<country>.<codeset>. Each field uses the canonical representation described above. [ex. en_US.ISO-8859-1]
+ '''canonical locale name''': A complete locale name in the format {{{<language>_<country>.<codeset>}}}. Each field uses the canonical representation described above. [ex. {{{en_US.ISO-8859-1}}}]

- native locale name: The locale name used by the local operating system. [ex. English_United States.1252, en]
+ '''native locale name''': The locale name used by the local operating system. [ex. {{{English_United States.1252}}}, {{{en}}}]

- locale locale name: See native locale name.
+ '''locale locale name''': See native locale name.

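As an illustration of the definitions above, the following minimal C++ sketch splits a canonical locale name into its three fields and rejects anything that does not have the shape of two lowercase letters, an underscore, two uppercase letters, a dot, and a non-empty codeset (as in en_US.ISO-8859-1). The names CanonicalLocale, parse_canonical_locale and check_examples are hypothetical and not part of stdcxx. The sketch checks only the shape of the name; it does not verify the fields against ISO-639, ISO-3166 or the IANA registry.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper type: the three fields of a canonical locale name.
struct CanonicalLocale {
    std::string language;  // two lowercase letters, e.g. "en"
    std::string country;   // two uppercase letters, e.g. "US"
    std::string codeset;   // preferred MIME name, e.g. "ISO-8859-1"
};

// Locale-independent ASCII checks (std::islower/isupper depend on the C locale).
static bool is_ascii_lower(char c) { return c >= 'a' && c <= 'z'; }
static bool is_ascii_upper(char c) { return c >= 'A' && c <= 'Z'; }

// Returns true and fills `out` if `name` has the shape ll_CC.codeset.
// Assumption: only the shape is checked, not membership in any standard.
bool parse_canonical_locale(const std::string &name, CanonicalLocale &out)
{
    // Shortest valid form is "ll_CC.x": 2 + 1 + 2 + 1 + at least 1 character.
    if (name.size() < 7 || name[2] != '_' || name[5] != '.')
        return false;
    if (!is_ascii_lower(name[0]) || !is_ascii_lower(name[1]))
        return false;
    if (!is_ascii_upper(name[3]) || !is_ascii_upper(name[4]))
        return false;

    out.language = name.substr(0, 2);
    out.country  = name.substr(3, 2);
    out.codeset  = name.substr(6);
    return true;
}

// Usage sketch: the canonical example parses, the native examples do not.
inline void check_examples()
{
    CanonicalLocale loc;
    assert(parse_canonical_locale("en_US.ISO-8859-1", loc));
    assert(loc.codeset == "ISO-8859-1");
    assert(!parse_canonical_locale("English_United States.1252", loc));
    assert(!parse_canonical_locale("en", loc));
}
```

Native names that fail this check are the ones that would need a mapping to a canonical name before they can take part in a lookup.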
  [[Anchor(Plan)]]
  = Plan =
@@ -29, +29 @@

  Given a query string

+ {{{
  {en,fr,*}_{CA,US,FR,CN}.*
+ }}}

  we would apply brace expansion to get the following list of expressions

+ {{{
  en_CA.*
  en_US.*
  en_FR.*
@@ -45, +48 @@

  *_US.*
  *_FR.*
  *_CN.*
+ }}}

  Once we have this list of expressions, we would enumerate all of the installed locales, and then search through them looking for locale names that match one of those patterns. The actual matching would be done using rw_fnmatch().

- Every platform has a unique list of locales available. For example, Windows systems use 'English' as a language name, but most *nix systems use the canonical 'en' or in some cases 'EN'. This problem exists for the language, country and codeset fields of the locale name. To deal with this, we need to provide a mapping between the native names and the canonical names that we plan to use in the query string. It has been suggested that the mapping give a list of all known native locale names for each canonical locale name. The current suggestion is to provide one table with a list of all native locale names and the canonical names for all platforms. For efficiency, it was decided that this table include other information that may be useful such as MB_CUR_MAX for each of those locales.
+ Every platform has a unique list of locales available. For example, Windows systems use {{{English}}} as a language name, but most *nix systems use the canonical {{{en}}} or in some cases {{{EN}}}. This problem exists for the language, country and codeset fields of the locale name. To deal with this, we need to provide a mapping between the native names and the canonical names that we plan to use in the query string. It has been suggested that the mapping give a list of all known native locale names for each canonical locale name. The current suggestion is to provide one table with a list of all native locale names and the canonical names for all platforms. For efficiency, it was decided that this table include other information that may be useful such as {{{MB_CUR_MAX}}} for each of those locales.

  When we enumerate the list of installed locales we would use this data to map the locally installed locale name to the canonical locale name. For lookup purposes we use the canonical name, and once we've found a match, we provide the native locale name back to the user.

  [[Anchor(Issues)]]
  = Issues =

- Now that I'm collecting the list of installed locales to build up this table, I've noticed a few issues with the name mapping. One issue is that a single native locale name may map to a different canonical locale name on different platforms. For example, `es_BO' maps to `es_BO.ISO-8859-15' on AIX, but it maps to `es_BO.ISO-8859-1' on Linux and SunOS. Another issue is that the data associated with each of the canonical locales, like MB_CUR_MAX, is different on each platform. The ar_DZ.UTF-8 locale uses a 6-byte codeset on Linux, but a 4-byte codeset on other platforms.
+ Now that I'm collecting the list of installed locales to build up this table, I've noticed a few issues with the name mapping. One issue is that a single native locale name may map to a different canonical locale name on different platforms. For example, {{{es_BO}}} maps to {{{es_BO.ISO-8859-15}}} on AIX, but it maps to {{{es_BO.ISO-8859-1}}} on Linux and SunOS. Another issue is that the data associated with each of the canonical locales, like {{{MB_CUR_MAX}}}, is different on each platform. The {{{ar_DZ.UTF-8}}} locale uses a 6-byte codeset on Linux, but a 4-byte codeset on other platforms.

  Options...

- I can provide one database per platform that includes all of the locale information for that platform. I could write a utility to create this file for each platform. I could even opt to use this file as the list of installed locales instead of checking the output of `locale -a'. The disadvantage is that the data would have to be verified or completed manually to handle mapping native locale names like 'czech' to a canonical name. Maybe we could skip these? If so, then maybe we could generate this file on the fly before running any tests.
+ I can provide one database per platform that includes all of the locale information for that platform. I could write a utility to create this file for each platform. I could even opt to use this file as the list of installed locales instead of checking the output of {{{locale -a}}}. The disadvantage is that the data would have to be verified or completed manually to handle mapping native locale names like {{{czech}}} to a canonical name. Maybe we could skip these? If so, then maybe we could generate this file on the fly before running any tests.

- Another option would be to have a separate mapping for each of the locale name components. That makes it possible to map from 'English' to 'en' or from 'iso88591' to 'ISO-8859-1' so I can build up the complete canonical locale name from the canonical forms of its components. The disadvantage with this is that I may have trouble mapping from locale names like 'czech' to a single canonical name. Maybe I should skip these?
+ Another option would be to have a separate mapping for each of the locale name components. That makes it possible to map from {{{English}}} to {{{en}}} or from {{{iso88591}}} to {{{ISO-8859-1}}} so I can build up the complete canonical locale name from the canonical forms of its components. The disadvantage with this is that I may have trouble mapping from locale names like {{{czech}}} to a single canonical name. Maybe I should skip these?

  [[Anchor(References)]]
  = References =