stdcxx-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Stdcxx Wiki] Update of "LocaleLookup" by TravisVitek
Date Wed, 26 Mar 2008 22:03:04 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Stdcxx Wiki" for change notification.

The following page has been changed by TravisVitek:
http://wiki.apache.org/stdcxx/LocaleLookup

------------------------------------------------------------------------------
  
  The objective of this project is to provide an interface to make it easy to write localization
tests without the knowledge of platform-specific details (such as locale names) that provide
sufficient code coverage and that complete in a reasonable amount of time (ideally seconds
as opposed to minutes). The interface must make it easy to query the system for locales that
satisfy the specific requirements of each test. For example, most tests that currently use
all installed locales (e.g., the set of tests for the `std::ctype` facet) only need to exercise
a representative sample of the installed locales without using the same locale more than once.
Thus the interface will need to make it possible to specify such a sample. Another example
is tests that attempt to exercise locales in multibyte encodings whose `MB_CUR_MAX` ranges
from 1 to 6 (some of the `std::codecvt` facet tests). The new interface will need to make
it easy to specify such a set of locales without explicitly naming them, and it will need to retrieve such locales without returning duplicates.
  
+ [[Anchor(UseCases)]]
+ == Use Cases ==
+ 
+ The existing locale tests select locales based on a few different criteria. Below is a list
of locale tests and the criteria used for locale selection within those tests.
+ 
+ || Test || Criteria ||
+ || 22.LOCALE.CODECVT.MT.CPP || *1,+ ||
+ || 22.LOCALE.CODECVT.OUT.CPP || *2 ||
+ || 22.LOCALE.CONS.MT.CPP || *1,+ ||
+ || 22.LOCALE.CTYPE.CPP || *2 ||
+ || 22.LOCALE.CTYPE.IS.CPP || *2 ||
+ || 22.LOCALE.CTYPE.MT.CPP || *1,+ ||
+ || 22.LOCALE.CTYPE.NARROW.CPP || *2 ||
+ || 22.LOCALE.CTYPE.SCAN.CPP || *2 ||
+ || 22.LOCALE.CTYPE.TOLOWER.CPP || *2 ||
+ || 22.LOCALE.CTYPE.TOUPPER.CPP || *2 ||
+ || 22.LOCALE.GLOBALS.MT.CPP || *8,+ ||
+ || 22.LOCALE.MESSAGES.CPP || *7 ||
+ || 22.LOCALE.MONEY.GET.MT.CPP || *1,+ ||
+ || 22.LOCALE.MONEY.PUT.MT.CPP || *1,+ ||
+ || 22.LOCALE.MONEYPUNCT.CPP || *4 ||
+ || 22.LOCALE.MONEYPUNCT.MT.CPP || *1,+ ||
+ || 22.LOCALE.NUM.GET.CPP || *9 ||
+ || 22.LOCALE.NUM.GET.MT.CPP || *1,+ ||
+ || 22.LOCALE.NUM.PUT.CPP || *9 ||
+ || 22.LOCALE.NUM.PUT.MT.CPP || *1,+ ||
+ || 22.LOCALE.NUMPUNCT.MT.CPP || *1,+ ||
+ || 22.LOCALE.STATICS.MT.CPP || *4,+ ||
+ || 22.LOCALE.TIME.GET.CPP || *5,6 ||
+ || 22.LOCALE.TIME.GET.MT.CPP || *1,+ ||
+ || 22.LOCALE.TIME.PUT.MT.CPP || *1,+ ||
+ 
+ * Any locale for which setlocale (LC_ALL, name) will succeed.
+ * Any locale for which setlocale (LC_CTYPE, name) will succeed.
+ * Any locale for which setlocale (LC_NUMERIC, name) will succeed.
+ * All installed locales.
+ * First locale matching a specific name.
+ * First locale matching a regular expression.
+ * First locale that is not an alias for the C/POSIX locale.
+ * Any locale for which setlocale (LC_ALL, name) will succeed, list includes C/POSIX locale.
+ * Any locale for which setlocale (LC_NUMERIC, name) will succeed and decimal_point is not
'.'
+ + Test limits the number of locales tested.
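
The `setlocale()`-based criteria above can be checked with a few lines of standard C++. What follows is only a sketch: `locale_is_usable()` and `has_non_dot_decimal_point()` are hypothetical helper names that are not part of the test driver, and the code assumes that restoring the previous locale by name is good enough for a test program.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: true if setlocale (category, name) succeeds.
// The previously active locale is restored before returning.
static bool locale_is_usable (int category, const char *name)
{
    const char* const cur = std::setlocale (category, 0);
    const std::string saved (cur ? cur : "C");

    const bool ok = 0 != std::setlocale (category, name);

    std::setlocale (category, saved.c_str ());
    return ok;
}

// Hypothetical helper: true if LC_NUMERIC can be set to name and the
// locale's decimal_point is something other than ".".
static bool has_non_dot_decimal_point (const char *name)
{
    const char* const cur = std::setlocale (LC_NUMERIC, 0);
    const std::string saved (cur ? cur : "C");

    bool result = false;
    if (std::setlocale (LC_NUMERIC, name)) {
        const std::lconv* const lc = std::localeconv ();
        result =    lc && lc->decimal_point
                 && std::string (lc->decimal_point) != ".";
    }

    std::setlocale (LC_NUMERIC, saved.c_str ());
    return result;
}
```

A test would call `locale_is_usable (LC_CTYPE, name)` for each candidate name and keep only the names for which it returns true.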
+ 
+ ||<rowstyle="color:red">Note: Most of the MT tests limit the number of locales to
32, so the test failure is not a matter of running against too many locales, it is an issue
of running too many iterations per thread. The 'solution' discussed in this document doesn't
seem to address the actual problem for these tests.||
+ 
  [[Anchor(Definitions)]]
  = Definitions =
  
- '''canonical language code''': The {{{<language>}}} field is two lowercase characters
that represent the language as defined by [#References ISO-639].
+ '''canonical language''': The {{{<language>}}} field is two lowercase characters that
represent the language as defined by [#References ISO-639].
  
- '''canonical country code''': The {{{<COUNTRY>}}} field is two uppercase letters that
represent the country as defined by [#References ISO-3166].
+ '''canonical country''': The {{{<COUNTRY>}}} field is two uppercase letters that represent
the country as defined by [#References ISO-3166].
  
- '''canonical codeset code''': The {{{<CODESET>}}} field is a string describing the
encoding character set. For our purposes, the codeset is the preferred MIME name of the codeset
as defined by [#References IANA].
+ '''canonical codeset''': The {{{<CODESET>}}} field is a string describing the encoding
character set. For our purposes, the codeset is the preferred MIME name of the codeset as
defined by [#References IANA].
- 
- '''canonical locale name''': A complete locale name in the format {{{<language>_<COUNTRY>.<CODESET>}}}.
Each field uses the canonical representation described above. [ex. {{{en_US.ISO-8859-1}}}]
- 
- '''native locale name''': The locale name used by the local operating system. [ex. {{{English_United
States.1252}}}, {{{en}}}]
- 
- '''local locale name''': See native locale name.
  
  [[Anchor(Plan)]]
  = Plan =
  
  This page relates to the issue described in [http://issues.apache.org/jira/browse/STDCXX-608
STDCXX-608]. There has been some discussion both on and off the dev@ list about how to proceed.
This page is here to document what has been discussed.
  
- The plan to meet the [#Objective Objective] is to provide an interface to query the set
of installed locales based on a set of a small number of essential parameters  used by the
localization tests. The interface should make it easy to express conjunction, disjunction,
and negation of the terms (parameters) and support (a perhaps simplified version of) [http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03
Basic Regular Expression] syntax. We've decided to use shell brace expansion as a means of
expressing logical conjunction between terms: a valid brace expression is expanded to obtain
a set of terms implicitly connected by a logical AND. Individual ('\n'-separated) lines of
the query string are taken to be implicitly connected by a logical OR. This approach models
the [http://www.opengroup.org/onlinepubs/009695399/utilities/grep.html grep] interface with
each line loosely corresponding to the argument of the `-e` option to `grep`.
+ The plan to meet the [#Objective Objective] is to provide an interface to query the set
of installed locales based on a set of a small number of essential parameters used by the
localization tests. The interface should make it easy to express conjunction, disjunction,
and negation of the terms (parameters) and support (a perhaps simplified version of) [http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03
Basic Regular Expression] syntax. We've decided to use shell brace expansion as a means of
expressing logical conjunction between terms: a valid brace expression is expanded to obtain
a set of terms implicitly connected by a logical AND. Individual ('\n'-separated) lines of
the query string are taken to be implicitly connected by a logical OR. This approach models
the [http://www.opengroup.org/onlinepubs/009695399/utilities/grep.html grep] interface with
each line loosely corresponding to the argument of the `-e` option to `grep`.
- 
- Given a query string 
- 
- {{{
-   {en,fr,*}_{CA,US,FR,CN}.*
- }}}
- 
- we would apply brace expansion to get the following list of expressions
- 
- {{{
-   en_CA.*
-   en_US.*
-   en_FR.*
-   en_CN.*
-   fr_CA.*
-   fr_US.*
-   fr_FR.*
-   fr_CN.*
-    *_CA.*
-    *_US.*
-    *_FR.*
-    *_CN.*
- }}}
- 
- Once we have this list of expressions, we would enumerate all of the installed locales,
and then search through them looking for locale names that match one of those regular expressions.
The actual matching would be done using rw_fnmatch().
- 
- ||<rowstyle="color:red"> /!\ Need to specify the format of the locale files!||
  
  [[Anchor(Part1)]]
  = Part 1 (STDCXX-714) =
  
  The first thing that we needed was to write the function for doing Basic Regular Expression
name matching and add it to the test suite. Martin has already added an implementation of
[http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/fnmatch.cpp rw_fnmatch](), so that is
done. `rw_fnmatch()` is a simplified implementation of the POSIX [http://www.opengroup.org/onlinepubs/009695399/functions/fnmatch.html
fnmatch] function which supports a simplified and modified form of BRE used in filename globbing.
This is sufficient for what we need in terms of regular expression support.
  
- The second thing that we needed was a function to do brace expansion. After much discussion,
it was decided that the csh brace expansion rules made the most sense. Travis provided an
implementation of two functions for doing brace expansion. The function [http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/braceexp.cpp
rw_brace_expand]() does a simple brace expansion on the input string. There is no special
treatment for whitespace, but escapes are properly handled. The function [http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/braceexp.cpp
rw_shell_expand]() does whitespace tokenization and collapse, and then does brace expansion
on each token, much like the behavior you would see from the csh shell.
+ The second thing that we needed was a function to do brace expansion. After much discussion,
it was decided that the csh brace expansion rules made the most sense. Travis provided an
implementation of a function for doing brace expansion. The function [http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/braceexp.cpp
rw_shell_expand]() does whitespace tokenization and collapse, and then does brace expansion
on each token, much like the behavior you would see from the csh shell.
  
  Just for illustration, consider the following string.
  
  {{{
-    a {1,2} b
+    a-{1,2}-b
  }}}
  
- If you passed this to rw_brace_expand, the result would be
+ If you passed this to `rw_shell_expand()` (with ' ' as the separator), the result would
be
  
  {{{
-    a 1 b a 2 b
+    a-1-b a-2-b
  }}}
- 
- If you passed this to rw_shell_expand, the result would be
- 
- {{{
-    a 1 2 b
- }}}
- 
- In most cases you would want to use `rw_shell_expand()`. '''Perhaps ''rw_brace_expand''
should become an implementation function and the header/source/test should be renamed to shellexp.h/shellexp.cpp/0.shellexp.cpp'''
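
To make the expansion step concrete, here is a minimal sketch of csh-style brace expansion. It is not the `rw_shell_expand()` implementation from braceexp.cpp: `expand_braces()` is a hypothetical name, and the sketch leaves out escapes and whitespace tokenization, both of which the real function handles.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not rw_shell_expand): expands the first brace
// group in s, then recurses until no balanced group remains. Results
// are appended to out in left-to-right order.
static void expand_braces (const std::string &s,
                           std::vector<std::string> &out)
{
    const std::size_t open = s.find ('{');
    if (open == std::string::npos) {
        out.push_back (s);
        return;
    }

    // collect the comma separated alternatives of the first group,
    // keeping nested groups intact for the recursive calls
    std::vector<std::string> alts;
    std::string cur;
    int depth = 0;
    std::size_t i = open;

    for (; i < s.size (); ++i) {
        const char c = s [i];
        if (c == '{') {
            if (depth++ > 0)
                cur += c;
        }
        else if (c == '}') {
            if (--depth == 0) {
                alts.push_back (cur);
                break;
            }
            cur += c;
        }
        else if (c == ',' && depth == 1) {
            alts.push_back (cur);
            cur.clear ();
        }
        else
            cur += c;
    }

    if (i == s.size ()) {
        // unbalanced brace: treat the string literally
        out.push_back (s);
        return;
    }

    const std::string prefix = s.substr (0, open);
    const std::string suffix = s.substr (i + 1);

    for (std::size_t k = 0; k < alts.size (); ++k)
        expand_braces (prefix + alts [k] + suffix, out);
}
```

Calling `expand_braces ("a-{1,2}-b", out)` appends `a-1-b` and `a-2-b` to `out`, which matches the result shown above.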

  
  [[Anchor(Part2)]]
  = Part 2 (STDCXX-715) =
  
  Every platform has a unique list of locales available. For example, Windows systems use {{{English}}}
as a language name, but most *nix systems use the canonical {{{en}}} or in some cases {{{EN}}}.
This problem exists for all fields of the locale name.
  
- To deal with this, we need to provide a mapping between the native names and the canonical
names that we plan to use in the query string. The plan is to provide one file with a list
of all native locale names and the canonical names that they map to for all platforms. For
efficiency, it would be nice that this table include other information that may be useful
such as {{{MB_CUR_LEN}}} for each of those locales.
+ To deal with this, we need to provide a mapping between the native names and the canonical
names that we plan to support in the query string. The plan is to provide these mappings in
data files. We would need at least three different mappings, one each for language, country
and codeset. We would need one additional mapping if we wanted to map from a canonical language
code to a default country code. This would be necessary so that we can map locale names like
{{{russian}}} or {{{ru}}} to an appropriate territory code.
  
+ The format of these files is simple. Here is a grammar:
- I've collected all of the locale data on each of the platforms that are available to me.
During this process, I've noticed a few issues with the name mapping.
- 
- One issue is that a single native locale name may map to a different canonical locale name
on different platforms. For example, {{{es_BO}}} maps to {{{es_BO.ISO-8859-15}}} on AIX, but
it maps to {{{es_BO.ISO-8859-1}}} on Linux and SunOS. Consider that our mapping file would
look something like this...
  
  {{{
-   es_BO.ISO-8859-1     es_BO es_BO.ISO8859-1 es_BO.iso88591
-   es_BO.ISO-8859-15    es_BO es_BO.8859-15 ES_BO
+   native-name-list ::= <native-name> | <native-name> ',' <native-name-list>
| '\n' <ws> <native-name-list>
+   line         ::= '#' <comment> | <canonical-name> <native-name-list>
+   line-list    ::= <line> | <line> '\n' <line-list> 
  }}}
  
+ The grammar is comma delimited, so the strings are not to be quoted. Here is an example
to illustrate.
- If we look up the canonical name {{{es_BO.ISO-8859-1}}} we will see three possible locale
names. If we look through our list of installed locales, we will find {{{es_BO}}}, but it
would be wrong to return that locale because it doesn't actually match on this particular
platform.
- 
- Now we use the above data to figure out canonical name from local name, or vice-versa.
  
  {{{
-   es_BO.8859-15 maps to local name es_BO.ISO-8859-15
-   es_BO         maps to local name es_BO.ISO-8859-15 or es_BO.ISO-8859-1
+   # this is a comment line
+ 
+    # _not_ a comment line
+   # the above maps '_not_ a comment line' to the value '#'
+ 
+   # map 'English' to 'en'
+   en	English
+ 
+   # map 'Albanian', 'alb' and 'sqi' to 'sq'
+   sq    Albanian, alb, sqi
+ 
+   # similar to above, except that mapping is multiline
+   cu    Church Slavic, Old Slavonic, Church Slavonic,
+         Old Bulgarian, Old Church Slavonic, chu
  }}}
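
A reader for this file format could look roughly like the sketch below. It is a reading of the example above, not the planned implementation: `parse_mappings()` is a hypothetical name, a comment is assumed to be a line whose very first character is `#`, and an indented line is assumed to continue the previous entry only when that entry ended with a comma.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// strip leading and trailing blanks
static std::string trim (const std::string &s)
{
    const char* const ws = " \t\r";
    const std::size_t b = s.find_first_not_of (ws);
    if (b == std::string::npos)
        return std::string ();
    return s.substr (b, s.find_last_not_of (ws) - b + 1);
}

// map every comma separated native name in list to canon
static void add_names (const std::string &list, const std::string &canon,
                       std::map<std::string, std::string> &out)
{
    const std::size_t npos = std::string::npos;
    for (std::size_t pos = 0; ; ) {
        const std::size_t comma = list.find (',', pos);
        const std::string name =
            trim (list.substr (pos, comma == npos ? npos : comma - pos));
        if (!name.empty ())
            out [name] = canon;
        if (comma == npos)
            break;
        pos = comma + 1;
    }
}

// Hypothetical helper: returns a native name -> canonical name table.
static std::map<std::string, std::string>
parse_mappings (const std::string &text)
{
    std::map<std::string, std::string> out;
    std::istringstream in (text);
    std::string line, canon;
    bool continued = false;   // did the previous entry end with ','?

    while (std::getline (in, line)) {
        const std::string body = trim (line);
        if (body.empty () || line [0] == '#') {
            continued = false;   // blank line or comment ends an entry
            continue;
        }

        std::string list;
        if (continued && (line [0] == ' ' || line [0] == '\t'))
            list = body;         // continuation of the previous entry
        else {
            const std::size_t sp = body.find_first_of (" \t");
            if (sp == std::string::npos) {
                continued = false;   // canonical name with no natives
                continue;
            }
            canon = body.substr (0, sp);
            list  = body.substr (sp + 1);
        }

        add_names (list, canon, out);
        continued = body [body.size () - 1] == ',';
    }
    return out;
}
```

The table is keyed by native name because the lookup needed when enumerating installed locales goes from the native name to the canonical one.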
- 
- How do we know which {{{es_BO}}} is right for this platform?
- 
- One possible direction here is to ask a locale for its codeset. Unfortunately the returned
string needs to be mapped to a canonical string. i.e. it might return {{{iso88591}}} on one
platform, and {{{ISO-8859-1}}} on another.
- 
- If we need to ask a locale for its codeset and then use an additional mapping to get the
canonical codeset name, then why not just provide lookups for each component of the canonical
locale name and look them up individually?
- 
- We would need at least three different mappings. We would need four if we wanted to map
from a language code to a default territory code. This would be necessary so that we can map
locale names like {{{russian}}} or {{{ru}}} to an appropriate territory code.
- 
- {{{
-   # codeset mappings [one to many]
-   ISO-8859-1    8859-1 ISO8859-1
-   ISO-8859-15   8859-15 ISO8859-15
-   1252          CP-1252 IBM-1252
-   1254          CP-1254 IBM-1254
- 
-   # language mappings [one to many]
-   en	English
-   es    Spanish
-   ab    Abkhazian abk
-   sq    Albanian alb sqi
- 
-   # territory mappings [one to many]
-   US   "United States"
-   DE    Germany  
- 
-   # default territory for language mappings [one to one]
-   ru RU
-   cs CZ
- }}}
- 
- The advantage of this scheme over the previous scheme is that if we encounter a locale that
we don't know, we might be able to get a valid canonical name for it. with the previous scheme,
if we can't find a mapping for the name, then we just use the original name as the canonical
name. If we did this, we would be able to build up a canonical name for it, and that would
increase the chances of being able to use it.
- 
- Another issue is that the data associated with each of the canonical locales, like {{{MB_CUR_LEN}}},
is different on each platform. The {{{ar_DZ.UTF-8}}} locale uses a 6 byte codeset on Linux,
but a 4 byte codeset on other platforms.
- 
- I think the logical solution for this would be to not store the {{{MB_CUR_LEN}}} value in
the file, but capture it and append it to the canonical locale name when we enumerate the
installed locales. See notes in Part3 about {{{MB_CUR_LEN}}}.
  
  [[Anchor(Part3)]]
  = Part 3 (STDCXX-716) =
@@ -155, +132 @@

  The proposed interface to all of this is a single public function named rw_query_locales().
The signature would be...
  
  {{{
-   char* rw_query_locales (const char* query, size_t count);
+   char* rw_query_locales (int loc_cat, const char* query, size_t count);
  }}}
  
- The {{{query}}} parameter will be the query string. The {{{count}}} parameter is the maximum
number of locales to return. This allows you to easily limit the number of locales tested.
+ The {{{loc_cat}}} parameter is the locale category to get locales for, just like `rw_locales()`
does in its current implementation. The {{{query}}} parameter will be the query string. The
{{{count}}} parameter is the maximum number of locales to return. This allows you to easily
limit the number of locales returned and eventually tested.
  
- The expected format of the query string is similar to what is described above, except that
the requested {{{MB_CUR_LEN}}} value will be expected to be part of the query string. The
accepted {{{MB_CUR_LEN}}} value would be seperated from the canonical locale name expression
with a space. An example query string...
+ The proposed grammar used by the query string is similar to what is used for the xfail.txt
{{{config}}} string. It is a shell globbed string that has its terms joined with dashes.
  
  {{{
-    {zh_*.* {5..3},*_FR.* 1}
+   <match> is a shell globbing pattern in the format below. All fields 
+   are required. 
+ 
+   iso-country  ::= ISO-3166 two character country code 
+   iso-language ::= ISO-639-1 or ISO-639-2 two or three character language code 
+   iana-codeset ::= IANA codeset name with '-' replaced or removed 
+ 
+   match        ::= <iso-language-expr> '-' <iso-country-expr> '-' <mb_cur_len-expr>
'-' <iana-codeset-expr>
+   match_list   ::= match | match ' ' match_list 
  }}}
  
- This would match all 5, 4 and 3 byte encodings of the Chinese language in any country, then
all 1 byte encodings for any language in France.
+ So, given a query string 
  
+ {{{
+   *-{CA,US}-1-{ISO-8859-1,UTF-8}
+ }}}
+ 
+ this function would internally apply brace expansion to get the following list of expressions
+ 
+ {{{
+   *-CA-1-ISO-8859-1 *-CA-1-UTF-8 *-US-1-ISO-8859-1 *-US-1-UTF-8
+ }}}
+ 
+ ||<rowstyle="color:red"> /!\ Notice that I have moved the codeset to be the last match
in the query string. That is because the codeset string is allowed to contain dashes. This
was done to avoid accidentally mistaking dashes in the codeset name for dashes
in the grammar.||
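
The benefit of putting the codeset last shows up when a canonical string is split back into its fields: only the first three dashes are separators, and everything after the third belongs to the codeset. The sketch below illustrates this. The `CanonicalName` struct and the `split_canonical()` function are hypothetical names used only for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical record for the four fields of a canonical string.
struct CanonicalName {
    std::string language;
    std::string country;
    std::string mb_cur_len;
    std::string codeset;
};

// Hypothetical helper: splits "<language>-<COUNTRY>-<mb_cur_len>-<CODESET>"
// on the first three dashes. Returns false if fewer than three dashes
// are present, in which case out is left unchanged.
static bool split_canonical (const std::string &s, CanonicalName &out)
{
    const std::size_t npos = std::string::npos;

    const std::size_t d1 = s.find ('-');
    if (d1 == npos)
        return false;
    const std::size_t d2 = s.find ('-', d1 + 1);
    if (d2 == npos)
        return false;
    const std::size_t d3 = s.find ('-', d2 + 1);
    if (d3 == npos)
        return false;

    out.language   = s.substr (0, d1);
    out.country    = s.substr (d1 + 1, d2 - d1 - 1);
    out.mb_cur_len = s.substr (d2 + 1, d3 - d2 - 1);
    out.codeset    = s.substr (d3 + 1);   // may itself contain dashes
    return true;
}
```

With the codeset in any other position, a name such as `ISO-8859-1` could not be told apart from three separate fields.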
+ 
+ After doing the brace expansion, this function will get a list of installed locales and
their canonical representation strings. Then, for each of the brace expanded expressions,
the function will search for locales whose canonical representation matches the expression.
If the name is a match, the native locale name will be appended to a buffer that will be returned
to the user. Logic will exist to prevent the same locale from being accepted for more than
one matching expression.
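
The matching loop described above could be sketched as follows. None of these names come from the test driver: `glob_match()` is a deliberately tiny stand-in for `rw_fnmatch()` that understands only `*` and `?`, and `LocaleEntry` and `select_locales()` are hypothetical. The sketch returns a vector rather than the single buffer that the proposed function would return.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Minimal stand-in for rw_fnmatch(): supports '*' and '?' only.
static bool glob_match (const char *pat, const char *str)
{
    for (; *pat; ++pat, ++str) {
        if (*pat == '*') {
            // try to match the rest of the pattern at every suffix
            for (const char *s = str; ; ++s) {
                if (glob_match (pat + 1, s))
                    return true;
                if (!*s)
                    return false;
            }
        }
        if (!*str || (*pat != '?' && *pat != *str))
            return false;
    }
    return !*str;
}

// Hypothetical record pairing a canonical string with a native name.
struct LocaleEntry {
    std::string canonical;
    std::string native;
};

// Hypothetical helper: for each expanded pattern, in order, collects the
// native names of matching locales. A locale is returned at most once,
// and at most count names are returned.
static std::vector<std::string>
select_locales (const std::vector<LocaleEntry> &installed,
                const std::vector<std::string> &patterns,
                std::size_t count)
{
    std::vector<std::string> result;
    std::set<std::string> seen;

    for (std::size_t p = 0; p < patterns.size (); ++p) {
        for (std::size_t i = 0; i < installed.size (); ++i) {
            if (result.size () >= count)
                return result;

            const LocaleEntry &e = installed [i];
            if (   glob_match (patterns [p].c_str (), e.canonical.c_str ())
                && seen.insert (e.native).second)
                result.push_back (e.native);
        }
    }
    return result;
}
```

Iterating over the patterns in the outer loop keeps the results grouped by expression, so a caller that limits `count` gets locales for the first expressions in the query before any for the later ones.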
+ 
- '''Perhaps we should consider adding an additional parameter to prepend the C/POSIX locales
as there is no way to match them using the canonical locale name matching rules we've laid
out above.'''
+ ||<rowstyle="color:red"> /!\ Perhaps we should consider adding an additional parameter
to prepend the C/POSIX locales as there is no way to match them using the canonical locale
name matching rules we've laid out above.||
+ 
+ The buffer returned by `rw_query_locales()` is owned by that function and is not to be deallocated
by the user. This buffer is currently planned to be left in use at program termination. If
it is deemed necessary, some additional code can be written to clean up the buffer before program
exit, or we could require the user to deallocate the buffer when they are done with it.
  
  [[Anchor(References)]]
  = References =
