Mailing-List: contact stdcxx-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: stdcxx-dev@incubator.apache.org
Message-ID: <477C4030.4030507@roguewave.com>
Date: Wed, 02 Jan 2008 18:53:52 -0700
From: Martin Sebor <sebor@roguewave.com>
Organization: Rogue Wave Software, Inc.
User-Agent: Thunderbird 2.0.0.9 (X11/20071115)
MIME-Version: 1.0
To: stdcxx-dev@incubator.apache.org
Subject: Re: low hanging fruit while cleaning up test failures
References: <47321834.8050203@roguewave.com> <13637191.post@talk.nabble.com>
 <47323B7D.9030104@roguewave.com> <13659097.post@talk.nabble.com>
 <47348DC2.8070703@roguewave.com> <14457668.post@talk.nabble.com>
 <14561904.post@talk.nabble.com> <477BC336.20004@roguewave.com>
 <14584842.post@talk.nabble.com> <477BF27B.5060308@roguewave.com>
 <14585914.post@talk.nabble.com> <477C04D7.7090002@roguewave.com>
 <477C3958.5010405@roguewave.com>
In-Reply-To: <477C3958.5010405@roguewave.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: Martin Sebor <msebor@gmail.com>

Martin Sebor wrote:
> [forwarding back to the list]
> 
> Travis Vitek wrote:
>  > Martin,
>  >
>  > Not all supported platforms have the GB18030 encoding [HP, Compaq and
>  > IRIX don't], and of those that do, they use different names [gb18030 vs
>  > GB18030]. Same with UTF-8 [utf8, UTF8 or UTF-8].
> 
> I realize that. That's the reason why I mentioned the (currently quite
> inefficient) find_mb_locale() in 22.locale.codecvt.out.cpp: it looks
> for any multibyte locale with MB_CUR_MAX of some value.

[Warning: completely unbaked brainstorming follows...]

Btw., it might help to think of this not in terms of locales or code
sets or the names of such things but rather in terms of the parameters
that they have in common or that distinguish them, and that our tests
are trying to exploit. The names are just handles, or shortcuts, that
let us refer to these sets of attributes. Maybe the right interface
is some kind of SQL-like language that lets us refer to the attributes
individually or in Boolean expressions.

Each codeset has three fundamental attributes:

   <code_set_name>
   <mb_cur_max>
   <mb_cur_min>

Since code_set_name is platform-specific, we could add another called
<std_name> for the standard name assigned by IANA:
   http://www.iana.org/assignments/character-sets

There might be others (e.g., the highest wchar_t value).

So maybe the ideal interface we're looking for is simply a function
that takes a query, like this:

   const char* find_codeset(const char *query);

with query being an expression that lets us refer to the attributes.
For instance, find_codeset("std_name == GB18030") would return the
platform-specific name of the codeset corresponding to GB18030, and
find_codeset("mb_cur_max >= 2") would return a string containing the
platform-specific names of all codesets for which MB_CUR_MAX >= 2 holds.
Names would be able to use fnmatch()-style patterns.
The query string could contain printf-like directives to make it
easy to compose expressions parametrized on run-time values.
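
To make the idea a bit more concrete, here is a rough sketch of what
a minimal evaluator might look like. It is only an illustration under
a few assumptions: the Codeset struct, compare_values(), and the table
argument are all made up for this example, the function returns
a std::string rather than the const char* proposed above, and a query
is limited to a single "<attribute> <operator> <value>" comparison.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Attributes of one codeset, as listed above. All names here are
// hypothetical; nothing like this exists in the test driver yet.
struct Codeset {
    std::string name;       // platform-specific name
    std::string std_name;   // standard (IANA) name
    int         mb_cur_max;
    int         mb_cur_min;
};

// Compare two values using the operator spelled out in op.
template <class T>
bool compare_values (const T &lhs, const std::string &op, const T &rhs)
{
    if (op == "==") return lhs == rhs;
    if (op == "!=") return lhs != rhs;
    if (op == ">=") return lhs >= rhs;
    if (op == "<=") return lhs <= rhs;
    if (op == ">")  return lhs >  rhs;
    if (op == "<")  return lhs <  rhs;
    return false;   // unknown operator: no match
}

// Evaluate a query of the form "<attribute> <operator> <value>" against
// each entry of table and return the space-separated platform-specific
// names of the entries that match. Boolean expressions, fnmatch()-style
// patterns, and printf-like directives are left out of this sketch.
std::string
find_codeset (const std::vector<Codeset> &table, const std::string &query)
{
    std::istringstream in (query);
    std::string attr, op, value;
    if (!(in >> attr >> op >> value))
        return std::string ();   // malformed query

    std::string result;
    for (std::size_t i = 0; i != table.size (); ++i) {
        const Codeset &cs = table [i];
        bool match = false;

        if (attr == "std_name")
            match = compare_values (cs.std_name, op, value);
        else if (attr == "mb_cur_max")
            match = compare_values (cs.mb_cur_max, op,
                                    std::atoi (value.c_str ()));
        else if (attr == "mb_cur_min")
            match = compare_values (cs.mb_cur_min, op,
                                    std::atoi (value.c_str ()));

        if (match) {
            if (!result.empty ())
                result += ' ';
            result += cs.name;
        }
    }
    return result;
}
```

Given a table with an entry whose name is "gb18030" and whose std_name
is "GB18030", find_codeset(table, "std_name == GB18030") would return
"gb18030". Filling in the table for each platform is the part this
sketch doesn't address.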

Would this interface make the locale tests easier to write? Would
the expression parser/evaluator be fairly easy to implement?

Martin

> 
>  > On top of that, Windows
>  > uses code page numbers instead of encodings. Now I can easily convert
>  > names to uppercase and strip out non-alphanumeric characters. I might
>  > even be able to do something with Windows. That isn't the problem.
>  >
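
For reference, the name normalization described above (uppercase,
then drop anything that isn't alphanumeric) could be sketched roughly
as follows. The function name is made up for this example and isn't
part of the test driver; it also assumes the names are plain ASCII.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: reduce a codeset name to a canonical form so
// that "utf8", "UTF8", and "UTF-8" all compare equal.
std::string normalize_codeset_name (const std::string &name)
{
    std::string result;

    for (std::string::size_type i = 0; i != name.size (); ++i) {
        // convert to unsigned char first: passing a negative char
        // to the <cctype> functions is undefined
        const unsigned char c = static_cast<unsigned char>(name [i]);

        if (std::isalnum (c))
            result += static_cast<char>(std::toupper (c));
    }

    return result;
}
```

This only takes care of the spelling differences. It doesn't help on
platforms where the encoding isn't installed at all, or with Windows
code page numbers.
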
>  > The problem is that I don't seem to have a clear understanding of what
>  > exactly you want from this. The original proposal was to be able to
>  > filter locales by name or encoding, like so...
>  >
>  >     char* locales = rw_locales (_RWSTD_LC_ALL, "en_US de", 0, true);
>  >     // would retrieve C en_US and all German locales
>  >
>  >     char* locales = rw_locales (_RWSTD_LC_ALL, 0, "UTF-8", true);
>  >     // would retrieve all UTF-8 locales [those that end in .utf8,
>  >     // .UTF-8 or .UTF8]
>  >
>  > So I wrote that. Unfortunately, as mentioned above, it has limitations.
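
To illustrate the kind of name filtering being discussed, here is
a small sketch that uses the POSIX fnmatch() directly (I'm assuming
rw_fnmatch() could be dropped in where fnmatch() isn't available, but
I haven't checked its signature against this). The function name
filter_by_pattern() is made up for the example.

```cpp
#include <fnmatch.h>   // POSIX only
#include <string>
#include <vector>

// Hypothetical helper: return the subset of names that match the
// fnmatch()-style pattern, preserving their order.
std::vector<std::string>
filter_by_pattern (const std::vector<std::string> &names,
                   const char                     *pattern)
{
    std::vector<std::string> result;

    for (std::size_t i = 0; i != names.size (); ++i) {
        // fnmatch() returns 0 when the string matches the pattern
        if (0 == fnmatch (pattern, names [i].c_str (), 0))
            result.push_back (names [i]);
    }

    return result;
}
```

A pattern such as "*.[Uu][Tt][Ff]*8" would accept names ending in
.utf8, .UTF-8, or .UTF8, but, as noted above, matching on names still
can't find an encoding that a platform spells in some other way or
doesn't have.
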
> 
> Right. It wasn't a thought-out proposal. I was just brainstorming :)
> We may not be able to use any of it to fix the hanging tests.
> 
>  > What I really need is some actual requirements so that I can write some
>  > code and get this bug closed.
> 
> My only requirement is to get those tests to pass in a reasonable
> amount of time (i.e., without timing out), and without compromising
> their effectiveness.
> 
>  >
>  > It seems that you want to guarantee that we test multibyte locales.
> 
> It seems important to exercise ctype::do_narrow() in this case but
> I haven't looked at the code very carefully. It could be that the
> code path in the multibyte case isn't any different from the single
> byte case.
> 
>  > Do
>  > we want to give up on the locale name matching, or do we want to include
>  > zh_CN in the list of locales to test? What about matching the encoding?
>  > Should we ignore all of this and just find one locale for each value of
>  > MB_CUR_MAX from 1 to MB_LEN_MAX and run the test on them?
> 
> Maybe. I'll let you propose what makes the most sense to you :)
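
For what it's worth, the last option (one locale for each value of
MB_CUR_MAX) might look something like the sketch below. It assumes
the caller already has a list of candidate locale names, and the
function name is made up for the example. It isn't what
find_mb_locale() does today.

```cpp
#include <clocale>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: from the candidate locale names, keep the first
// one found for each distinct value of MB_CUR_MAX. Names that
// setlocale() rejects are skipped.
std::map<int, std::string>
one_locale_per_mb_cur_max (const std::vector<std::string> &names)
{
    // save the current LC_CTYPE setting so that it can be restored
    const char* const cur = std::setlocale (LC_CTYPE, 0);
    const std::string saved (cur ? cur : "C");

    std::map<int, std::string> result;

    for (std::size_t i = 0; i != names.size (); ++i) {
        if (!std::setlocale (LC_CTYPE, names [i].c_str ()))
            continue;   // locale not available

        // MB_CUR_MAX reflects the LC_CTYPE category just set
        const int mb_max = int (MB_CUR_MAX);

        if (result.find (mb_max) == result.end ())
            result [mb_max] = names [i];
    }

    std::setlocale (LC_CTYPE, saved.c_str ());

    return result;
}
```

Since setlocale() changes process-wide state, this would have to run
before any threads are started in the MT tests.
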
> 
> Martin
> 
>  >
>  > Travis
>  >
>  >
>  >
>  >
>  >> -----Original Message-----
>  >> From: Martin Sebor [mailto:msebor@gmail.com] On Behalf Of Martin Sebor
>  >> Sent: Wednesday, January 02, 2008 1:41 PM
>  >> To: stdcxx-dev@incubator.apache.org
>  >> Subject: Re: low hanging fruit while cleaning up test failures
>  >>
>  >> Travis Vitek wrote:
>  >>>
>  >>> Martin Sebor wrote:
>  >>>> Travis Vitek wrote:
>  >>>>> Martin Sebor wrote:
>  >>>>>> Travis Vitek wrote:
>  >>>>>>> Martin Sebor wrote:
>  >>>>>>>> I added a new function, rw_fnmatch(), to the test driver. It
>  >>>>>>>> behaves just like the POSIX fnmatch() (the FNM_XXX constants
>  >>>>>>>> aren't implemented yet). While the main purpose behind the new
>  >>>>>>>> function is to support STDCXX-683 it should make it easier to
>  >>>>>>>> also implement a scheme like the one outlined below.
>  >>>>>>>>
>  >>>>>>>> Travis, feel free to experiment/prototype a solution :)
>  >>>>>>>>
>  >>>>>>>> Martin
>  >>>>>>>>
>  >>>>>>> What expression should be used to get an appropriate set of
>  >>>>>>> locales for a given platform? I can't really expect a filter for
>  >>>>>>> all UTF-8 locales to work on all platforms as some don't have
>  >>>>>>> those encodings available at all. If I filter by language, then
>  >>>>>>> I may be limiting the testing to some always correct subset. Is
>  >>>>>>> that acceptable for the MT tests?
>  >>>>>> I think the MT ctype tests just need to exercise a representative
>  >>>>>> sample of multi-byte encodings (i.e., MB_CUR_MAX between 1 and
>  >>>>>> MB_LEN_MAX). There already is some code in the test suite to find
>  >>>>>> locales that use these encodings, although it could be made more
>  >>>>>> efficient. I don't know how useful rw_fnmatch() will turn out to
>  >>>>>> be in finding these codesets since their names don't matter.
>  >>>>>>
>  >>>>>> Martin
>  >>>>>>
>  >>>>>>> Travis
>  >>>>> Actually, I think I meant to say single threaded tests. Those are
>  >>>>> the ones that currently test every locale. The multi-threaded
>  >>>>> tests already test a subset of locales, though the method for
>  >>>>> selecting those locales may vary between tests.
>  >>>>>
>  >>>>> I don't think it is right to test a fixed set of locales based on
>  >>>>> language, country, or encoding. If you agree, then we probably
>  >>>>> agree that the proposed enhancement doesn't actually do anything
>  >>>>> useful [and I've wasted a bunch of time]. If this is the case,
>  >>>>> then we need to propose another solution for selecting locales.
>  >>>> I think testing a small subset of installed locales should be
>  >>>> enough. In fact, for white box testing of the ctype facets,
>  >>>> exercising three locales, "C" and two named ones, should be
>  >>>> sufficient.
>  >>>>
>  >>>>> If I am wrong, and it is useful for testing [and more specifically
>  >>>>> how it would be useful for fixing STDCXX-608], then I'd like to
>  >>>>> hear how.
>  >>>> What do you propose?
>  >>>>
>  >>>> Martin
>  >>>>
>  >>>>
>  >>> Okay. I can live with that. Then the issue now becomes deciding which
>  >>> additional locales to test. How about just testing all Spanish and
>  >>> German locales?
>  >> I'd make sure at least one of them uses a multibyte encoding. Maybe
>  >> zh_CN.GB18030 (with MB_CUR_MAX of 4)?
>  >>
>  >> Martin
>  >>
>  >
> 
> 
>