Message-ID: <477C4030.4030507@roguewave.com>
Date: Wed, 02 Jan 2008 18:53:52 -0700
From: Martin Sebor
Organization: Rogue Wave Software, Inc.
To: stdcxx-dev@incubator.apache.org
Subject: Re: low hanging fruit while cleaning up test failures
In-Reply-To: <477C3958.5010405@roguewave.com>

Martin Sebor wrote:
> [forwarding back to the list]
>
> Travis Vitek wrote:
> > Martin,
> >
> > Not all supported platforms have the GB18030 encoding [HP, Compaq and
> > IRIX don't], and of those that do, they use different names [gb18030 vs
> > GB18030]. Same with UTF-8 [utf8, UTF8 or UTF-8].
>
> I realize that.
> That's the reason why I mentioned the (currently quite
> inefficient) find_mb_locale() in 22.locale.codecvt.out.cpp: it looks
> for any multibyte locale with MB_CUR_MAX of some value.

[Warning: completely unbaked brainstorming follows...]

Btw., it might help to think of this not in terms of locales or code
sets or the names of such things but rather in terms of the parameters
that they have in common or that distinguish them, and that our tests
are trying to exploit. The names are just handles, or shortcuts, that
let us refer to these sets of attributes. Maybe the right interface is
some kind of SQL-like language that lets us refer to the attributes
individually or in Boolean expressions.

Each codeset has three fundamental attributes:

Since code_set_name is platform-specific, we could add another, called
std_name, for the standard name assigned by IANA:

    http://www.iana.org/assignments/character-sets

There might be others (e.g., the highest wchar_t value).

So maybe the ideal interface we're looking for is simply a function
that takes a query, like this:

    const char* find_codeset(const char *query);

with query being an expression that lets us refer to the attributes.
For instance,

    find_codeset("std_name == GB18030")

would return the platform-specific name of the codeset corresponding
to GB18030, and

    find_codeset("mb_cur_max >= 2")

would return a string of all platform-specific codeset names for which
MB_CUR_MAX >= 2 holds. Names would be able to use fnmatch()-style
patterns. The query string could contain printf-like directives to
make it easy to compose expressions parametrized on run-time values.

Would this interface make the locale tests easier to write? Would the
expression parser/evaluator be fairly easy to implement?

Martin

>
> > On top of that, Windows
> > uses code page numbers instead of encodings. Now I can easily convert
> > names to uppercase and strip out non-alphanumeric characters. I might
> > even be able to do something with Windows. That isn't the problem.
> >
> > The problem is that I don't seem to have a clear understanding of what
> > exactly you want from this. The original proposal was to be able to
> > filter locales by name or encoding, like so...
> >
> >   char* locales = rw_locales (_RWSTD_LC_ALL, "en_US de", 0, true);
> >   // would retrieve C, en_US, and all German locales
> >
> >   char* locales = rw_locales (_RWSTD_LC_ALL, 0, "UTF-8", true);
> >   // would retrieve all UTF-8 locales [those that end in .utf8, .UTF-8
> >   // or .UTF8]
> >
> > So I wrote that. Unfortunately, as mentioned above, it has limitations.
>
> Right. It wasn't a thought-out proposal. I was just brainstorming :)
> We may not be able to use any of it to fix the hanging tests.
>
> > What I really need is some actual requirements so that I can write some
> > code and get this bug closed.
>
> My only requirement is to get those tests to pass in a reasonable
> amount of time (i.e., without timing out), and without compromising
> their effectiveness.
>
> >
> > It seems that you want to guarantee that we test multibyte locales.
>
> It seems important to exercise ctype::do_narrow() in this case but
> I haven't looked at the code very carefully. It could be that the
> code path in the multibyte case isn't any different from the single
> byte case.
>
> > Do
> > we want to give up on the locale name matching, or do we want to include
> > zh_CN in the list of locales to test? What about matching the encoding?
> > Should we ignore all of this and just find one locale for each value of
> > MB_CUR_MAX from 1 to MB_LEN_MAX and run the test on them?
>
> Maybe.
> I'll let you propose what makes the most sense to you :)
>
> Martin
>
> >
> > Travis
> >
> >
> >> -----Original Message-----
> >> From: Martin Sebor [mailto:msebor@gmail.com] On Behalf Of Martin Sebor
> >> Sent: Wednesday, January 02, 2008 1:41 PM
> >> To: stdcxx-dev@incubator.apache.org
> >> Subject: Re: low hanging fruit while cleaning up test failures
> >>
> >> Travis Vitek wrote:
> >>>
> >>> Martin Sebor wrote:
> >>>> Travis Vitek wrote:
> >>>>> Martin Sebor wrote:
> >>>>>> Travis Vitek wrote:
> >>>>>>> Martin Sebor wrote:
> >>>>>>>> I added a new function, rw_fnmatch(), to the test driver. It
> >>>>>>>> behaves just like the POSIX fnmatch() (the FNM_XXX constants
> >>>>>>>> aren't implemented yet). While the main purpose behind the new
> >>>>>>>> function is to support STDCXX-683 it should make it easier to
> >>>>>>>> also implement a scheme like the one outlined below.
> >>>>>>>>
> >>>>>>>> Travis, feel free to experiment/prototype a solution :)
> >>>>>>>>
> >>>>>>>> Martin
> >>>>>>>>
> >>>>>>> What expression should be used to get an appropriate set of
> >>>>>>> locales for a given platform? I can't really expect a filter for
> >>>>>>> all UTF-8 locales to work on all platforms as some don't have
> >>>>>>> those encodings available at all. If I filter by language, then
> >>>>>>> I may be limiting the testing to some always correct subset. Is
> >>>>>>> that acceptable for the MT tests?
> >>>>>> I think the MT ctype tests just need to exercise a representative
> >>>>>> sample of multi-byte encodings (i.e., MB_CUR_MAX between 1 and
> >>>>>> MB_LEN_MAX). There already is some code in the test suite to find
> >>>>>> locales that use these encodings, although it could be made more
> >>>>>> efficient. I don't know how useful rw_fnmatch() will turn out to
> >>>>>> be in finding these codesets since their names don't matter.
> >>>>>>
> >>>>>> Martin
> >>>>>>
> >>>>>>> Travis
> >>>>> Actually, I think I meant to say single-threaded tests. Those are
> >>>>> the ones that currently test every locale. The multi-threaded tests
> >>>>> already test a subset of locales, though the method for selecting
> >>>>> those locales may vary between tests.
> >>>>>
> >>>>> I don't think it is right to test a fixed set of locales based on
> >>>>> language, country, or encoding. If you agree, then we probably agree
> >>>>> that the proposed enhancement doesn't actually do anything useful
> >>>>> [and I've wasted a bunch of time]. If this is the case, then we need
> >>>>> to propose another solution for selecting locales.
> >>>> I think testing a small subset of installed locales should be enough.
> >>>> In fact, for white box testing of the ctype facets, exercising three
> >>>> locales, "C" and two named ones, should be sufficient.
> >>>>
> >>>>> If I am wrong, and it is useful for testing [and more specifically
> >>>>> how it would be useful for fixing STDCXX-608], then I'd like to hear
> >>>>> how.
> >>>> What do you propose?
> >>>>
> >>>> Martin
> >>>>
> >>>>
> >>> Okay. I can live with that. Then the issue now becomes deciding which
> >>> additional locales to test. How about just testing all Spanish and
> >>> German locales?
> >> I'd make sure at least one of them uses a multibyte encoding. Maybe
> >> zh_CN.GB18030 (with MB_CUR_MAX of 4)?
> >>
> >> Martin
> >>
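
For what it's worth, here is a rough sketch of how the find_codeset()
query idea from the brainstorm above might look. Everything beyond the
function name and the two example queries is an assumption made for
illustration: the Codeset struct, the hard-coded table and its values,
and the std::string return type (used instead of const char* to avoid
ownership questions) are all hypothetical. It evaluates only a single
"attribute operator value" comparison; Boolean expressions, fnmatch()
patterns, and printf-like directives are left out.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical attribute record. A real test driver would probe the
// installed locales instead of hard-coding these illustrative values.
struct Codeset {
    std::string name;       // platform-specific codeset name
    std::string std_name;   // standard (IANA) name
    int         mb_cur_max; // MB_CUR_MAX for the codeset
};

static const std::vector<Codeset> codesets = {
    { "iso88591", "ISO-8859-1", 1 },
    { "eucJP",    "EUC-JP",     3 },
    { "gb18030",  "GB18030",    4 }
};

// Apply the comparison operator spelled out in the query.
template <class T>
static bool compare(const T &lhs, const std::string &op, const T &rhs)
{
    if (op == "==") return lhs == rhs;
    if (op == "!=") return lhs != rhs;
    if (op == ">=") return lhs >= rhs;
    if (op == "<=") return lhs <= rhs;
    if (op == ">")  return lhs >  rhs;
    if (op == "<")  return lhs <  rhs;
    return false;   // an unknown operator matches nothing
}

// Returns a space-separated list of the platform-specific names of all
// codesets matching a query of the form "attribute operator value".
std::string find_codeset(const char *query)
{
    assert(query != nullptr);

    std::istringstream in(query);
    std::string attr, op, value, result;

    if (!(in >> attr >> op >> value))
        return result;   // malformed query: empty result

    for (const Codeset &cs : codesets) {
        bool match = false;

        if (attr == "std_name")
            match = compare(cs.std_name, op, value);
        else if (attr == "code_set_name")
            match = compare(cs.name, op, value);
        else if (attr == "mb_cur_max")
            match = compare(cs.mb_cur_max, op, std::atoi(value.c_str()));

        if (match) {
            if (!result.empty())
                result += ' ';
            result += cs.name;
        }
    }
    return result;
}
```

With the table above, find_codeset("std_name == GB18030") yields
"gb18030" and find_codeset("mb_cur_max >= 2") yields "eucJP gb18030".
Keeping the attributes in one record per codeset is what would let a
fuller parser combine them in Boolean expressions later.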