Mailing-List: contact stdcxx-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: stdcxx-dev@incubator.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <46CB0542.5060109@roguewave.com>
Date: Tue, 21 Aug 2007 09:31:14 -0600
From: Martin Sebor <sebor@roguewave.com>
Organization: Rogue Wave Software, Inc.
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US;
 rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4
MIME-Version: 1.0
To: stdcxx-dev@incubator.apache.org
Subject: Re: expectation vs requirements for locale facets
References: <a648c5670708200331x6ca52965j9fb3d56e9334df37@mail.gmail.com>
In-Reply-To: <a648c5670708200331x6ca52965j9fb3d56e9334df37@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Travis Vitek wrote:
>> Martin Sebor wrote:
>>
>>
>> Yes. But notice the text doesn't say anything about time_put_byname or
>> time_get_byname ;-)
>>
> 
> Well, the standard doesn't say much at all about the *_byname<>
> facets. All it really says about them is
> 
>   [21.1.1.2 p4] For some standard facets a standard "..._byname" class,
[...]

The _byname requirements are extremely vague. Sometimes they are
also implied by the requirements on the base facets, which makes
them difficult to find. It's a mess.

> 
> So, if I'm reading that right, the *_byname<> facet classes are just
> there to prevent the user from having to instantiate a std::locale
> directly.

I'm not sure what you mean by this. The _byname facets are really
just an implementation detail that's exposed in the interface of the
locale library. They should never have been specified.

> 
>> The C++ standard (or even the C standard for that
>> matter) isn't going to be of help here.
> 
> Wait. Say what now? I'm not sure what you're trying to tell me here.
> If the C++ Standard says that these facets read or write years as
> roman numerals, then they should probably do so, regardless of what
> any other standard document requires. I think this will actually get
> cleared up in a few seconds...

The C and C++ standards only specify the requirements on the "C"
locale and leave the localized behavior unspecified. So pretty
much anything goes. There are some ground rules but I suspect
you won't be able to tease the requirement on swallowing leading
space for the %e directive out of them.

> 
>>> Of
>>> course that isn't what I'm seeing.
>> Test case?
> 
> Yeah. See attachment. Only tested on Win32/VC8 and Linux/GCC.

Thanks. Here are the results with stdcxx and with g++ 3.4.6:

$ ./t.stdcxx | grep fail
string=07/06/08 result=fail     locale=thai
string= 7.06.1908       result=fail     locale=bg_BG
string=07/06/08 result=fail     locale=lo_LA
string=07/06/08 result=fail     locale=th_TH

$ ./t.gcc | grep fail
string=��� %.1d ��� 1908        result=fail     locale=ar_SA
string=۰۸/۰۶/۰۷ result=fail     locale=fa_IR
string=ಗುರುವಾರ 07 ಜೂ 1908       result=fail     locale=kn_IN

Looks like g++ is failing on multibyte character sequences but
not on the spaces. We seem to somehow manage to process the
multibyte sequences (I wonder how, or if it's a weakness in
the test) but have issues with the leading space in bg_BG.
I don't know what the problem is with the other locales...

> 
>> It's hard to say from just looking at the code (and I haven't looked
>> very carefully). In general, we [try to] implement the POSIX
>> semantics, so if it works with strptime()/strftime() it should work
>> with our time_put_byname/time_get_byname.
>>
> 
> Well, there's the problem right there. The standard requires that the
> time_put<> facet format its output according to the POSIX function
> strftime(), with the option for supporting extensions. It makes no
> indication that the time_get<> facet should read data in such a way as
> to be compatible with strptime(). The only thing I see that says
> anything about the format expected by time_get<> is here...
[...]
> 

Right. Pretty vague.

> 
> This paragraph says that time_get<>::get_date() is supposed to process
> the output of time_put<>::put(..., 'x').
> 
>   [22.2.5.1.2 p4] Effects: Reads characters starting at s until it has
>   extracted those struct tm members, and remaining format characters,
>   used by time_put<>::put to produce the format specified by 'x' or
>   until it encounters an error.

Yes. The problem with the C++ standard in this area is that the
requirements are vague and not always implementable (e.g., the
multibyte sequences -- all the narrow specializations of the
_get facets operate on single characters).
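
To make the put/get_date relationship concrete, here is a minimal
sketch of the round trip that paragraph describes. It is not the test
attached earlier in the thread; the names RoundTrip and
round_trip_date are made up for illustration. The year is left out of
the comparison on the assumption that a two-digit year cannot be read
back unambiguously.

```cpp
#include <cassert>
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper type: the outcome of one put('x') -> get_date() trip.
struct RoundTrip {
    std::string text;    // what time_put<char>::put(..., 'x') produced
    bool        ok;      // no failbit, and day and month were read back
    std::tm     parsed;  // what time_get<char>::get_date() extracted
};

RoundTrip round_trip_date(const std::locale& loc, const std::tm& in)
{
    assert(in.tm_mday >= 1 && in.tm_mday <= 31);

    RoundTrip r;
    r.parsed = std::tm();

    // Format the date with the locale's 'x' representation.
    std::ostringstream out;
    out.imbue(loc);
    std::use_facet<std::time_put<char> >(loc).put(
        std::ostreambuf_iterator<char>(out), out, ' ', &in, 'x');
    r.text = out.str();

    // Feed the formatted text back through get_date() in the same locale.
    std::istringstream is(r.text);
    is.imbue(loc);
    std::ios_base::iostate err = std::ios_base::goodbit;
    std::use_facet<std::time_get<char> >(loc).get_date(
        std::istreambuf_iterator<char>(is),
        std::istreambuf_iterator<char>(), is, err, &r.parsed);

    // Assumption: only day and month are compared (see the note above).
    r.ok = !(err & std::ios_base::failbit)
        && r.parsed.tm_mday == in.tm_mday
        && r.parsed.tm_mon == in.tm_mon;
    return r;
}
```

Calling it with std::locale::classic() and a tm set to 7 June 1908
(tm_mday = 7, tm_mon = 5, tm_year = 8) gives a baseline to compare the
named locales against.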

> 
>> If we test this behavior it's gotta be right ;-) Where does POSIX
>> say leading spaces must be skipped? I see this under %e: Equivalent
>> to %d. And under %d: The day of the month [01,31]; leading zeros
>> are permitted but not required. Nothing about ignoring spaces.
>>
> 
> Absolutely. The docs for POSIX strftime()...
[...]
> So strftime() isn't even compatible with strptime() when it comes to '%e'.

Hmm. That seems like a bug in POSIX then, unless we're missing
something. You might want to create a POSIX-only test case to
verify this and if I'm right open a discussion on the Austin
Group list (http://www.opengroup.org/austin/lists.html).
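
A POSIX-only check along those lines might look like the sketch below.
It assumes a system that declares strptime() in <time.h>, and the
function names are hypothetical. It makes no claim about what
strptime() must do with the space that strftime() emits for
single-digit days; it only reports whether a given C library reads
its own %e output back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <time.h>   // strftime() and, on POSIX systems, strptime()

// Format a day of the month with strftime("%e") in the current C locale.
std::string format_day_e(int mday)
{
    assert(mday >= 1 && mday <= 31);

    struct tm t;
    std::memset(&t, 0, sizeof t);
    t.tm_mday = mday;

    char buf[16];
    const std::size_t n = strftime(buf, sizeof buf, "%e", &t);
    return std::string(buf, n);
}

// Parse text with strptime("%e"). Returns the day of the month, or -1 if
// the conversion fails or leaves input unconsumed.
int parse_day_e(const std::string& text)
{
    struct tm t;
    std::memset(&t, 0, sizeof t);

    const char* end = strptime(text.c_str(), "%e", &t);
    if (end == 0 || *end != '\0')
        return -1;
    return t.tm_mday;
}

// True if strptime("%e") reads back what strftime("%e") wrote for mday.
bool day_e_round_trips(int mday)
{
    return parse_day_e(format_day_e(mday)) == mday;
}
```

Days 1 through 9 are the interesting inputs, since those are the ones
strftime() pads with a space.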

> 
[...]
> Unfortunately, without consistent input/output it is going to be
> difficult for this multi-threading test to verify that no data
> corruption is occurring with arbitrary locales. Hopefully there is some
> system in place that allows us to explicitly specify which locales are
> to be used for a test.

Not really. My approach would be to detect locales with this
problem and avoid using them. The test also doesn't need to
be exhaustive, at least not in this iteration. I think
exercising just the most common patterns should be good enough
(although %X is pretty common :)
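
The detect-and-avoid approach could be sketched roughly as follows.
This is an illustration, not existing test-suite code: time_round_trips
and usable_locales are hypothetical names, the sample time is
arbitrary, and only the 'X' pattern is checked. Locales that are not
installed are skipped as well.

```cpp
#include <cstddef>
#include <ctime>
#include <iterator>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// True if get_time() reads back what put(..., 'X') writes in loc.
bool time_round_trips(const std::locale& loc, const std::tm& in)
{
    std::ostringstream out;
    out.imbue(loc);
    std::use_facet<std::time_put<char> >(loc).put(
        std::ostreambuf_iterator<char>(out), out, ' ', &in, 'X');

    std::istringstream is(out.str());
    is.imbue(loc);
    std::ios_base::iostate err = std::ios_base::goodbit;
    std::tm got = std::tm();
    std::use_facet<std::time_get<char> >(loc).get_time(
        std::istreambuf_iterator<char>(is),
        std::istreambuf_iterator<char>(), is, err, &got);

    return !(err & std::ios_base::failbit)
        && got.tm_hour == in.tm_hour
        && got.tm_min == in.tm_min
        && got.tm_sec == in.tm_sec;
}

// Keep only the named locales that exist and pass the round trip.
std::vector<std::string>
usable_locales(const std::vector<std::string>& names)
{
    std::tm sample = std::tm();   // arbitrary sample time, 13:04:05
    sample.tm_hour = 13;
    sample.tm_min = 4;
    sample.tm_sec = 5;

    std::vector<std::string> keep;
    for (std::size_t i = 0; i != names.size(); ++i) {
        try {
            const std::locale loc(names[i].c_str());
            if (time_round_trips(loc, sample))
                keep.push_back(names[i]);
        }
        catch (const std::runtime_error&) {
            // Locale not available on this system: skip it.
        }
    }
    return keep;
}
```

The thread test would then iterate over the returned names rather than
over every installed locale.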

Martin