incubator-stdcxx-dev mailing list archives

From Liviu Nicoara <nikko...@hates.ms>
Subject Re: Fwd: Re: STDCXX-1071 numpunct facet defect
Date Wed, 03 Oct 2012 13:01:33 GMT
On 10/02/12 10:41, Martin Sebor wrote:
> I haven't had time to look at this since my last email on
> Sunday. I also forgot about the string mutex. I don't think
> I'll have time to spend on this until later in the week.
> Unless the disassembly reveals the smoking gun, I think we
> might need to simplify the test to get to the bottom of the
> differences in our measurements. (I.e., eliminate the library
> and measure the runtime of a simple thread loop, with and
> without locking.) We should also look at the GLIBC and
> kernel versions on our systems, on the off chance that
> there has been a change that could explain the discrepancy
> between my numbers and yours. I suspect my system (RHEL 4.8)
> is much older than yours (I don't remember now if you posted
> your details).

I am gathering more measurements along these lines, but it is
time-consuming. I expect to have some ready for review later today or
tomorrow. In the meantime, could you please post your kernel, glibc, and
compiler versions?
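
For reference, a minimal sketch of the kind of library-free test suggested above: a plain thread loop timed with and without a lock. It uses C++11 threads rather than anything from stdcxx, and the name run_loop, its parameters, and the loop body are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper, not part of stdcxx: starts `nthreads` threads, each
// running `iters` iterations of a trivial loop, and takes a shared mutex on
// every iteration when `lock` is true. Returns the elapsed wall-clock time
// in seconds.
double run_loop(std::size_t nthreads, std::size_t iters, bool lock)
{
    std::mutex mtx;
    std::vector<unsigned long> counts(nthreads, 0);   // one slot per thread

    auto body = [&](std::size_t id) {
        // volatile keeps the compiler from collapsing the loop
        volatile unsigned long local = 0;
        for (std::size_t i = 0; i != iters; ++i) {
            if (lock) {
                std::lock_guard<std::mutex> guard(mtx);
                local = local + 1;
            } else {
                local = local + 1;
            }
        }
        counts[id] = local;
    };

    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t != nthreads; ++t)
        threads.emplace_back(body, t);
    for (auto& t : threads)
        t.join();

    const auto stop = std::chrono::steady_clock::now();

    // sanity check: every thread completed all of its iterations
    for (unsigned long c : counts)
        assert(c == iters);

    return std::chrono::duration<double>(stop - start).count();
}
```

Comparing run_loop(n, iters, true) against run_loop(n, iters, false) for the same n and iters on each system would give a baseline for the cost of locking that is independent of the library.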

Liviu

>
> Martin
>
> On 10/02/2012 06:22 AM, Liviu Nicoara wrote:
>> On 09/30/12 18:18, Martin Sebor wrote:
>>> I see you did a 64-bit build while I did a 32-bit one, so
>>> I tried a 64-bit build. The cached version (i.e., the one compiled
>>> with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
>>> as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).
>>>
>>> I had made one change to the test program that I thought might
>>> account for the difference: I removed the call to abort from
>>> the thread function since it was causing the process to exit
>>> prematurely in some of my tests. But since you used the
>>> modified program for your latest measurements that couldn't
>>> be it.
>>>
>>> I can't explain the differences. They just don't make sense
>>> to me. Your results should be the other way around. Can you
>>> post the disassembly of function f() for each of the two
>>> configurations of the test?
>>
>> The first thing that struck me in the cached `f' was that __string_ref
>> class uses a mutex for synchronizing access to the ref counter. It turns
>> out, for Linux on AMD64 we explicitly use a mutex instead of the atomic
>> ops on the ref counter, via a block in rw/_config.h:
>>
>> # if _RWSTD_VER_MAJOR < 5
>> # ifdef _RWSTD_OS_LINUX
>> // on Linux/AMD64, unless explicitly requested, disable the use
>> // of atomic operations in string for binary compatibility with
>> // stdcxx 4.1.x
>> # ifndef _RWSTD_USE_STRING_ATOMIC_OPS
>> # define _RWSTD_NO_STRING_ATOMIC_OPS
>> # endif // _RWSTD_USE_STRING_ATOMIC_OPS
>> # endif // _WIN32
>> # endif // stdcxx < 5.0
>>
>>
>> That is not the cause of the performance difference, though. Even after
>> building with _RWSTD_USE_STRING_ATOMIC_OPS I still see better
>> performance with the non-cached version.
>>
>> Liviu
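
To illustrate the two reference-counting strategies discussed above, here is a minimal sketch using C++11 facilities. MutexRefCount, AtomicRefCount, and copy_and_destroy are hypothetical stand-ins written for this example; they are not the actual __string_ref implementation, which predates C++11 and may differ in every detail.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical counter guarded by a mutex, analogous in spirit to building
// with _RWSTD_NO_STRING_ATOMIC_OPS defined.
struct MutexRefCount {
    long add_ref() {
        std::lock_guard<std::mutex> guard(mtx_);
        return ++count_;
    }
    long release() {
        std::lock_guard<std::mutex> guard(mtx_);
        return --count_;
    }
    std::mutex mtx_;
    long count_ = 1;    // the owning object holds the first reference
};

// Hypothetical counter using atomic read-modify-write operations instead
// of a lock.
struct AtomicRefCount {
    long add_ref() {
        return count_.fetch_add(1, std::memory_order_relaxed) + 1;
    }
    long release() {
        return count_.fetch_sub(1, std::memory_order_acq_rel) - 1;
    }
    std::atomic<long> count_{1};
};

// Models what copying and then destroying a reference-counted string does
// to the shared counter. Returns the count left after the copy is gone.
template <class RefCount>
long copy_and_destroy(RefCount& rc)
{
    const long after_copy = rc.add_ref();   // the copy shares the body
    assert(after_copy >= 2);
    return rc.release();                    // the copy is destroyed
}
```

Both counters expose the same interface, so a timing loop can call copy_and_destroy on either one from several threads to compare the two approaches in isolation.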

