incubator-stdcxx-dev mailing list archives

From: Martin Sebor <mse...@gmail.com>
Subject: Re: Fwd: Re: STDCXX-1071 numpunct facet defect
Date: Wed, 03 Oct 2012 15:10:58 GMT
On 10/03/2012 07:01 AM, Liviu Nicoara wrote:
> On 10/02/12 10:41, Martin Sebor wrote:
>> I haven't had time to look at this since my last email on
>> Sunday. I also forgot about the string mutex. I don't think
>> I'll have time to spend on this until later in the week.
>> Unless the disassembly reveals the smoking gun, I think we
>> might need to simplify the test to get to the bottom of the
>> differences in our measurements. (I.e., eliminate the library
>> and measure the runtime of a simple thread loop, with and
>> without locking.) We should also look at the GLIBC and
>> kernel versions on our systems, on the off chance that
>> there has been a change that could explain the discrepancy
>> between my numbers and yours. I suspect my system (RHEL 4.8)
>> is much older than yours (I don't remember now if you posted
>> your details).
>
> I am gathering some more measurements along these lines, but it's
> time-consuming. I estimate I will have some ready for review later today
> or tomorrow. In the meantime, could you please post your kernel, glibc,
> and compiler versions?

I was just thinking of a few simple loops along the lines of:

   void* thread_func (void*) {
       for (int i = 0; i < N; ++i) {
           // test 1: do some simple stuff inline
           // test 2: call a virtual function to do the same stuff
           // test 3: lock and unlock a mutex and do the same stuff
       }
       return 0;
   }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what the "simple stuff" is (ultimately, even
when it's retrieving the numpunct::grouping() data).
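
Here is a rough, self-contained sketch of what I mean, assuming POSIX
threads; the names, N, the thread count, and the trivial payload
(bumping a counter) are just placeholders:

   #include <pthread.h>
   #include <cstdio>
   #include <cstdlib>

   // placeholders: iteration count, thread count, and the trivial
   // "simple stuff" (bumping a counter)
   const int N        = 10000000;
   const int nthreads = 4;

   struct Base {
       virtual void work (long &n) { ++n; }
       virtual ~Base () { }
   };

   static Base            base_obj;
   static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

   void* thread_inline (void*) {        // test 1: simple stuff inline
       long n = 0;
       for (int i = 0; i != N; ++i)
           ++n;
       return (void*)n;   // use the result so the loop isn't optimized away
   }

   void* thread_virtual (void *arg) {   // test 2: same stuff behind a virtual call
       Base* const bp = (Base*)arg;
       long n = 0;
       for (int i = 0; i != N; ++i)
           bp->work (n);
       return (void*)n;
   }

   void* thread_mutex (void*) {         // test 3: same stuff under a mutex
       long n = 0;
       for (int i = 0; i != N; ++i) {
           pthread_mutex_lock (&mtx);
           ++n;
           pthread_mutex_unlock (&mtx);
       }
       return (void*)n;
   }

   int main (int argc, char *argv[])
   {
       // select the variant on the command line: 1, 2, or 3
       const int test = argc < 2 ? 1 : std::atoi (argv [1]);

       void* (*const func)(void*) =
           3 == test ? thread_mutex : 2 == test ? thread_virtual : thread_inline;

       pthread_t tid [nthreads];

       for (int i = 0; i != nthreads; ++i)
           pthread_create (tid + i, 0, func, &base_obj);

       for (int i = 0; i != nthreads; ++i)
           pthread_join (tid [i], 0);

       std::printf ("done with test %d\n", test);
       return 0;
   }

Built with something like g++ -O2 -pthread, each variant can then be
timed separately with time(1), with no library code involved.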

For the Linux tests I used a 16-CPU box (Xeon X5570 @ 3GHz) running
RHEL 4.8 with kernel 2.6.9-89.0.11.ELlargesmp, GLIBC 2.3.4, and
GCC 3.4.6.

Martin

>
> Liviu
>
>>
>> Martin
>>
>> On 10/02/2012 06:22 AM, Liviu Nicoara wrote:
>>> On 09/30/12 18:18, Martin Sebor wrote:
>>>> I see you did a 64-bit build while I did a 32-bit one, so
>>>> I tried 64 bits. The cached version (i.e., the one compiled
>>>> with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
>>>> as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).
>>>>
>>>> I had made one change to the test program that I thought might
>>>> account for the difference: I removed the call to abort from
>>>> the thread function since it was causing the process to exit
>>>> prematurely in some of my tests. But since you used the
>>>> modified program for your latest measurements, that couldn't
>>>> be it.
>>>>
>>>> I can't explain the differences. They just don't make sense
>>>> to me. Your results should be the other way around. Can you
>>>> post the disassembly of function f() for each of the two
>>>> configurations of the test?
>>>
>>> The first thing that struck me in the cached `f' was that the
>>> __string_ref class uses a mutex to synchronize access to the ref
>>> counter. It turns out that for Linux on AMD64 we explicitly use a
>>> mutex instead of atomic ops on the ref counter, via a block in
>>> rw/_config.h:
>>>
>>> # if _RWSTD_VER_MAJOR < 5
>>> # ifdef _RWSTD_OS_LINUX
>>> // on Linux/AMD64, unless explicitly requested, disable the use
>>> // of atomic operations in string for binary compatibility with
>>> // stdcxx 4.1.x
>>> # ifndef _RWSTD_USE_STRING_ATOMIC_OPS
>>> # define _RWSTD_NO_STRING_ATOMIC_OPS
>>> # endif // _RWSTD_USE_STRING_ATOMIC_OPS
>>> # endif // _RWSTD_OS_LINUX
>>> # endif // stdcxx < 5.0
>>>
>>>
>>> That is not the cause of the performance difference, though. Even
>>> after building with _RWSTD_USE_STRING_ATOMIC_OPS defined, the
>>> non-cached version still comes out faster.
>>>
>>> Liviu
>

