On 10/03/2012 07:01 AM, Liviu Nicoara wrote:
> On 10/02/12 10:41, Martin Sebor wrote:
>> I haven't had time to look at this since my last email on
>> Sunday. I also forgot about the string mutex. I don't think
>> I'll have time to spend on this until later in the week.
>> Unless the disassembly reveals the smoking gun, I think we
>> might need to simplify the test to get to the bottom of the
>> differences in our measurements. (I.e., eliminate the library
>> and measure the runtime of a simple thread loop, with and
>> without locking.) We should also look at the GLIBC and
>> kernel versions on our systems, on the off chance that
>> there has been a change that could explain the discrepancy
>> between my numbers and yours. I suspect my system (RHEL 4.8)
>> is much older than yours (I don't remember now if you posted
>> your details).
>
> I am gathering some more measurements along these lines but it's time
> consuming. I estimate I will have some ready for review later today or
> tomorrow. In the meantime could you please post your kernel, glibc and
> compiler versions?

I was just thinking of a few simple loops along the lines of:

    void* thread_func (void*) {
        for (int i = 0; i < N; ++i)
            test 1: do some simple stuff inline
            test 2: call a virtual function to do the same stuff
            test 3: lock and unlock a mutex and do the same stuff
    }

Test 1 should be the fastest and test 3 the slowest. This should
hold regardless of what "simple stuff" is (eventually, even when
it's getting numpunct::grouping() data).

For the Linux tests I used a 16 CPU (Xeon X5570 @ 3GHz) box with
RHEL 4.8 with 2.6.9-89.0.11.ELlargesmp, GLIBC version is 2.3.4,
and GCC 3.4.6.

Martin

>
> Liviu
>
>>
>> Martin
>>
>> On 10/02/2012 06:22 AM, Liviu Nicoara wrote:
>>> On 09/30/12 18:18, Martin Sebor wrote:
>>>> I see you did a 64-bit build while I did a 32-bit one, so
>>>> I tried 64-bits. The cached version (i.e., the one compiled
>>>> with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
>>>> as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).
>>>>
>>>> I had made one change to the test program that I thought might
>>>> account for the difference: I removed the call to abort from
>>>> the thread function since it was causing the process to exit
>>>> prematurely in some of my tests. But since you used the
>>>> modified program for your latest measurements that couldn't
>>>> be it.
>>>>
>>>> I can't explain the differences. They just don't make sense
>>>> to me. Your results should be the other way around. Can you
>>>> post the disassembly of function f() for each of the two
>>>> configurations of the test?
>>>
>>> The first thing that struck me in the cached `f' was that __string_ref
>>> class uses a mutex for synchronizing access to the ref counter. It turns
>>> out, for Linux on AMD64 we explicitly use a mutex instead of the atomic
>>> ops on the ref counter, via a block in rw/_config.h:
>>>
>>> #  if _RWSTD_VER_MAJOR < 5
>>> #    ifdef _RWSTD_OS_LINUX
>>>        // on Linux/AMD64, unless explicitly requested, disable the use
>>>        // of atomic operations in string for binary compatibility with
>>>        // stdcxx 4.1.x
>>> #      ifndef _RWSTD_USE_STRING_ATOMIC_OPS
>>> #        define _RWSTD_NO_STRING_ATOMIC_OPS
>>> #      endif   // _RWSTD_USE_STRING_ATOMIC_OPS
>>> #    endif   // _WIN32
>>> #  endif   // stdcxx < 5.0
>>>
>>>
>>> That is not the cause for the performance difference, though. Even after
>>> building with _RWSTD_USE_STRING_ATOMIC_OPS I get the same better
>>> performance with the non-cached version.
>>>
>>> Liviu
>
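
The three-way comparison proposed earlier in the thread (the same work done inline, through a virtual call, and under a mutex lock/unlock) could be sketched roughly as below. This is only an illustration under several assumptions: it uses C++11 std::thread and std::mutex rather than the pthreads and stdcxx code actually being measured, the arithmetic is an arbitrary stand-in for "simple stuff", and the names Work, g_work, loop_inline, loop_virtual, loop_locked and time_threads are all hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the "simple stuff"; not part of stdcxx.
struct Work {
    virtual ~Work() {}
    virtual unsigned long step(unsigned long acc, unsigned long i) const {
        return acc * 31 + i;
    }
};

Work g_default_work;
// Non-const global pointer, so the compiler is less likely to
// devirtualize the call in loop_virtual (it may still do so).
Work* g_work = &g_default_work;

// One mutex shared by every thread, as a ref-counted string body would be.
std::mutex g_mutex;

// Test 1: do the work inline.
unsigned long loop_inline(unsigned long n) {
    unsigned long acc = 0;
    for (unsigned long i = 0; i < n; ++i)
        acc = acc * 31 + i;
    return acc;
}

// Test 2: do the same work through a virtual function call.
unsigned long loop_virtual(unsigned long n) {
    unsigned long acc = 0;
    for (unsigned long i = 0; i < n; ++i)
        acc = g_work->step(acc, i);
    return acc;
}

// Test 3: lock and unlock a mutex around the same work.
unsigned long loop_locked(unsigned long n) {
    unsigned long acc = 0;
    for (unsigned long i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> lock(g_mutex);
        acc = acc * 31 + i;
    }
    return acc;
}

typedef unsigned long (*LoopFn)(unsigned long);

// Run fn(n) in nthreads threads and return the elapsed wall-clock seconds.
double time_threads(LoopFn fn, unsigned nthreads, unsigned long n) {
    std::vector<unsigned long> results(nthreads);  // keeps results observable
    std::vector<std::thread> threads;
    auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < nthreads; ++t)
        threads.emplace_back([&results, fn, n, t] { results[t] = fn(n); });
    for (auto& th : threads)
        th.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_threads once per loop function with the same thread count and N (for instance 16 threads, matching the box described above) would give the three numbers to compare. With optimization enabled the inline loop may be simplified quite aggressively, so the disassembly is still worth checking before trusting the ratios.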