incubator-stdcxx-dev mailing list archives

From Martin Sebor <mse...@gmail.com>
Subject Re: Fwd: Re: STDCXX-1071 numpunct facet defect
Date Sun, 30 Sep 2012 22:18:57 GMT
I see you did a 64-bit build while I did a 32-bit one. So
I tried a 64-bit build. The cached version (i.e., the one compiled
with -UNO_USE_NUMPUNCT_CACHE) is still about twice as fast
as the non-cached one (compiled with -DNO_USE_NUMPUNCT_CACHE).
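
For readers without the attachment, here is a minimal sketch of what the two configurations might look like. It is not the stdcxx source or the attached test program: the class toy_numpunct and its members are hypothetical, and only the macro name NO_USE_NUMPUNCT_CACHE and the idea of an inline grouping() that calls a virtual function come from this thread.

```cpp
#include <string>

// Hypothetical stand-in for a numpunct-like facet (assumption, not stdcxx code).
class toy_numpunct {
public:
    toy_numpunct() : cached_(false), calls_(0) {}
    virtual ~toy_numpunct() {}

    // Inline public accessor. Depending on the macro it either returns a
    // cached copy or forwards to the virtual function on every call.
    std::string grouping() const {
#ifndef NO_USE_NUMPUNCT_CACHE
        // Illustrative only: this lazy fill is unsynchronized, so a first
        // call made concurrently from several threads would be a data race.
        if (!cached_) {
            grouping_ = do_grouping();
            cached_ = true;
        }
        return grouping_;
#else
        return do_grouping();
#endif
    }

    // Number of times the virtual function has been invoked.
    int virtual_calls() const { return calls_; }

protected:
    virtual std::string do_grouping() const {
        ++calls_;
        return "\3";   // groups of three digits
    }

private:
    mutable std::string grouping_;
    mutable bool cached_;
    mutable int calls_;
};
```

Building the same file once as is and once with -DNO_USE_NUMPUNCT_CACHE gives the two variants being timed. In the cached build the virtual function runs once per facet object, in the other build once per call.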

I had made one change to the test program that I thought might
account for the difference: I removed the call to abort from
the thread function since it was causing the process to exit
prematurely in some of my tests. But since you used the
modified program for your latest measurements that couldn't
be it.

I can't explain the differences. They just don't make sense
to me. Your results should be the other way around. Can you
post the disassembly of function f() for each of the two
configurations of the test?

Martin

On 09/30/2012 03:30 PM, Liviu Nicoara wrote:
> On 9/30/12 2:21 PM, Liviu Nicoara wrote:
>> Forwarding with the attachment.
>>
>> -------- Original Message --------
>> Subject: Re: STDCXX-1071 numpunct facet defect
>> Date: Sun, 30 Sep 2012 12:09:10 -0600
>> From: Martin Sebor <msebor@gmail.com>
>> To: Liviu Nicoara <nikkoara@hates.ms>
>>
>>> On 9/27/12 8:27 PM, Martin Sebor wrote:
>>
>> Here are my timings for library-reduction.cpp when compiled
>> with GCC 4.5.3 on Solaris 10 (4 SPARCV9 CPUs). I had to make a small
>> number of trivial changes to get it to compile:
>>
>>         With cache    No cache
>> real    1m38.332s     8m58.568s
>> user    6m30.244s     34m25.942s
>> sys     0m0.060s      0m3.922s
>>
>> I also experimented with the program on Linux (CEL 4 with 16
>> CPUs). Initially, I saw no differences between the two versions.
>> So I modified it a bit to make it closer to the library (the
>> modified program is attached). With those changes the timings
>
> I see the difference -- your program has a virtual function it calls
> from the inline grouping function.
>
>> are below:
>>
>>         With cache    No cache
>> real    0m 1.107s     0m 5.669s
>> user    0m17.204s     0m 5.669s
>> sys     0m 0.000s     0m22.347s
>>
>> I also recompiled and re-ran the test on Solaris. To speed
>> things along, I set the number of threads and loops to 8 and
>> 1000000, respectively. The numbers are as follows:
>>
>>         With cache    No cache
>> real    0m3.341s      0m26.333s
>> user    0m13.052s     1m37.470s
>> sys     0m0.009s      0m0.132s
>>
>> The numbers match my expectation. The overhead without the
>> "numpunct cache" is considerable.
>
> I have done another (smaller) round of measurements, this time using the
> test program you posted. Here are the results:
>
> * iMac, 4x Intel, 12S:
>
> 16, 10000000:
>
>         Cached        Not cached
> real    0m9.300s      0m5.224s
> user    0m36.441s     0m20.523s
> sys     0m0.043s      0m0.068s
>
> * iMac, 4x Intel, 12D:
>
>         Cached        Not cached
> real    0m9.012s      0m5.774s
> user    0m35.343s     0m20.997s
> sys     0m0.045s      0m0.183s
>
> * Linux Slackware, 16x AMD Opteron, 12S:
>
> 16, 10000000:
>
>         Cached        Not cached
> real    0m29.798s     0m3.278s
> user    0m48.662s     0m47.338s
> sys     6m18.525s     0m3.298s
>
>>
>> Somewhat unexpectedly, the test with the cache didn't crash.
>
> On my iMac it did not crash for me either (gcc 4.5.4), this time. On the
> other box (gcc 4.5.2) it crashed every time with caching, so I had to add a
> call to fac.grouping outside the thread function to initialize the "facet".
>
> Liviu
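
The workaround described above (calling grouping once before any thread starts) can be sketched as follows. This is only an illustration under stated assumptions: the attached test program is not reproduced in this message, so the function name run_grouping_test is hypothetical, and std::thread with the standard std::numpunct<char> facet is used here for brevity rather than whatever threading API and facet the original test used.

```cpp
#include <cstddef>
#include <locale>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: starts `nthreads` threads that each call grouping()
// `nloops` times and returns how many calls disagreed with the warm-up value.
std::size_t run_grouping_test(unsigned nthreads, unsigned long nloops) {
    const std::locale loc;   // copy of the global locale; keeps the facet alive
    const std::numpunct<char>& fac =
        std::use_facet<std::numpunct<char> >(loc);

    // Warm-up call on the main thread, before any worker exists, so that
    // any lazy initialization inside the facet happens single-threaded.
    const std::string expected = fac.grouping();

    std::vector<std::size_t> mismatches(nthreads, 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([&fac, &expected, &mismatches, t, nloops] {
            std::size_t bad = 0;
            for (unsigned long i = 0; i < nloops; ++i) {
                if (fac.grouping() != expected) ++bad;
            }
            mismatches[t] = bad;   // each thread writes only its own slot
        });
    }
    for (std::thread& th : threads) th.join();

    std::size_t total = 0;
    for (std::size_t m : mismatches) total += m;
    return total;
}
```

A call such as run_grouping_test(16, 10000000) would correspond to the "16, 10000000" thread and loop counts quoted above. The warm-up call hides a race in lazy initialization rather than fixing it, so it is suitable for taking measurements but not as a remedy for the defect itself.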

