From: "Travis Vitek"
Reply-To: stdcxx-dev@incubator.apache.org
Subject: RE: [PATCH] Use __rw_atomic_xxx() on Windows
Date: Wed, 5 Sep 2007 13:16:26 -0600
In-Reply-To: <46DE02EB.6010707@roguewave.com>
References: <46D453D9.8010503@epam.com> <46D64CE0.4000905@roguewave.com> <46DE02EB.6010707@roguewave.com>

Oh, yeah. That is the other thing that I did Friday. I wrote a testcase
to compare __rw_atomic_add32() against InterlockedIncrement() on Win32.
There is a performance penalty...

C:\Temp>t 2 && t 4 && t 8
---------- locked inc ---- atomic_add ---- 2 threads
ms               4266            4469
ms/op      0.00003178      0.00003330     -4.7586%
thr ms          18117           18437
thr ms/op  0.00013498      0.00013737     -1.7663%
---------- locked inc ---- atomic_add ---- 4 threads
ms               7969            8609
ms/op      0.00005937      0.00006414     -8.0311%
thr ms          36359           37019
thr ms/op  0.00027090      0.00027581     -1.8152%
---------- locked inc ---- atomic_add ---- 8 threads
ms               5016            5484
ms/op      0.00003737      0.00004086     -9.3301%
thr ms          60846           66130
thr ms/op  0.00045334      0.00049271     -8.6842%

C:\Temp>t 2 && t 4 && t 8
---------- locked inc ---- atomic_add ---- 2 threads
ms               2781            2906
ms/op      0.00002072      0.00002165     -4.4948%
thr ms          14961           16093
thr ms/op  0.00011147      0.00011990     -7.5663%
---------- locked inc ---- atomic_add ---- 4 threads
ms               2781            2891
ms/op      0.00002072      0.00002154     -3.9554%
thr ms          30867           31328
thr ms/op  0.00022998      0.00023341     -1.4935%
---------- locked inc ---- atomic_add ---- 8 threads
ms               2782            2890
ms/op      0.00002073      0.00002153     -3.8821%
thr ms          64318           64341
thr ms/op  0.00047921      0.00047938     -0.0358%

I will do a quick run using the string performance test after lunch.
I'll report the results on that later.

I've pasted the source for the bulk of my test below. If someone wants
the entire thing, let me know and I'll provide everything.

Travis

Martin Sebor wrote:
>Subject: Re: [PATCH] Use __rw_atomic_xxx() on Windows
>
>What's the status of this? We need to decide if we can put this
>in 4.2 or defer it for 4.2.1. To put it in 4.2 we need to make
>sure the new functions don't cause a performance regression in
>basic_string.
>I.e., we need to see the before and after numbers.
>
>Martin
>
>Martin Sebor wrote:
>>
>> One concern I have is performance. Does replacing the intrinsics with
>> out of line function call whose semantics the compiler has no idea
>> about have any impact on the runtime efficiency of the generated code?
>> I would be especially interested in "real life" scenarios such as the
>> usage of the atomic operations in basic_string.
>>
>> It would be good to see some before and after numbers. If you don't
>> have all the platforms to run the test post your benchmark and Travis
>> can help you put them together.
>

// The names of the four system headers were lost in the archived copy;
// the ones below are inferred from the functions the test calls.
#include <stdio.h>      // printf
#include <stdlib.h>     // atoi

#define WIN32_LEAN_AND_MEAN
#include <windows.h>    // GetTickCount, InterlockedIncrement, WaitForSingleObject
#include <process.h>    // _beginthread

#include "lib.h"

#define MIN_THREADS  2
#define MAX_THREADS 16

// time iters calls to InterlockedIncrement(), in milliseconds
unsigned long locked_inc(long* val, long iters)
{
    const unsigned long t0 = GetTickCount ();

    long n;
    for (n = 0; n < iters; ++n) {
        InterlockedIncrement(val);
    }

    const unsigned long t1 = GetTickCount ();
    return (t1 - t0);
}

// time iters calls to __rw_atomic_add32(), in milliseconds
unsigned long atomic_add(long* val, long iters)
{
    const unsigned long t0 = GetTickCount ();

    long n;
    for (n = 0; n < iters; ++n) {
        __rw_atomic_add32(val, 1);
    }

    const unsigned long t1 = GetTickCount ();
    return (t1 - t0);
}

struct thread_param
{
    // atomic variable
    long* variable;

    // number of iterations
    long iters;

    // function to invoke
    unsigned long (*fun)(long*, long);

    // result of function
    unsigned long result;

    // thread handle used by main thread
    HANDLE thread;
};

extern "C" {

void thread_func(void* p)
{
    thread_param* param = (thread_param*)p;
    param->result = (param->fun)(param->variable, param->iters);
}

} // extern "C"

// run fun in nthreads threads that all update the same variable;
// returns the sum of the per-thread times
unsigned long run_threads(int nthreads,
                          unsigned long (*fun)(long*, long),
                          long iters)
{
    thread_param params[MAX_THREADS];

    long thread_var = 0;

    int i;
    for (i = 0; i < nthreads; ++i) {
        params[i].variable = &thread_var;
        params[i].result   = 0;
        params[i].fun      = fun;
        params[i].iters    = iters;
    }

    int n;
    for (n = 0; n < nthreads; ++n) {
        params[n].thread =
            (HANDLE)_beginthread(thread_func, 0, &params[n]);
    }

    unsigned long thread_time = 0;
    for (n = 0; n < nthreads; ++n) {
        WaitForSingleObject (params[n].thread, INFINITE);
        thread_time += params[n].result;
    }

    return thread_time;
}

int main(int argc, char* argv[])
{
    int nthreads = MIN_THREADS;
    if (1 < argc)
        nthreads = atoi(argv[1]);

    // cap thread count
    if (nthreads < MIN_THREADS)
        nthreads = MIN_THREADS;
    else if (MAX_THREADS < nthreads)
        nthreads = MAX_THREADS;

    const long ops = 0x7ffffff;

    long thread_var;

    // single-threaded runs
    thread_var = 0;
    unsigned long locked_inc_ms = locked_inc (&thread_var, ops);

    thread_var = 0;
    unsigned long atomic_add_ms = atomic_add (&thread_var, ops);

    printf("---------- locked inc ---- atomic_add ---- %d threads\n",
           nthreads);
    printf("ms         %8.u        %8.u\n", locked_inc_ms, atomic_add_ms);

    float locked_inc_ops_p_ms = 1.f * locked_inc_ms / ops;
    float atomic_add_ops_p_ms = 1.f * atomic_add_ms / ops;

    printf("ms/op      %8.8f      %8.8f     %.4f%%\n",
           locked_inc_ops_p_ms, atomic_add_ops_p_ms,
           100.f * (locked_inc_ops_p_ms - atomic_add_ops_p_ms)
               / locked_inc_ops_p_ms);

    // do it with threads
    locked_inc_ms = run_threads(nthreads, locked_inc, ops);
    atomic_add_ms = run_threads(nthreads, atomic_add, ops);

    // average time per thread
    locked_inc_ms /= nthreads;
    atomic_add_ms /= nthreads;

    printf("thr ms     %8.u        %8.u\n", locked_inc_ms, atomic_add_ms);

    locked_inc_ops_p_ms = 1.f * locked_inc_ms / ops;
    atomic_add_ops_p_ms = 1.f * atomic_add_ms / ops;

    printf("thr ms/op  %8.8f      %8.8f     %.4f%%\n",
           locked_inc_ops_p_ms, atomic_add_ops_p_ms,
           100.f * (locked_inc_ops_p_ms - atomic_add_ops_p_ms)
               / locked_inc_ops_p_ms);

    return 0;
}
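
For anyone who wants to check the ms/op and percentage columns against the raw millisecond totals, here is a minimal sketch of the arithmetic. The helper names ms_per_op and relative_diff_percent are hypothetical and are not part of the test or of stdcxx. It also assumes double arithmetic, where the test uses float, so the last printed digit may differ. The percentage can be computed directly from the totals because the operation count cancels out of the ratio.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper (not in the original test): average cost of one
// operation in milliseconds, given the total elapsed time and op count.
double ms_per_op(unsigned long total_ms, long ops)
{
    assert(ops > 0);
    return static_cast<double>(total_ms) / ops;
}

// Hypothetical helper: difference of the candidate relative to the
// baseline, in percent. A negative value means the candidate took longer,
// which is the sign convention of the printf() calls in the test.
double relative_diff_percent(unsigned long baseline_ms,
                             unsigned long candidate_ms)
{
    assert(baseline_ms != 0);
    const double base = static_cast<double>(baseline_ms);
    const double cand = static_cast<double>(candidate_ms);
    return 100.0 * (base - cand) / base;
}

// Prints one row with the same three values as the "ms/op" rows of the
// test output (the exact column spacing is not reproduced).
void print_row(const char* label, unsigned long baseline_ms,
               unsigned long candidate_ms, long ops)
{
    std::printf("%-10s %.8f  %.8f  %.4f%%\n", label,
                ms_per_op(baseline_ms, ops),
                ms_per_op(candidate_ms, ops),
                relative_diff_percent(baseline_ms, candidate_ms));
}
```

With the first 2-thread result, relative_diff_percent(4266, 4469) comes out at roughly -4.76, in line with the reported figure, i.e. __rw_atomic_add32() was a little under 5% slower than InterlockedIncrement() in that run.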