From: Liviu Nicoara
Date: Sun, 16 Sep 2012 19:44:33 -0400
To: dev@stdcxx.apache.org
Subject: Re: STDCXX-1056 [was: Re: STDCXX forks]

On 9/16/12 3:20 AM, Stefan Teleman wrote:
> On Sat, Sep 15, 2012 at 4:53 PM, Liviu Nicoara wrote:
>
>> Now, to clear the confusion I created: the timing numbers I posted in
>> the attachment stdcxx-1056-timings.tgz to STDCXX-1066 (09/11/2012)
>> showed that a perfectly forwarding, no caching public interface
>> (exemplified by a changed grouping) performs better than the current
>> implementation. It was that test case that I hoped you could time,
>> perhaps on SPARC, in both MT and ST builds. The t.cpp program is for
>> MT, s.cpp for ST.
>
> I got your patch, and have tested it.
>
> I have created two Experiments (that's what they are called) with the
> SunPro Performance Analyzer. Both experiments are targeting race
> conditions and deadlocks in the instrumented program, and both
> experiments are running the 22.locale.numpunct.mt program from the
> stdcxx test harness. One experiment is with your patch applied. The
> other experiment is with our (Solaris) patch applied.
>
> Here are the results:

I looked at the analysis more closely.

> 1. with your patch applied:
>
> http://s247136804.onlinehome.us/22.locale.numpunct.mt.1.er.nts/

I see here (http://tinyurl.com/94pbmzc) that the implementation of the
facet public interface is forwarding, with no caching (see the sketch
below).

> 2. with our (Solaris) patch applied:
>
> http://s247136804.onlinehome.us/22.locale.numpunct.mt.1.er.ts/

Unfortunately, I can't do the same here.
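Just to make sure we mean the same thing by "forwarding, no caching,"
here is a minimal sketch of the two shapes of the public interface,
using a hypothetical numpunct-like facet (simplified, not the actual
stdcxx declarations):

    #include <string>

    struct punct_base {
        virtual ~punct_base () { }

        // in the real facet this is a protected virtual member; it is
        // public here only to keep the sketch short
        virtual std::string do_grouping () const { return ""; }
    };

    // forwarding, no caching: the public member simply calls through
    // to the virtual function on every call; nothing in the facet
    // object is written after construction, so readers have nothing
    // to race on
    struct forwarding_punct: punct_base {
        std::string grouping () const { return do_grouping (); }
    };

    // caching: the first call through the public member stores the
    // result in the facet object and later calls return the stored
    // copy; that lazy write is the kind of access the analyzer flags
    struct caching_punct: punct_base {
        mutable std::string cached_grouping_;
        mutable bool        cached_;

        caching_punct (): cached_ (false) { }

        std::string grouping () const {
            if (!cached_) {
                cached_grouping_ = do_grouping ();
                cached_          = true;
            }
            return cached_grouping_;
        }
    };

The changed grouping () in the patch is meant in the sense of the first
shape.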
Could you please refresh my memory: what does this patch contain? It is
not part of the patch set you published here earlier
(http://tinyurl.com/8pyql4g), is it?

AFAICT, the race accesses that the analyzer points out are writes to
shared locations which occur along the thread execution path. They do
not necessarily mean that a race condition exists, and in fact we know
that no race condition exists if the public facet interface forwards to
the protected virtual interface. That is what was tested in the first
analysis, looking at _numpunct.h: http://tinyurl.com/94pbmzc

Looking elsewhere in the first analysis, at the __rw_get_numpunct
function (the src link points here: http://tinyurl.com/8ez85e2), all
the highlighted lines, each performing a write to a shared location,
are potential race points, but they do not lead to race conditions
because of the proper synchronization we know occurs in the
__rw_setlocale class.

The number of race accesses in __rw_get_numpunct sums up to ~3400 with
the forwarding patch, as you pointed out in a later email. That number
was a bit puzzling, but looking at the thread function I see the test
uses the numpunct test suite code, which creates a locale and extracts
the facet from it in each iteration. That means that, ideally, for 4
threads iterating 10,000 times each, I would expect locales to be
created 40,000 times, and likewise for the facets and for the
__rw_get_numpunct calls, etc. The number of race accesses collected,
far less than that, could perhaps be explained by a lesser degree of
thread overlap: some threads start earlier, others later, and they only
partially overlap. If that is the case I would not ascribe much
importance to these numbers.

As I think was pointed out earlier, a numpunct facet is initialized on
the first trip through __rw_get_numpunct, and only that trip is
properly synchronized. All subsequent trips through __rw_get_numpunct
find the facet data already there; they just read it, with no
synchronization needed, and return it. Therefore, the cost of
initialization/synchronization is paid only once (see the sketch in the
P.S. below).

Thanks.
Liviu
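P.S. For clarity, here is roughly the shape of the initialize-once,
read-thereafter pattern I am describing above. The names are
hypothetical, and the sketch uses C++11 atomics so that it stands on
its own; it is not the actual __rw_get_numpunct code, which relies on
the synchronization in the __rw_setlocale class mentioned above.

    #include <atomic>
    #include <mutex>
    #include <string>

    // stand-in for the lazily initialized facet data
    struct numpunct_data {
        std::string grouping;
        char        decimal_point;
    };

    static numpunct_data     punct_data;           // shared data
    static std::atomic<bool> punct_ready (false);  // set once
    static std::mutex        punct_lock;

    // The first caller pays for the initialization under the lock;
    // every later caller sees the flag already set and only reads
    // the data.
    const numpunct_data& get_numpunct_data ()
    {
        if (!punct_ready.load (std::memory_order_acquire)) {
            std::lock_guard<std::mutex> guard (punct_lock);
            if (!punct_ready.load (std::memory_order_relaxed)) {
                punct_data.grouping      = "\3";   // e.g., groups of 3
                punct_data.decimal_point = '.';
                punct_ready.store (true, std::memory_order_release);
            }
        }
        return punct_data;
    }

The point here is only that the locking cost is confined to the first
trip through the function; after that, callers only read.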