Subject: Re: Next steps after 2.2.1 release
From: Gregg Wonderly
Date: Wed, 10 Apr 2013 11:00:29 -0500
To: dev@river.apache.org, Peter

I just want to extend this conversation a bit by saying that nearly everything about River is "concurrently accessed". There are, of course, several places where work is done by one thread at a time, but new threads are created to do that work, and that means that "visibility" has to be considered.

I won't say that every single field in every class in River needs to be final or volatile, but that should not be considered an extreme. Specifically, you might see code execute just fine without appropriate concurrency design, and then it will suddenly break when a new optimization appears on the scene, reordering something under the covers and creating an intangible behavior.
Some "visibility bugs" might never manifest because of other "happens before" and "cache line sync" activities that happen implicitly based on the "current design" or "thread model". We can "be happy" with "it ain't broke, so don't fix it", but I don't think that's very productive.

I personally have been beating on various parts of Jini in my "fork" because of completely unpredictable results in discovery and discovery management. I've written, rewritten, debugged and stared at that code till I was blue in the face, because my ServiceUI desktop application just doesn't behave like it should. Some of it is missing lifecycle management that was not in the original services, because Runtime.getRuntime().addShutdownHook() hasn't been used. But other parts are these race conditions and thread scheduling overlaps (or underlaps) which keep discovery and notification from happening reliably. There are lots of different reasons why people might not be "complaining" about this stuff, but I would contend that the many examples of people forking and extending Jini reflect the fact that there are things that aren't correct or functional in the wild, and this causes them to jump over the cliff and never look back.

We are at that point today, and Peter's continued slogging through the motions to track down and discover where the issues actually are is an astronomical effort! I have been very involved in several different new work opportunities that have kept me from jumping in to participate in dealing with all of these issues, as I have really wanted to.

Gregg Wonderly

On Apr 8, 2013, at 3:19 PM, Peter wrote:

> Thanks Gregg,
>
> You've hit the nail on the head, this is exactly the issue I'm having.
>
> So I've been fixing safe publication in constructors by making fields final or volatile and ensuring "this" doesn't escape, and fixing synchronisation on collections etc. during method calls.
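
As a rough illustration of the safe-publication fixes Peter describes here (final fields, volatile for fields that must change, and no escape of "this" from the constructor), a minimal sketch follows. The class ServiceRecord and all of its members are hypothetical names made up for this example; nothing here is taken from the River code base.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class for illustration only; not River code.
final class ServiceRecord {
    // final: once the constructor completes, any thread that obtains the
    // object sees these values without further synchronization.
    private final String name;
    private final List<String> groups;

    // Mutable after construction, so declared volatile for visibility.
    private volatile long lastSeenMillis;

    // Private constructor: it only assigns fields and never hands "this"
    // to a listener, a thread, or any other object.
    private ServiceRecord(String name, List<String> groups) {
        this.name = name;
        // Defensive copy, so later changes to the caller's list are not shared.
        this.groups = Collections.unmodifiableList(new ArrayList<String>(groups));
    }

    // Factory method: any registration with listeners or thread start-up
    // would go here, after construction has finished.
    static ServiceRecord create(String name, List<String> groups) {
        ServiceRecord record = new ServiceRecord(name, groups);
        return record;
    }

    String name() { return name; }

    List<String> groups() { return groups; }

    long lastSeenMillis() { return lastSeenMillis; }

    // A single volatile write: visible to other threads, but not a
    // read-modify-write, so no atomicity beyond the write itself is implied.
    void touch(long nowMillis) { lastSeenMillis = nowMillis; }
}
```

The factory method is one common way to keep "this" from escaping; whether it suits a given River class is a separate judgment.
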
>
> To fix deadlock, I investigate immutable non-blocking data structures with volatile publication if future state doesn't depend on previous state; if it does, a CAS atomic reference can be used instead of volatile.
>
> Often I find synchronization is quite acceptable if it is limited in scope; if you are synchronized or holding a lock while a thread is executing outside your object's scope of control, that's when deadlock is more likely to occur.
>
> The policy providers were deadlock prone, which is why they're mostly immutable and non-blocking now; any synchronization or locking is limited.
>
> I basically follow Doug Lea's concurrency in practise guidelines.
>
> For debugging I follow Cliff Click's recommendations.
>
> Unfortunately fixing concurrency bugs means finding a trace of execution, identifying all classes and inspecting the code visually. FindBugs identifies cases of inadequate synchronization using static analysis.
>
> Regards,
>
> Peter.
>
> ----- Original message -----
>> On 4/7/2013 7:03 PM, Greg Trasuk wrote:
>>> I'm honestly and truly not passing judgement on the quality of the code. I honestly don't know if it's good or bad. I have to confess that, given that Jini was written as a top-level project at Sun, sponsored by Bill Joy, when Sun was at the top of its game, and the Jini project team was a "who's-who" of distributed computing pioneers, the idea that it's riddled with concurrency bugs surprises me. But mainly, I'm still trying to answer that question - "How do I know if it's good?" Here's what I'm doing:
>>> - I'm attempting to run the tests from "tags/2.2.0" against the "2.2" branch. When I have confidence in the "2.2" branch, I'll publish the results, ask anyone else who's interested to test it, and then call for a release on "2.2.1"
>>> - After that, the developers need to reach consensus about how to move forward.
>>>
>>> Cheers, Greg.
>>
>> This is an important issue to address. I know a lot of people here probably don't participate on the Concurrency-interest mailing list that has a wide range of discussion about the JLS vs the JMM and what the JIT compilers actually do to code these days.
>>
>> The number one issue that you need to understand is that the optimizer is working against you more and more these days if you don't have JMM details exactly right. Statements are being reordered more and more, including actual "assignments" which can expose uninitialized data items in "racy" concurrent code. The latest example is the Thread.setName()/Thread.getName() pair. They are most likely always to be accessed by "other threads", yet there is no synchronization on them, including no "visibility" control with volatile even. What this means is that if setName() and getName() are being called in a racy environment, setName() can assign the array that is created to copy the characters into before the arraycopy of the data occurs, potentially exposing an uninitialized name to getName().
>>
>> There are literally hundreds of places in the JDK that still have these kinds of races going on, and no one at Oracle, based on how people are acting, appears to be responsible for dealing with it. The Jini code has many, many of the same issues that just randomly appear in stress cases on "slower" or "faster" hardware, depending on the issue.
>>
>> When you haven't got sharing and visibility covered correctly, the JIT code rewrites can make execution order play a big part in conflating what you "see" happening versus what the "code" says, to you, should happen.
>>
>> There are some very simple things to get the JIT out of the picture. One of these is to actually open the source up in an IDE and declare every field final.
>> If that doesn't work due to 'mutation' of values, change those fields to 'volatile' so that it will compile again. Then run your tests and you will now greatly diminish reordering and visibility issues so that you can just get to the simple "was it set correctly, before it was read" and "did we provide the correct atomicity for that update" kinds of questions that will help you understand things better when code is misbehaving.
>>
>> This is the kind of thing that Peter has been working through because the usage of the code in real life has not continued in the same way that it did when the code was written, and the JMM in JDK5 has literally broken so much software, all over the planet, that used to work quite well, because there wasn't a formal definition of "happens before". Now that there is, the compiler optimizations are against you if you don't get it right. The behaviors you will experience, because of reorderings that are targeted at all-out performance (minimize traffic in and out of the CPU through memory subsystems), can create completely unexpected results. Intra-thread semantics are kept correct, but inter-thread execution will just seem intangible because stuff will not be happening in the order the "code" says it should.
>>
>> Gregg Wonderly
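
The thread separates two questions: visibility ("was it set correctly, before it was read") and atomicity of an update. The sketch below shows one way that split might look in code, following Peter's rule of thumb: publish an immutable object through a volatile field when the new state does not depend on the old one, and use a compare-and-set loop on an AtomicReference when it does. Snapshot, StateHolder and CasDemo are hypothetical names invented for this example, not River classes.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical immutable value object; all fields are final.
final class Snapshot {
    final int count;
    final String label;

    Snapshot(int count, String label) {
        this.count = count;
        this.label = label;
    }
}

// Hypothetical holder showing both publication styles side by side.
final class StateHolder {
    // Case 1: the new state is independent of the old one, so a plain
    // volatile write of an immutable object is sufficient.
    private volatile Snapshot latest = new Snapshot(0, "initial");

    // Case 2: the new state is derived from the old one, so a volatile
    // write alone could lose updates; compare-and-set is used instead.
    private final AtomicReference<Snapshot> counted =
            new AtomicReference<Snapshot>(new Snapshot(0, "initial"));

    void setLatest(Snapshot s) { latest = s; }

    Snapshot latest() { return latest; }

    Snapshot increment(String label) {
        while (true) {
            Snapshot old = counted.get();
            Snapshot next = new Snapshot(old.count + 1, label);
            // Succeeds only if no other thread replaced 'old' in the meantime;
            // otherwise loop and recompute from the fresh value.
            if (counted.compareAndSet(old, next)) {
                return next;
            }
        }
    }

    Snapshot counted() { return counted.get(); }
}

// Small driver: several threads increment concurrently without locks.
final class CasDemo {
    static int run(int threads, final int perThread) {
        final StateHolder holder = new StateHolder();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int j = 0; j < perThread; j++) {
                        holder.increment("worker");
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return holder.counted().count;
    }
}
```

Neither path holds a lock while calling into foreign code, which is the situation Peter identifies as deadlock prone. The trade-off is that a contended compare-and-set loop may retry, so it fits small, cheap state transitions best.
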