Mailing-List: contact dev-help@river.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@river.apache.org
Received-SPF: pass (athena.apache.org: domain of gergg@cox.net designates
 68.230.241.217 as permitted sender)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 6.3 \(1503\))
Subject: Re: Next steps after 2.2.1 release
From: Gregg Wonderly <gergg@cox.net>
In-Reply-To: <1365452351.2489.21.camel@Nokia-N900-51-1>
Date: Wed, 10 Apr 2013 11:00:29 -0500
Content-Transfer-Encoding: quoted-printable
Message-Id: <96B42E47-E2C9-40B4-A776-0EF2E827BD15@cox.net>
References: <CD7EEBB5.54F13%bryan@systap.com> <5159BD5E.3070709@acm.org>
 <CAMPUXz_nvvOGC49NYVkHnuAY9AZppUCA=ffEcd-xXR6JE_jdQw@mail.gmail.com>
 <515A52C3.2060403@acm.org> <515A860F.3070603@zeus.net.au>
 <515A8AEF.5090102@zeus.net.au>
 <68139229-B637-4B8C-A174-DFBA996E596F@gmail.com>
 <515AC61D.2000003@zeus.net.au>
 <36D91857-5A31-4257-B85F-619F5DCD724E@gmail.com>
 <1365023400.4791.11.camel@Nokia-N900-51-1>
 <CAMPUXz8cQ5wX+qsKJYmiu8+=37-m5n-ovuryfbQqR6Hm9uZP3w@mail.gmail.com>
 <39E52F0A-5F67-4E61-90EF-EFCF8CFF23CF@gmail.com>
 <CAMPUXz_FjrrUjR5TVAn7WW8hhLrCd7uNTPg3Gk-h_uPm2UbHHQ@mail.gmail.com>
 <2ACB340C-5F97-4A64-BC75-88341D5A0534@trasuk.com>
 <3514D0F7-A46B-4BE6-AD0F-31972AB66C91@trasuk.com>
 <1365371691.8466.11.camel@Nokia-N900-51-1>
 <1365379389.13775.314.camel@cameron>  <5162C1E6.6030004@wonderly.org>
 <1365452351.2489.21.camel@Nokia-N900-51-1>
To: dev@river.apache.org,
 Peter <jini@zeus.net.au>

I just want to extend this conversation a bit by saying that nearly =
everything about River is "concurrently accessed".  There are, of course =
several places, where work is done by one thread, at a time, but new =
threads are created to do that work, and that means that "visibility" =
has to be considered.

I won't say that every single field in every class in River needs to be =
final or volatile, but that should not be considered an extreme.  =
Specifically, you might see code execute just fine without appropriate =
concurrency design, and then it will suddenly break when a new =
optimization appears on the scene, reordering something under the covers =
and creating an intangible behavior.  Some "visibility bugs" might not =
ever manifest because of other "happens before" and "cache line sync" =
activities that happen implicitly based on the "current design" or =
"thread model".  We can "be happy" with "it ain't broke, so don't fix =
it", but I don't think that's very productive.

I personally, have been beating on various parts of Jini in my "fork" =
because of completely unpredictable results in discovery and discovery =
management.  I've written, rewritten, debugged and stared at that code =
till I was blue in the face, because my ServiceUI desktop application =
just doesn't behave like it should.  Some of it is missing lifecycle =
management that was not in the original services, because =
System.registerShutdownHook() hasn't been used.  But other parts are =
these race conditions and thread scheduling overlaps (or underlaps) =
which keep discovery and notification from happening reliably.   There =
are lots of different reasons why people might not be "complaining" =
about this stuff, but I would contend that the fact that there are many =
examples of people forking and extending Jini, which to me, reflects the =
fact that there are things that aren't correct, or functional in the =
wild, and this causes them to jump over the cliff and never look back.

We are at that point today, and Peter's continued slogging through the =
motions to track down and discover where the issues actually are, is an =
astronomical effort!  I have been very involved in several different, =
new work opportunities that have kept me from jumping in to participate =
in dealing with all of these issues, as I have really wanted to. =20

Gregg Wonderly

On Apr 8, 2013, at 3:19 PM, Peter <jini@zeus.net.au> wrote:

> Thanks Gregg,
>=20
> You've hit the nail on the head, this is exactly the issue I'm having.
>=20
> So I've been fixing safe publication in constructors by making fields =
final or volatile and ensuring "this" doesn't escape, fixing =
synchronisation on collections etc during method calls.
>=20
> To fix deadlock, I investigate immutable non blocking data structures =
with volatile publication, if future state doesn't depend on previous =
state, if it does a CAS atomic reference can be used instead of =
volatile.
>=20
> Often i find synchronization is quite acceptable if it is limited in =
scope, if synchronized or holding a lock while a thread is executing =
outside your objects scope of control, that's when deadlock is more =
likely to occur.
>=20
> The polciy providers were deadlock prone, which is why they're mostly =
immutable non blocking now, any synchronization or locking is limited.
>=20
> I basically follow Doug Lea's concurrency in practise guidelines.
>=20
> For debugging I follow Cliff Click's reccommendations.
>=20
> Unfortunately fixing concurrency bugs means finding a trace of =
execution, identifying all classes and inspecting the code visually.  =
Findbugs identifies cases of inadequate sychronization using static =
analysis.
>=20
> Regards,
>=20
> Peter.
>=20
> ----- Original message -----
>> On 4/7/2013 7:03 PM, Greg Trasuk wrote:
>>> I'm honestly and truly not passing judgement on the quality of the =
code. I
>>> honestly don't know if it's good or bad. I have to confess that, =
given that
>>> Jini was written as a top-level project at Sun, sponsored by Bill =
Joy, when
>>> Sun was at the top of its game, and the Jini project team was a =
"who's-who" of
>>> distributed computing pioneers, the idea that it's riddled with =
concurrency
>>> bugs surprises me. But mainly, I'm still trying to answer that =
question - "How
>>> do I know if it's good?" Here's what I'm doing: - I'm attempting to =
run the
>>> tests from "tags/2.2.0" against the "2.2" branch. When I have =
confidence in
>>> the "2.2" branch, I'll publish the results, ask anyone else who's =
interested
>>> to test it, and then call for a release on "2.2.1" - After that, the
>>> developers need to reach consensus about how to move forward. =
Cheers, Greg.
>>=20
>> This is an important issue to address.  I know a lot of people here =
probably
>> don't participate on the Concurrency-interest mailing list that has a =
wide range
>> of discussion about the JLS vs the JMM and what the JIT compilers =
actually do to
>> code these days.
>>=20
>> The number one issue that you need to understand, is that the =
optimizer is
>> working against you more and more these days if you don't have JMM =
details
>> exactly write.  Statements are being reordered more and more, =
including actual
>> "assignments" which can expose uninitialized data items in "racy" =
concurrent
>> code.  The latest example is the  Thread.setName()/Thread.getName() =
pair.  They
>> are most likely always to be accessed by "other threads", yet there =
is no
>> synchronization on them, including no "visibility" control with =
volatile even.=20
>> What this means, is that if setName() and getName() are being called =
in a racy
>> environment, the setName, will assign the array that is created to =
copy the
>> characters into, before the arraycopy of the data occurs, potentially =
exposing
>> an uninitialized name to getName().
>>=20
>> There are literally hundreds of places in the JDK that still have =
these kinds of
>> races going on, and no one at Oracle, based on how people are acting, =
appears to
>> be responsible for dealing with it. The Jini code, has many many of =
the same
>> issues that just randomly appear in stress cases on "slower" or =
"faster"
>> hardware, depending on the issue.
>>=20
>> When you haven't got sharing and visibility covered correctly, the =
JIT code
>> rewrites can make execution order play a big part in conflating what =
you "see"
>> happening verses what the "code" says, to you, should happen.
>>=20
>> There are some very simple things to get the JIT out of the picture.  =
One of
>> these, is to actually open the source up in an IDE and declare every =
field
>> final.  If that doesn't work due to 'mutation' of values, change =
those fields to
>> 'volatile' so that it will compile again.    Then run your tests and =
you will now
>> greatly diminish reordering and visibility issues so that you can =
just get to
>> the simple "was it set correctly, before it was read" and "did we =
provide the
>> correct atomicity for that update" kinds of questions that will help =
you
>> understand things better when code is misbehaving.
>>=20
>> This is the kind of thing that Peter has been working through because =
the usage
>> of the code in real life has not continued in the same way that it =
did when the
>> code was written, and the JMM in JDK5 has literally broken so much =
software, all
>> over the planet, that used to work quite well, because there wasn't a =
formal
>> definition of "happens before".    Now that there is, the compiler =
optimizations
>> are against you if you don't get it right.  The behaviors you will =
experience,
>> because of reorderings that are targeted at all out performance =
(minimize
>> traffic in and out of the CPU through memory subsystems), can create =
completely
>> unexpected results.  Intra-thread semantics are kept correct, but =
inter-thread
>> execution will just seem intangible because stuff will not be =
happening in the
>> order the "code" says it should.
>>=20
>> Gregg Wonderly
>>=20
>=20