river-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gregg Wonderly <gr...@wonderly.org>
Subject Re: Progress, and a problem
Date Mon, 13 Dec 2010 18:34:18 GMT
This does fail fairly quickly (immediately) on my windows laptop.

I am not sure that I have time to really look over this code.  I wonder if 
anyone knows if this is relatively new code that John put together as part of 
the effort to remove the use of PSE from outrigger, or is the original 
"non-persistent" javaspaces implementation?

Perhaps we need to do something different here, a segmented list for example, 
which is what PSE did with it's Vector implementation so that segments of the 
list could be locked independently, as well as allowing the segments to be 
"deleted" from disk once they were "empty".

Gregg

On 12/13/2010 12:16 AM, Patricia Shanahan wrote:
> On 12/12/2010 5:48 PM, Peter Firmstone wrote:
>> Patricia Shanahan wrote:
>>> On 12/3/2010 7:15 AM, Gregg Wonderly wrote:
>>> ...
>>> > The important issue in FastList is that it was written with the JDK1.4
>>> > memory model. After moving River to Java 1.5, we'd have the JSR166 work
>>> > and the new, consistent memory model where volatile has a true meaning.
>>> > However, this code in particular is quite complex as you have noted, so
>>> > even adjusting to the new memory model could be problematic.
>>>
>>> I've just run a modified, simplified version of my test with java
>>> version "1.4.2_19" and an unmodified copy of FastList, and I still get
>>> the NullPointerException. This changes my thinking a bit. I had been
>>> working from the assumption that the issue was to do with the changes
>>> in memory model between 1.4 and 1.5. I now have to consider the
>>> possibility of a more basic bug that is independent of the memory model.
>>>
>>> If there is anyone with a FastList or Java memory model background who
>>> would like to help, please reply. I would welcome another set of eyes
>>> on the code, and a cross check on my conclusions so far about how
>>> FastList is supposed to work. There seems to be a critical invariant
>>> that gets broken, and once that happens we are on track to either a
>>> NullPointerException or dropped items.
>>>
>>> I can supply my test as a unit test (JDK 1.6, Junit 4) and as a main
>>> program (JDK 1.4 or later0. In both forms, all it does is fire up a
>>> mixture of threads that repeatedly add items to a FastList and threads
>>> that repeatedly remove the first item they can from the FastList.
>>> Failures seem to require simultaneous adds and removes.
>>>
>>> If I don't nail this problem fairly soon, I may abandon the current,
>>> rather complicated, code and switch to writing a concurrent high
>>> performance FastList substitute for 1.5 or later.
>>>
>>> Patricia
>>>
>> I'll have a look tonight, no promises though ;)
>
> I'm attaching the simplified test application main program that can run, and
> fail, under JRE 1.4, with no need for Junit.
>
> Here's what I think I know. First of all, I have found some dubious
> synchronization situations. However, fixing all the things I have so far found
> of that type has only reduced the failure rate, not eliminated failures. That
> could be caused by changing timings without having any impact on the root cause.
>
> The key invariant relates to a thread that is doing a scan, starting with a call
> to head() and proceeding through a series of calls to next() to examine nodes.
> The head() call sets up a guard node for the thread that was the tail at some
> point during the head call. The invariant is that the series of next() calls
> will reach the guard node before finding a null next pointer, indicating the
> actual tail.
>
> The remove call does not really remove anything, it merely marks the node
> removed. Removed nodes are unlinked as a side effect of scans, during head and
> next calls, but only if they are not guard nodes.
>
> There are additional complications in the restart and reap methods, but we can
> ignore them for now - my test does not use them.
>
> Once a guard node is lost, the synchronization breaks down completely, because
> e.g. insertion at tail is protected by synchronization on the FastList instance,
> but unlinking of a removed node in the middle is protected, to the extent it is
> protected at all, by synchronization on the FastList.Node instance that is being
> removed.
>
> The commonest failure symptom is a scan reaching the null next pointer at the
> end of the FIFO during head(), without first finding the guard node it just set
> up. An alternative form of failure is loss of some entries - they get added, but
> the remove threads never find them. The second symptom is predominant in the
> JavaSpaces stress test that got me started on this. Messing up a next pointer
> could cause either.
>
> Incidentally, I'm curious about why it has such a fragile system in which the
> state of a scan is partly tracked by thread, when it seems like an obvious
> candidate for the Iterator pattern. Callers do need to be able to find out if a
> remove call succeeded or not (the node may have been removed by another thread),
> but that could be done in an interface extending Iterator. The WeakHashMap in a
> node that keeps track of the threads for which it is a guard would instead track
> the Iterator. There would be no need for thread local storage, the same data
> could be kept in the Iterator.
>
> Thanks for any time you can spend looking at this.
>
> Patricia
>
>


Mime
View raw message