river-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patricia Shanahan <p...@acm.org>
Subject Re: Progress, and a problem
Date Mon, 13 Dec 2010 19:09:25 GMT
In fairness to the implementation, it is a tuned test, designed and  
tweaked to fail quickly. I started from a single failure of an  
outrigger stress test that almost always passes. This program has a  
couple of hundred threads hitting a FastList continuously with the mix  
of actions it would see, at a much lower rate and with fewer threads,  
in the QA test.

Yes, splitting the list should be considered.


On Dec 13, 2010, at 10:34, Gregg Wonderly <gregg@wonderly.org> wrote:

> This does fail fairly quickly (immediately) on my windows laptop.
> I am not sure that I have time to really look over this code.  I  
> wonder if anyone knows if this is relatively new code that John put  
> together as part of the effort to remove the use of PSE from  
> outrigger, or is the original "non-persistent" javaspaces  
> implementation?
> Perhaps we need to do something different here, a segmented list for  
> example, which is what PSE did with it's Vector implementation so  
> that segments of the list could be locked independently, as well as  
> allowing the segments to be "deleted" from disk once they were  
> "empty".
> Gregg
> On 12/13/2010 12:16 AM, Patricia Shanahan wrote:
>> On 12/12/2010 5:48 PM, Peter Firmstone wrote:
>>> Patricia Shanahan wrote:
>>>> On 12/3/2010 7:15 AM, Gregg Wonderly wrote:
>>>> ...
>>>> > The important issue in FastList is that it was written with the  
>>>> JDK1.4
>>>> > memory model. After moving River to Java 1.5, we'd have the  
>>>> JSR166 work
>>>> > and the new, consistent memory model where volatile has a true  
>>>> meaning.
>>>> > However, this code in particular is quite complex as you have  
>>>> noted, so
>>>> > even adjusting to the new memory model could be problematic.
>>>> I've just run a modified, simplified version of my test with java
>>>> version "1.4.2_19" and an unmodified copy of FastList, and I  
>>>> still get
>>>> the NullPointerException. This changes my thinking a bit. I had  
>>>> been
>>>> working from the assumption that the issue was to do with the  
>>>> changes
>>>> in memory model between 1.4 and 1.5. I now have to consider the
>>>> possibility of a more basic bug that is independent of the memory  
>>>> model.
>>>> If there is anyone with a FastList or Java memory model  
>>>> background who
>>>> would like to help, please reply. I would welcome another set of  
>>>> eyes
>>>> on the code, and a cross check on my conclusions so far about how
>>>> FastList is supposed to work. There seems to be a critical  
>>>> invariant
>>>> that gets broken, and once that happens we are on track to either a
>>>> NullPointerException or dropped items.
>>>> I can supply my test as a unit test (JDK 1.6, Junit 4) and as a  
>>>> main
>>>> program (JDK 1.4 or later0. In both forms, all it does is fire up a
>>>> mixture of threads that repeatedly add items to a FastList and  
>>>> threads
>>>> that repeatedly remove the first item they can from the FastList.
>>>> Failures seem to require simultaneous adds and removes.
>>>> If I don't nail this problem fairly soon, I may abandon the  
>>>> current,
>>>> rather complicated, code and switch to writing a concurrent high
>>>> performance FastList substitute for 1.5 or later.
>>>> Patricia
>>> I'll have a look tonight, no promises though ;)
>> I'm attaching the simplified test application main program that can  
>> run, and
>> fail, under JRE 1.4, with no need for Junit.
>> Here's what I think I know. First of all, I have found some dubious
>> synchronization situations. However, fixing all the things I have  
>> so far found
>> of that type has only reduced the failure rate, not eliminated  
>> failures. That
>> could be caused by changing timings without having any impact on  
>> the root cause.
>> The key invariant relates to a thread that is doing a scan,  
>> starting with a call
>> to head() and proceeding through a series of calls to next() to  
>> examine nodes.
>> The head() call sets up a guard node for the thread that was the  
>> tail at some
>> point during the head call. The invariant is that the series of next 
>> () calls
>> will reach the guard node before finding a null next pointer,  
>> indicating the
>> actual tail.
>> The remove call does not really remove anything, it merely marks  
>> the node
>> removed. Removed nodes are unlinked as a side effect of scans,  
>> during head and
>> next calls, but only if they are not guard nodes.
>> There are additional complications in the restart and reap methods,  
>> but we can
>> ignore them for now - my test does not use them.
>> Once a guard node is lost, the synchronization breaks down  
>> completely, because
>> e.g. insertion at tail is protected by synchronization on the  
>> FastList instance,
>> but unlinking of a removed node in the middle is protected, to the  
>> extent it is
>> protected at all, by synchronization on the FastList.Node instance  
>> that is being
>> removed.
>> The commonest failure symptom is a scan reaching the null next  
>> pointer at the
>> end of the FIFO during head(), without first finding the guard node  
>> it just set
>> up. An alternative form of failure is loss of some entries - they  
>> get added, but
>> the remove threads never find them. The second symptom is  
>> predominant in the
>> JavaSpaces stress test that got me started on this. Messing up a  
>> next pointer
>> could cause either.
>> Incidentally, I'm curious about why it has such a fragile system in  
>> which the
>> state of a scan is partly tracked by thread, when it seems like an  
>> obvious
>> candidate for the Iterator pattern. Callers do need to be able to  
>> find out if a
>> remove call succeeded or not (the node may have been removed by  
>> another thread),
>> but that could be done in an interface extending Iterator. The  
>> WeakHashMap in a
>> node that keeps track of the threads for which it is a guard would  
>> instead track
>> the Iterator. There would be no need for thread local storage, the  
>> same data
>> could be kept in the Iterator.
>> Thanks for any time you can spend looking at this.
>> Patricia

View raw message